Untitled

a guest
Mar 11th, 2022
INFO:__main__:***** Running training *****
INFO:__main__: Num examples = 1709784
INFO:__main__: Num Epochs = 4.0
INFO:__main__: Instantaneous batch size per device = 128
INFO:__main__: Total train batch size = 1024
Epoch 1/4
Traceback (most recent call last):
  File "/content/scripts/run_mlm.py", line 562, in <module>
    main()
  File "/content/scripts/run_mlm.py", line 537, in main
    callbacks=[SavePretrainedCallback(output_dir=training_args.output_dir)],
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1191, in _numpy
    raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: 6 root error(s) found.
  (0) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
    [[{{node StatefulPartitionedCall}}]]
    [[MultiDeviceIteratorGetNextFromShard]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
    [[tpu_compile_succeeded_assert/_14717574637004917986/_7/_463]]
  (1) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
    [[{{node StatefulPartitionedCall}}]]
    [[MultiDeviceIteratorGetNextFromShard]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
    [[cond/pivot_t/_4/_83]]
  (2) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
    [[{{node StatefulPartitionedCall}}]]
    [[MultiDeviceIteratorGetNextFromShard]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
    [[cond/else/_1/cond/IteratorGetNext_7/_160]]
  (3) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CP ... [truncated]
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2611, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 694, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.InternalError: 6 root error(s) found.
  (0) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
    [[{{node StatefulPartitionedCall}}]]
    [[MultiDeviceIteratorGetNextFromShard]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
    [[tpu_compile_succeeded_assert/_14717574637004917986/_7/_463]]
  (1) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
    [[{{node StatefulPartitionedCall}}]]
    [[MultiDeviceIteratorGetNextFromShard]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
    [[cond/pivot_t/_4/_83]]
  (2) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
    [[{{node StatefulPartitionedCall}}]]
    [[MultiDeviceIteratorGetNextFromShard]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
    [[cond/else/_1/cond/IteratorGetNext_7/_160]]
  (3) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CP ... [truncated]
2022-03-11 12:31:47.260230: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 30285, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1647001907.256787801","description":"Error received from peer ipv4:10.62.127.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 30285, Output num: 0","grpc_status":3}