- INFO:__main__:***** Running training *****
- INFO:__main__: Num examples = 1709784
- INFO:__main__: Num Epochs = 4.0
- INFO:__main__: Instantaneous batch size per device = 128
- INFO:__main__: Total train batch size = 1024
- Epoch 1/4
- Traceback (most recent call last):
- File "/content/scripts/run_mlm.py", line 562, in <module>
- main()
- File "/content/scripts/run_mlm.py", line 537, in main
- callbacks=[SavePretrainedCallback(output_dir=training_args.output_dir)],
- File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
- raise e.with_traceback(filtered_tb) from None
- File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1191, in _numpy
- raise core._status_to_exception(e) from None # pylint: disable=protected-access
- tensorflow.python.framework.errors_impl.InternalError: 6 root error(s) found.
- (0) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
- :{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
- [[{{node StatefulPartitionedCall}}]]
- [[MultiDeviceIteratorGetNextFromShard]]
- Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
- [[RemoteCall]]
- [[IteratorGetNextAsOptional]]
- [[tpu_compile_succeeded_assert/_14717574637004917986/_7/_463]]
- (1) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
- :{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
- [[{{node StatefulPartitionedCall}}]]
- [[MultiDeviceIteratorGetNextFromShard]]
- Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
- [[RemoteCall]]
- [[IteratorGetNextAsOptional]]
- [[cond/pivot_t/_4/_83]]
- (2) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
- :{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
- [[{{node StatefulPartitionedCall}}]]
- [[MultiDeviceIteratorGetNextFromShard]]
- Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
- [[RemoteCall]]
- [[IteratorGetNextAsOptional]]
- [[cond/else/_1/cond/IteratorGetNext_7/_160]]
- (3) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CP ... [truncated]
- Error in atexit._run_exitfuncs:
- Traceback (most recent call last):
- File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2611, in async_wait
- context().sync_executors()
- File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 694, in sync_executors
- pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
- tensorflow.python.framework.errors_impl.InternalError: 6 root error(s) found.
- (0) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
- :{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
- [[{{node StatefulPartitionedCall}}]]
- [[MultiDeviceIteratorGetNextFromShard]]
- Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
- [[RemoteCall]]
- [[IteratorGetNextAsOptional]]
- [[tpu_compile_succeeded_assert/_14717574637004917986/_7/_463]]
- (1) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
- :{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
- [[{{node StatefulPartitionedCall}}]]
- [[MultiDeviceIteratorGetNextFromShard]]
- Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
- [[RemoteCall]]
- [[IteratorGetNextAsOptional]]
- [[cond/pivot_t/_4/_83]]
- (2) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
- :{"created":"@1647001906.673822119","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1647001906.673821562","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
- [[{{node StatefulPartitionedCall}}]]
- [[MultiDeviceIteratorGetNextFromShard]]
- Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
- [[RemoteCall]]
- [[IteratorGetNextAsOptional]]
- [[cond/else/_1/cond/IteratorGetNext_7/_160]]
- (3) INTERNAL: {{function_node __inference_train_function_60878}} failed to connect to all addresses
- Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CP ... [truncated]
- 2022-03-11 12:31:47.260230: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 30285, Output num: 0
- Additional GRPC error information from remote target /job:worker/replica:0/task:0:
- :{"created":"@1647001907.256787801","description":"Error received from peer ipv4:10.62.127.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 30285, Output num: 0","grpc_status":3}