error.log

a guest
Nov 24th, 2021
  1. "\uFFED""\uFFED"
  2. [1]<stderr>:  eval_features_file: gen_enfr/data_src.valid.tok
  3. [2]<stderr>:  decoding_subword_token_is_spacer: false
  4. [1]<stderr>:  eval_labels_file: gen_enfr/data_tgt.valid.tok
  5. [2]<stderr>:  label_smoothing: 0.1
  6. [1]<stderr>:  source_vocabulary: gen_enfr/bpe_src.vocab
  7. [2]<stderr>:  learning_rate: 1.0
  8. [1]<stderr>:  target_vocabulary: gen_enfr/bpe_tgt.vocab
  9. [2]<stderr>:  length_penalty: 0.6
  10. [1]<stderr>:  train_features_file: gen_enfr/data_src.train.tok
  11. [2]<stderr>:  max_margin_eta: 0.1
  12. [1]<stderr>:  train_labels_file: gen_enfr/data_tgt.train.tok
  13. [2]<stderr>:  maximum_decoding_length: 256
  14. [1]<stderr>:eval:
  15. [2]<stderr>:  num_hypotheses: 1
  16. [1]<stderr>:  batch_size: 32
  17. [2]<stderr>:  optimizer: Adam
  18. [1]<stderr>:  batch_type: examples
  19. [2]<stderr>:  optimizer_params:
  20. [1]<stderr>:  early_stopping:
  21. [2]<stderr>:    beta_1: 0.9
  22. [1]<stderr>:    metric: bleu
  23. [2]<stderr>:    beta_2: 0.998
  24. [1]<stderr>:    min_improvement: 0.01
  25. [2]<stderr>:score:
  26. [1]<stderr>:    steps: 3
  27. [2]<stderr>:  batch_size: 64
  28. [1]<stderr>:  external_evaluators: BLEU
  29. [2]<stderr>:  batch_type: examples
  30. [1]<stderr>:  length_bucket_width: 5
  31. [2]<stderr>:  length_bucket_width: 5
  32. [1]<stderr>:  save_eval_predictions: false
  33. [2]<stderr>:train:
  34. [1]<stderr>:  steps: 5000
  35. [2]<stderr>:  average_last_checkpoints: 8
  36. [1]<stderr>:infer:
  37. [2]<stderr>:  batch_size: 4096
  38. [1]<stderr>:  batch_size: 32
  39. [2]<stderr>:  batch_type: tokens
  40. [1]<stderr>:  batch_type: examples
  41. [2]<stderr>:  effective_batch_size: 25000
  42. [1]<stderr>:  bucket_width: 5
  43. [2]<stderr>:  keep_checkpoint_max: 8
  44. [1]<stderr>:  length_bucket_width: 5
  45. [2]<stderr>:  length_bucket_width: 1
  46. [1]<stderr>:model_dir: gen_enfr/run
  47. [2]<stderr>:  max_step: 200000
  48. [1]<stderr>:params:
  49. [2]<stderr>:  maximum_features_length: 256
  50. [1]<stderr>:  average_loss_in_time: true
  51. [2]<stderr>:  maximum_labels_length: 256
  52. [1]<stderr>:  beam_width: 2
  53. [2]<stderr>:  moving_average_decay: 0.9999
  54. [1]<stderr>:  contrastive_learning: false
  55. [2]<stderr>:  replace_unknown_target: true
  56. [1]<stderr>:  coverage_penalty: 0
  57. [2]<stderr>:  sample_buffer_size: 500000
  58. [1]<stderr>:  decay_params:
  59. [2]<stderr>:  save_checkpoints_steps: 1000
  60. [1]<stderr>:    model_dim: 1024
  61. [2]<stderr>:  save_summary_steps: 200
  62. [1]<stderr>:    warmup_steps: 8000
  63. [2]<stderr>:  single_pass: false
  64. [1]<stderr>:  decay_type: NoamDecay
  65. [2]<stderr>:
  66. [1]<stderr>:  decoding_subword_token: "\uFFED"
  67. [1]<stderr>:  decoding_subword_token_is_spacer: false
  68. [1]<stderr>:  label_smoothing: 0.1
  69. [1]<stderr>:  learning_rate: 1.0
  70. [1]<stderr>:  length_penalty: 0.6
  71. [1]<stderr>:  max_margin_eta: 0.1
  72. [1]<stderr>:  maximum_decoding_length: 256
  73. [1]<stderr>:  num_hypotheses: 1
  74. [1]<stderr>:  optimizer: Adam
  75. [1]<stderr>:  optimizer_params:
  76. [1]<stderr>:    beta_1: 0.9
  77. [1]<stderr>:    beta_2: 0.998
  78. [1]<stderr>:score:
  79. [1]<stderr>:  batch_size: 64
  80. [1]<stderr>:  batch_type: examples
  81. [1]<stderr>:  length_bucket_width: 5
  82. [1]<stderr>:train:
  83. [1]<stderr>:  average_last_checkpoints: 8
  84. [1]<stderr>:  batch_size: 4096
  85. [1]<stderr>:  batch_type: tokens
  86. [1]<stderr>:  effective_batch_size: 25000
  87. [1]<stderr>:  keep_checkpoint_max: 8
  88. [1]<stderr>:  length_bucket_width: 1
  89. [1]<stderr>:  max_step: 200000
  90. [1]<stderr>:  maximum_features_length: 256
  91. [1]<stderr>:  maximum_labels_length: 256
  92. [1]<stderr>:  moving_average_decay: 0.9999
  93. [1]<stderr>:  replace_unknown_target: true
  94. [1]<stderr>:  sample_buffer_size: 500000
  95. [1]<stderr>:  save_checkpoints_steps: 1000
  96. [1]<stderr>:  save_summary_steps: 200
  97. [1]<stderr>:  single_pass: false
  98. [1]<stderr>:
  99. [2]<stderr>:INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
  100. [2]<stderr>:Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
  101. [1]<stderr>:INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
  102. [1]<stderr>:Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
  103. [3]<stderr>:2021-11-24 09:32:41.501168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
  104. [3]<stderr>:2021-11-24 09:32:41.501227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
  105. [3]<stderr>:2021-11-24 09:32:41.501238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 2 3
  106. [3]<stderr>:2021-11-24 09:32:41.501244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N Y Y Y
  107. [3]<stderr>:2021-11-24 09:32:41.501248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   Y N Y Y
  108. [3]<stderr>:2021-11-24 09:32:41.501252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2:   Y Y N Y
  109. [3]<stderr>:2021-11-24 09:32:41.501255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   Y Y Y N
  110. [3]<stderr>:2021-11-24 09:32:41.502881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21913 MB memory) -> physical GPU (device: 0, name: Quadro RTX 6000, pci bus id: 0000:8e:00.0, compute capability: 7.5)
  111. [3]<stderr>:2021-11-24 09:32:41.503260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 21913 MB memory) -> physical GPU (device: 1, name: Quadro RTX 6000, pci bus id: 0000:9c:00.0, compute capability: 7.5)
  112. [3]<stderr>:2021-11-24 09:32:41.503595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 21913 MB memory) -> physical GPU (device: 2, name: Quadro RTX 6000, pci bus id: 0000:ce:00.0, compute capability: 7.5)
  113. [3]<stderr>:2021-11-24 09:32:41.503924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 21913 MB memory) -> physical GPU (device: 3, name: Quadro RTX 6000, pci bus id: 0000:dc:00.0, compute capability: 7.5)
  114. [3]<stderr>:INFO:tensorflow:Using parameters:
  115. [3]<stderr>:data:
  116. [3]<stderr>:  eval_features_file: gen_enfr/data_src.valid.tok
  117. [3]<stderr>:  eval_labels_file: gen_enfr/data_tgt.valid.tok
  118. [3]<stderr>:  source_vocabulary: gen_enfr/bpe_src.vocab
  119. [3]<stderr>:  target_vocabulary: gen_enfr/bpe_tgt.vocab
  120. [3]<stderr>:  train_features_file: gen_enfr/data_src.train.tok
  121. [3]<stderr>:  train_labels_file: gen_enfr/data_tgt.train.tok
  122. [3]<stderr>:eval:
  123. [3]<stderr>:  batch_size: 32
  124. [3]<stderr>:  batch_type: examples
  125. [3]<stderr>:  early_stopping:
  126. [3]<stderr>:    metric: bleu
  127. [3]<stderr>:    min_improvement: 0.01
  128. [3]<stderr>:    steps: 3
  129. [3]<stderr>:  external_evaluators: BLEU
  130. [3]<stderr>:  length_bucket_width: 5
  131. [3]<stderr>:  save_eval_predictions: false
  132. [3]<stderr>:  steps: 5000
  133. [3]<stderr>:infer:
  134. [3]<stderr>:  batch_size: 32
  135. [3]<stderr>:  batch_type: examples
  136. [3]<stderr>:  bucket_width: 5
  137. [3]<stderr>:  length_bucket_width: 5
  138. [3]<stderr>:model_dir: gen_enfr/run
  139. [3]<stderr>:params:
  140. [3]<stderr>:  average_loss_in_time: true
  141. [3]<stderr>:  beam_width: 2
  142. [3]<stderr>:  contrastive_learning: false
  143. [3]<stderr>:  coverage_penalty: 0
  144. [3]<stderr>:  decay_params:
  145. [3]<stderr>:    model_dim: 1024
  146. [3]<stderr>:    warmup_steps: 8000
  147. [3]<stderr>:  decay_type: NoamDecay
  148. [3]<stderr>:  decoding_subword_token: "\uFFED"
  [...] 'TF_GPU_ALLOCATOR=cuda_malloc_async' [...]
  [...] File "notebooks/pipeline/train.py", line 22, in <module>
  149. [3]<stderr>:    Trainer(**config['train']['trainer']).run()
  150. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/cdtnice/common/pipeline.py", line 34, in wrapped_f
  151. [3]<stderr>:    raise e
  152. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/cdtnice/common/pipeline.py", line 30, in wrapped_f
  153. [3]<stderr>:    function_return_value = f(*args, **kwargs)
  154. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/cdtnice/train/train.py", line 48, in run
  155. [3]<stderr>:    final_model_dir, train_summary = runner.train(
  156. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/runner.py", line 276, in train
  157. [3]<stderr>:    summary = trainer(
  158. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/training.py", line 121, in __call__
  159. [3]<stderr>:    for i, loss in enumerate(
  160. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/training.py", line 262, in _steps
  161. [0]<stderr>:2021-11-24 09:33:01.803181: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.2K (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  162. [3]<stderr>:    loss = forward_fn()
  163. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
  164. [3]<stderr>:    result = self._call(*args, **kwds)
  165. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 933, in _call
  166. [3]<stderr>:    self._initialize(args, kwds, add_initializers_to=initializers)
  167. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 763, in _initialize
  168. [3]<stderr>:    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  169. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
  170. [3]<stderr>:    graph_function, _ = self._maybe_define_function(args, kwargs)
  171. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
  172. [3]<stderr>:    graph_function = self._create_graph_function(args, kwargs)
  173. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3279, in _create_graph_function
  174. [3]<stderr>:    func_graph_module.func_graph_from_py_func(
  175. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
  176. [3]<stderr>:    func_outputs = python_func(*func_args, **func_kwargs)
  177. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
  178. [3]<stderr>:    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  179. [3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
  180. [0]<stderr>:2021-11-24 09:33:01.805553: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.2K (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  181. [3]<stderr>:    raise e.ag_error_metadata.to_exception(e)
  182. [3]<stderr>:tensorflow.python.framework.errors_impl.InternalError: in user code:
  183. [3]<stderr>:
  184. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/training.py:247 _forward  *
  185. [3]<stderr>:        target,
  186. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/training.py:329 _forward  *
  187. [3]<stderr>:        loss, gradients = self._compute_gradients(
  188. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/training.py:311 _compute_gradients  *
  189. [3]<stderr>:        reported_loss, gradients = self._model.compute_gradients(
  190. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/models/model.py:223 _compute_loss  *
  191. [3]<stderr>:        train_loss, report_loss = self.compute_training_loss(
  192. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/models/model.py:263 compute_training_loss  *
  193. [3]<stderr>:        outputs, _ = self(features, labels, training=True, step=step)
  194. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/models/model.py:102 __call__  *
  195. [3]<stderr>:        outputs, predictions = super().__call__(
  196. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py:1023 __call__  **
  197. [3]<stderr>:        self._maybe_build(inputs)
  198. [3]<stderr>:    /opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py:2625 _maybe_build
  199. [3]<stderr>:        self.build(input_shapes)  # pylint:disable=not-callable
  200.   File "/opt/mt/miniconda3/envs/horovod/bin/horovodrun", line 8, in <module>
  201.     sys.exit(run_commandline())
  202.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
  203.     _run(args)
  204.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
  205.     return _run_static(args)
  206.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
  207.     _launch_job(args, settings, nics, command)
  208.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
  209.     run_controller(args.use_gloo, gloo_run_fn,
  210.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
  211.     gloo_run()
  212.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
  213.     gloo_run(settings, nics, env, driver_ip, command)
  214.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
  215.     launch_gloo(command, exec_command, settings, nics, env, server_ip)
  216.   File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
  217.     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
  218. RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
  219. Process name: 3
  220. Exit code: 1
  221.  
  222.  