  1. [2023-05-10 14:22:47,947] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
  2. [2023-05-10 14:22:47,974] [INFO] [runner.py:550:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-f3960492-872b-41ef-8421-13ffa315278b/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --input-model Databricks/dolly-v2-3b --deepspeed /Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json --epochs 2 --local-output-dir /local_disk0/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38 --dbfs-output-dir /dbfs/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38 --per-device-train-batch-size 3 --per-device-eval-batch-size 3 --logging-steps 10 --save-steps 200 --save-total-limit 20 --eval-steps 50 --warmup-steps 50 --test-size 100 --lr 5e-08
  3. [2023-05-10 14:22:51,472] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
  4. [2023-05-10 14:22:51,472] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
  5. [2023-05-10 14:22:51,472] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
  6. [2023-05-10 14:22:51,472] [INFO] [launch.py:162:main] dist_world_size=4
  7. [2023-05-10 14:22:51,472] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
  8. 2023-05-10 14:22:54.158165: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  9. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  10. 2023-05-10 14:22:54.175951: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  11. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  12. 2023-05-10 14:22:54.186105: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  13. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  14. 2023-05-10 14:22:54.211252: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  15. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  16. 2023-05-10 14:23:02 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-3b
  17. 2023-05-10 14:23:02 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-3b
  18. 2023-05-10 14:23:02 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-3b
  19. 2023-05-10 14:23:02 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-3b
  20. 2023-05-10 14:23:02 INFO [__main__] Loading model for Databricks/dolly-v2-3b
  21. 2023-05-10 14:23:02 INFO [__main__] Loading model for Databricks/dolly-v2-3b
  22. 2023-05-10 14:23:02 INFO [__main__] Loading model for Databricks/dolly-v2-3b
  23. 2023-05-10 14:23:02 INFO [__main__] Loading model for Databricks/dolly-v2-3b
  24. Downloading (…)lve/main/config.json: 100%|██████| 819/819 [00:00<00:00, 251kB/s]
  25. Downloading pytorch_model.bin: 48%|███▍ | 2.75G/5.68G [01:08<01:15, 39.3MB/s]
  26.  
  27. *** WARNING: max output size exceeded, skipping output. ***
  28.  
  29. ansRepeatedFields-33ac744e6fc2c747/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-deaccf5a92a20394.arrow
  30. 2023-05-10 14:26:02 INFO [__main__] Train data size: 4375
  31. 2023-05-10 14:26:02 INFO [__main__] Test data size: 100
  32. [2023-05-10 14:26:02,273] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
  33. 2023-05-10 14:26:02 INFO [__main__] Processed dataset has 4475 rows after filtering for truncated records
  34. 2023-05-10 14:26:02 INFO [__main__] Shuffling dataset
  35. 2023-05-10 14:26:02 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--full_opyate-dolly15kFormat-singleField-sansRepeatedFields-33ac744e6fc2c747/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-89d59fde5ca9fa60.arrow
  36. 2023-05-10 14:26:02 INFO [__main__] Done preprocessing
  37. 2023-05-10 14:26:02 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--full_opyate-dolly15kFormat-singleField-sansRepeatedFields-33ac744e6fc2c747/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-18b33dbd2e0f55ba.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--full_opyate-dolly15kFormat-singleField-sansRepeatedFields-33ac744e6fc2c747/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-deaccf5a92a20394.arrow
  38. 2023-05-10 14:26:02 INFO [__main__] Train data size: 4375
  39. 2023-05-10 14:26:02 INFO [__main__] Test data size: 100
  40. 2023-05-10 14:26:02 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 2
  41. 2023-05-10 14:26:02 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 3
  42. 2023-05-10 14:26:03 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 1
  43. 2023-05-10 14:26:03 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0
  44. 2023-05-10 14:26:03 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  45. 2023-05-10 14:26:03 INFO [torch.distributed.distributed_c10d] Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  46. 2023-05-10 14:26:03 INFO [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  47. 2023-05-10 14:26:03 INFO [torch.distributed.distributed_c10d] Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  48. 2023-05-10 14:26:03 INFO [__main__] Instantiating Trainer
  49. 2023-05-10 14:26:03 INFO [__main__] Instantiating Trainer
  50. 2023-05-10 14:26:03 INFO [__main__] Instantiating Trainer
  51. 2023-05-10 14:26:03 INFO [__main__] Instantiating Trainer
  52. 2023-05-10 14:26:03 INFO [__main__] Training
  53. 2023-05-10 14:26:03 INFO [__main__] Training
  54. 2023-05-10 14:26:03 INFO [__main__] Training
  55. 2023-05-10 14:26:03 INFO [__main__] Training
  56. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 1
  57. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 3
  58. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 0
  59. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 2
  60. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  61. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  62. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  63. 2023-05-10 14:26:08 INFO [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  64. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  65. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  66. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  67. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  68. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  69. Creating extension directory /root/.cache/torch_extensions/py39_cu117/cpu_adam...
  70. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  71. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  72. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  73. Detected CUDA files, patching ldflags
  74. Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
  75. Building extension module cpu_adam...
  76. Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  77. [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-f3960492-872b-41ef-8421-13ffa315278b/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-f3960492-872b-41ef-8421-13ffa315278b/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
  78. [2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-f3960492-872b-41ef-8421-13ffa315278b/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-f3960492-872b-41ef-8421-13ffa315278b/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
  79. [3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/databricks/python/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
  80. Loading extension module cpu_adam...
  81. Time to load cpu_adam op: 29.23264479637146 seconds
  82. Loading extension module cpu_adam...
  83. Time to load cpu_adam op: 29.250458478927612 seconds
  84. Loading extension module cpu_adam...
  85. Time to load cpu_adam op: 29.303073167800903 seconds
  86. Loading extension module cpu_adam...
  87. Time to load cpu_adam op: 29.34495234489441 seconds
  88. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  89. Creating extension directory /root/.cache/torch_extensions/py39_cu117/utils...
  90. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  91. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  92. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  93. Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
  94. Building extension module utils...
  95. Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  96. [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-f3960492-872b-41ef-8421-13ffa315278b/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
  97. [2/2] c++ flatten_unflatten.o -shared -L/databricks/python/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
  98. Loading extension module utils...
  99. Time to load utils op: 15.866542100906372 seconds
  100. Loading extension module utils...
  101. Time to load utils op: 15.822625398635864 seconds
  102. Loading extension module utils...
  103. Loading extension module utils...
  104. Time to load utils op: 15.722923040390015 seconds
  105. Time to load utils op: 15.924039602279663 seconds
  106. Parameter Offload: Total persistent parameters: 1070080 in 258 params
  107. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  108. No modifications detected for re-loaded extension module utils, skipping build step...
  109. Loading extension module utils...
  110. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  111. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  112. Time to load utils op: 0.00039696693420410156 seconds
  113. No modifications detected for re-loaded extension module utils, skipping build step...
  114. No modifications detected for re-loaded extension module utils, skipping build step...
  115. Loading extension module utils...
  116. Loading extension module utils...
  117. Time to load utils op: 0.0004241466522216797 seconds
  118. Time to load utils op: 0.00046563148498535156 seconds
  119. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  120. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  121. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  122. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  123. warnings.warn(
  124. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  125. warnings.warn(
  126. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  127. warnings.warn(
  128. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  129. No modifications detected for re-loaded extension module utils, skipping build step...
  130. Loading extension module utils...
  131. Time to load utils op: 0.000339508056640625 seconds
  132. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  133. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  134. warnings.warn(
  135. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  136. warnings.warn(
  137. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  138. warnings.warn(
  139. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  140. warnings.warn(
  141. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  142. warnings.warn(
  143. {'loss': 4.891, 'learning_rate': 2.942959550338895e-08, 'epoch': 0.03}
  144. {'loss': 5.2213, 'learning_rate': 3.828878651016684e-08, 'epoch': 0.05}
  145. {'loss': 4.6145, 'learning_rate': 4.3471081035858023e-08, 'epoch': 0.08}
  146. {'loss': 4.5668, 'learning_rate': 4.714797751694474e-08, 'epoch': 0.11}
  147. {'loss': 4.4887, 'learning_rate': 5e-08, 'epoch': 0.14}
  148. {'eval_loss': 4.907109260559082, 'eval_runtime': 12.7109, 'eval_samples_per_second': 7.867, 'eval_steps_per_second': 0.708, 'epoch': 0.14}
  149. [2023-05-10 14:34:19,666] [WARNING] [stage3.py:1942:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  150. {'loss': 4.7064, 'learning_rate': 5e-08, 'epoch': 0.16}
  151. {'loss': 4.5375, 'learning_rate': 5e-08, 'epoch': 0.19}
  152. {'loss': 4.2311, 'learning_rate': 5e-08, 'epoch': 0.22}
  153. {'loss': 4.0389, 'learning_rate': 5e-08, 'epoch': 0.25}
  154. {'loss': 4.2023, 'learning_rate': 5e-08, 'epoch': 0.27}
  155. {'eval_loss': 3.878671884536743, 'eval_runtime': 12.2988, 'eval_samples_per_second': 8.131, 'eval_steps_per_second': 0.732, 'epoch': 0.27}
  156. {'loss': 3.3199, 'learning_rate': 5e-08, 'epoch': 0.3}
  157. {'loss': 3.0863, 'learning_rate': 5e-08, 'epoch': 0.33}
  158. [2023-05-10 14:44:22,749] [WARNING] [stage3.py:1942:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  159. {'loss': 3.0975, 'learning_rate': 5e-08, 'epoch': 0.36}
  160. {'loss': 3.252, 'learning_rate': 5e-08, 'epoch': 0.38}
  161. {'loss': 2.7237, 'learning_rate': 5e-08, 'epoch': 0.41}
  162. {'eval_loss': 3.129335880279541, 'eval_runtime': 12.2544, 'eval_samples_per_second': 8.16, 'eval_steps_per_second': 0.734, 'epoch': 0.41}
  163. {'loss': 2.6078, 'learning_rate': 5e-08, 'epoch': 0.44}
  164. {'loss': 2.0998, 'learning_rate': 5e-08, 'epoch': 0.47}
  165. {'loss': 1.67, 'learning_rate': 5e-08, 'epoch': 0.49}
  166. {'loss': 1.2625, 'learning_rate': 5e-08, 'epoch': 0.52}
  167. {'loss': 1.4132, 'learning_rate': 5e-08, 'epoch': 0.55}
  168. {'eval_loss': 1.5732324123382568, 'eval_runtime': 12.2615, 'eval_samples_per_second': 8.156, 'eval_steps_per_second': 0.734, 'epoch': 0.55}
  169. /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  170. warnings.warn(
  171. /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  172. warnings.warn(
  173. /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  174. warnings.warn(
  175. /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  176. warnings.warn(
  177. {'loss': 1.3533, 'learning_rate': 5e-08, 'epoch': 0.58}
  178. {'loss': 0.9929, 'learning_rate': 5e-08, 'epoch': 0.6}
  179. {'loss': 0.9855, 'learning_rate': 5e-08, 'epoch': 0.63}
  180. {'loss': 0.985, 'learning_rate': 5e-08, 'epoch': 0.66}
  181. {'loss': 1.3447, 'learning_rate': 5e-08, 'epoch': 0.68}
  182. {'eval_loss': 1.2489745616912842, 'eval_runtime': 12.2325, 'eval_samples_per_second': 8.175, 'eval_steps_per_second': 0.736, 'epoch': 0.68}
  183. {'loss': 0.9832, 'learning_rate': 5e-08, 'epoch': 0.71}
  184. {'loss': 0.9459, 'learning_rate': 5e-08, 'epoch': 0.74}
  185. {'loss': 0.7739, 'learning_rate': 5e-08, 'epoch': 0.77}
  186. {'loss': 0.7065, 'learning_rate': 5e-08, 'epoch': 0.79}
  187. {'loss': 0.7643, 'learning_rate': 5e-08, 'epoch': 0.82}
  188. {'eval_loss': 1.0137401819229126, 'eval_runtime': 12.2525, 'eval_samples_per_second': 8.162, 'eval_steps_per_second': 0.735, 'epoch': 0.82}
  189. {'loss': 0.7002, 'learning_rate': 5e-08, 'epoch': 0.85}
  190. {'loss': 0.6191, 'learning_rate': 5e-08, 'epoch': 0.88}
  191. {'loss': 0.543, 'learning_rate': 5e-08, 'epoch': 0.9}
  192. {'loss': 0.5221, 'learning_rate': 5e-08, 'epoch': 0.93}
  193. {'loss': 0.4356, 'learning_rate': 5e-08, 'epoch': 0.96}
  194. {'eval_loss': 0.7561206221580505, 'eval_runtime': 12.2186, 'eval_samples_per_second': 8.184, 'eval_steps_per_second': 0.737, 'epoch': 0.96}
  195. {'loss': 0.5103, 'learning_rate': 5e-08, 'epoch': 0.99}
  196. {'loss': 0.4477, 'learning_rate': 5e-08, 'epoch': 1.01}
  197. {'loss': 0.4288, 'learning_rate': 5e-08, 'epoch': 1.04}
  198. {'loss': 0.4839, 'learning_rate': 5e-08, 'epoch': 1.07}
  199. {'loss': 0.5106, 'learning_rate': 5e-08, 'epoch': 1.1}
  200. {'eval_loss': 0.6384289264678955, 'eval_runtime': 12.2222, 'eval_samples_per_second': 8.182, 'eval_steps_per_second': 0.736, 'epoch': 1.1}
  201. {'loss': 0.3922, 'learning_rate': 5e-08, 'epoch': 1.12}
  202. {'loss': 0.2716, 'learning_rate': 5e-08, 'epoch': 1.15}
  203. {'loss': 0.3343, 'learning_rate': 5e-08, 'epoch': 1.18}
  204. {'loss': 0.3212, 'learning_rate': 5e-08, 'epoch': 1.21}
  205. {'loss': 0.3245, 'learning_rate': 5e-08, 'epoch': 1.23}
  206. {'eval_loss': 0.5844647288322449, 'eval_runtime': 12.2113, 'eval_samples_per_second': 8.189, 'eval_steps_per_second': 0.737, 'epoch': 1.23}
  207. {'loss': 0.2212, 'learning_rate': 5e-08, 'epoch': 1.26}
  208. {'loss': 0.3507, 'learning_rate': 5e-08, 'epoch': 1.29}
  209. {'loss': 0.2208, 'learning_rate': 5e-08, 'epoch': 1.32}
  210. {'loss': 0.1916, 'learning_rate': 5e-08, 'epoch': 1.34}
  211. {'loss': 0.2722, 'learning_rate': 5e-08, 'epoch': 1.37}
  212. {'eval_loss': 0.5478832721710205, 'eval_runtime': 12.2705, 'eval_samples_per_second': 8.15, 'eval_steps_per_second': 0.733, 'epoch': 1.37}
  213. {'loss': 0.2137, 'learning_rate': 5e-08, 'epoch': 1.4}
  214. {'loss': 0.3299, 'learning_rate': 5e-08, 'epoch': 1.42}
  215. {'loss': 0.2622, 'learning_rate': 5e-08, 'epoch': 1.45}
  216. {'loss': 0.178, 'learning_rate': 5e-08, 'epoch': 1.48}
  217. {'loss': 0.2463, 'learning_rate': 5e-08, 'epoch': 1.51}
  218. {'eval_loss': 0.5222082734107971, 'eval_runtime': 12.2953, 'eval_samples_per_second': 8.133, 'eval_steps_per_second': 0.732, 'epoch': 1.51}
  219. {'loss': 0.3726, 'learning_rate': 5e-08, 'epoch': 1.53}
  220. {'loss': 0.1293, 'learning_rate': 5e-08, 'epoch': 1.56}
  221. {'loss': 0.1672, 'learning_rate': 5e-08, 'epoch': 1.59}
  222. {'loss': 0.2082, 'learning_rate': 5e-08, 'epoch': 1.62}
  223. {'loss': 0.2532, 'learning_rate': 5e-08, 'epoch': 1.64}
  224. {'eval_loss': 0.5132836699485779, 'eval_runtime': 12.2652, 'eval_samples_per_second': 8.153, 'eval_steps_per_second': 0.734, 'epoch': 1.64}
  225. {'loss': 0.1827, 'learning_rate': 5e-08, 'epoch': 1.67}
  226. {'loss': 0.3797, 'learning_rate': 5e-08, 'epoch': 1.7}
  227. {'loss': 0.2611, 'learning_rate': 5e-08, 'epoch': 1.73}
  228. {'loss': 0.2326, 'learning_rate': 5e-08, 'epoch': 1.75}
  229. {'loss': 0.2182, 'learning_rate': 5e-08, 'epoch': 1.78}
  230. {'eval_loss': 0.4874853491783142, 'eval_runtime': 12.2741, 'eval_samples_per_second': 8.147, 'eval_steps_per_second': 0.733, 'epoch': 1.78}
  231. {'loss': 0.2055, 'learning_rate': 5e-08, 'epoch': 1.81}
  232. {'loss': 0.2792, 'learning_rate': 5e-08, 'epoch': 1.84}
  233. {'loss': 0.1139, 'learning_rate': 5e-08, 'epoch': 1.86}
  234. {'loss': 0.3153, 'learning_rate': 5e-08, 'epoch': 1.89}
  235. {'loss': 0.1963, 'learning_rate': 5e-08, 'epoch': 1.92}
  236. {'eval_loss': 0.4782702624797821, 'eval_runtime': 12.2463, 'eval_samples_per_second': 8.166, 'eval_steps_per_second': 0.735, 'epoch': 1.92}
  237. {'loss': 0.3411, 'learning_rate': 5e-08, 'epoch': 1.95}
  238. {'loss': 0.1968, 'learning_rate': 5e-08, 'epoch': 1.97}
  239. {'loss': 0.2507, 'learning_rate': 5e-08, 'epoch': 2.0}
  240. {'train_runtime': 6183.9935, 'train_samples_per_second': 1.415, 'train_steps_per_second': 0.118, 'train_loss': 1.2808417751364511, 'epoch': 2.0}
  241. 2023-05-10 16:10:15 INFO [__main__] Saving Model to /local_disk0/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  242. 2023-05-10 16:10:15 INFO [__main__] Saving Model to /local_disk0/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  243. 2023-05-10 16:10:15 INFO [__main__] Saving Model to /local_disk0/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  244. 2023-05-10 16:10:15 INFO [__main__] Saving Model to /local_disk0/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  245. 2023-05-10 16:10:20 INFO [__main__] Saving Model to /dbfs/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  246. 2023-05-10 16:10:20 INFO [__main__] Saving Model to /dbfs/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  247. 2023-05-10 16:10:20 INFO [__main__] Saving Model to /dbfs/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  248. 2023-05-10 16:10:25 INFO [__main__] Saving Model to /dbfs/dolly_training/dolly_full_opyate-dolly15kFormat-singleField-sansRepeatedFields__2023-05-10T14:22:38
  249. 2023-05-10 16:10:32 INFO [__main__] Done.
  250. 2023-05-10 16:10:32 INFO [__main__] Done.
  251. 2023-05-10 16:10:32 INFO [__main__] Done.
  252. [2023-05-10 16:10:36,554] [INFO] [launch.py:350:main] Process 3529 exits successfully.
  253. [2023-05-10 16:10:38,557] [INFO] [launch.py:350:main] Process 3528 exits successfully.
  254. [2023-05-10 16:10:38,557] [INFO] [launch.py:350:main] Process 3530 exits successfully.
  255. 2023-05-10 16:10:43 INFO [__main__] Done.
  256. [2023-05-10 16:10:47,566] [INFO] [launch.py:350:main] Process 3527 exits successfully.