  1. [2023-05-02 08:07:58,522] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
  2. [2023-05-02 08:07:58,549] [INFO] [runner.py:550:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --input-model Databricks/dolly-v2-7b --deepspeed /Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json --epochs 2 --local-output-dir /local_disk0/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T08:07:50 --dbfs-output-dir /dbfs/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T08:07:50 --per-device-train-batch-size 3 --per-device-eval-batch-size 3 --logging-steps 10 --save-steps 200 --save-total-limit 20 --eval-steps 50 --warmup-steps 50 --test-size 10 --lr 5e-08
  3. [2023-05-02 08:08:01,978] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
  4. [2023-05-02 08:08:01,978] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
  5. [2023-05-02 08:08:01,978] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
  6. [2023-05-02 08:08:01,978] [INFO] [launch.py:162:main] dist_world_size=4
  7. [2023-05-02 08:08:01,978] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
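
The launch command above points at /Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json. That file's contents are not in this log, but a ZeRO stage-3 + bf16 DeepSpeed config of the kind the filename suggests typically has the following shape (shown as a Python dict; every value here is an assumption, not the actual file):

ds_config = {
    # assumed shape of a ZeRO-3 + bf16 config; not the actual ds_z3_bf16_config.json
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # CPU optimizer offload would explain the cpu_adam extension build later in this log
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "steps_per_print": 10,
}
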
  8. 2023-05-02 08:08:04.406544: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  9. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  10. 2023-05-02 08:08:04.411289: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  11. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  12. 2023-05-02 08:08:04.714580: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  13. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  14. 2023-05-02 08:08:04.720707: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
  15. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  16. 2023-05-02 08:08:12 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
  17. 2023-05-02 08:08:12 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
  18. 2023-05-02 08:08:12 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
  19. 2023-05-02 08:08:12 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
  20. 2023-05-02 08:08:13 INFO [__main__] Loading model for Databricks/dolly-v2-7b
  21. 2023-05-02 08:08:13 INFO [__main__] Loading model for Databricks/dolly-v2-7b
  22. 2023-05-02 08:08:13 INFO [__main__] Loading model for Databricks/dolly-v2-7b
  23. 2023-05-02 08:08:13 INFO [__main__] Loading model for Databricks/dolly-v2-7b
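
The four "Loading tokenizer/model" lines (one per rank) usually amount to something like the following with Hugging Face transformers; the exact loader code in training.trainer is not shown in this log, so treat the arguments as illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Databricks/dolly-v2-7b")
model = AutoModelForCausalLM.from_pretrained(
    "Databricks/dolly-v2-7b",
    use_cache=False,      # common for fine-tuning with gradient checkpointing; an assumption here
    torch_dtype="auto",
)
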
  24. 2023-05-02 08:12:13 INFO [__main__] Found max lenth: 2048
  25. 2023-05-02 08:12:13 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
  26. 2023-05-02 08:12:13 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
  27. 2023-05-02 08:12:13 INFO [__main__] Found max lenth: 2048
  28. 2023-05-02 08:12:13 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
  29. 2023-05-02 08:12:13 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
  30. 2023-05-02 08:12:13 INFO [__main__] Found max lenth: 2048
  31. 2023-05-02 08:12:13 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
  32. 2023-05-02 08:12:13 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
  33. 2023-05-02 08:12:14 INFO [__main__] Found max lenth: 2048
  34. 2023-05-02 08:12:14 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
  35. 2023-05-02 08:12:14 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
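
The "Found max lenth: 2048" and DATASET_NAME lines suggest the trainer resolves a maximum sequence length from the model config and picks the dataset from an environment variable. A sketch of that logic (names and the default are illustrative, not the actual trainer code):

import os
from transformers import AutoConfig

# 2048 is the GPT-NeoX context length reported in the log
conf = AutoConfig.from_pretrained("Databricks/dolly-v2-7b")
max_length = getattr(conf, "max_position_embeddings", 2048)

# dataset override via environment variable, as "Checking ... via env var DATASET_NAME" implies
dataset_name = os.environ.get("DATASET_NAME", "databricks/databricks-dolly-15k")  # default is an assumption
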
  36. 2023-05-02 08:12:14 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
  37. 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 250.03it/s]
  38. 2023-05-02 08:12:14 INFO [__main__] Found 1032 rows
  39. Map: 0%| | 0/1032 [00:00<?, ? examples/s]
2023-05-02 08:12:14 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
  40. 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 549.06it/s]
  41. 2023-05-02 08:12:14 INFO [__main__] Found 1032 rows
  42. Map: 0%| | 0/1032 [00:00<?, ? examples/s]
2023-05-02 08:12:14 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
  43. 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 522.00it/s]
  44. 2023-05-02 08:12:14 INFO [__main__] Found 1032 rows
  45. 2023-05-02 08:12:14 INFO [__main__] Preprocessing dataset
  46. 2023-05-02 08:12:14 INFO [__main__] Preprocessing dataset
  47. 2023-05-02 08:12:14 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
  48. 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 485.11it/s]
  49. 2023-05-02 08:12:14 INFO [__main__] Found 1032 rows
  50. 2023-05-02 08:12:14 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-373a08c6145e0816.arrow
  51. 2023-05-02 08:12:14 INFO [__main__] Preprocessing dataset
  52. 2023-05-02 08:12:14 INFO [__main__] Preprocessing dataset
  53. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1032 rows
  54. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1032 rows
  55. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1032 rows
  56. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1032 rows
  57. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
  58. 2023-05-02 08:12:16 INFO [__main__] Shuffling dataset
  59. Filter: 97%|█████████████████████▎| 1000/1032 [00:00<00:00, 1953.72 examples/s]
2023-05-02 08:12:16 INFO [__main__] Done preprocessing
  60. 2023-05-02 08:12:16 INFO [__main__] Train data size: 1020
  61. 2023-05-02 08:12:16 INFO [__main__] Test data size: 10
  62. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
  63. 2023-05-02 08:12:16 INFO [__main__] Shuffling dataset
  64. 2023-05-02 08:12:16 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ef5f0cbbbb9df9d.arrow
  65. 2023-05-02 08:12:16 INFO [__main__] Done preprocessing
  66. 2023-05-02 08:12:16 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-8ed636b6f0cf0253.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-5b275d686d6d0049.arrow
  67. 2023-05-02 08:12:16 INFO [__main__] Train data size: 1020
  68. 2023-05-02 08:12:16 INFO [__main__] Test data size: 10
  69. 2023-05-02 08:12:16 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
  70. 2023-05-02 08:12:16 INFO [__main__] Shuffling dataset
  71. 2023-05-02 08:12:16 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ef5f0cbbbb9df9d.arrow
  72. 2023-05-02 08:12:16 INFO [__main__] Done preprocessing
  73. 2023-05-02 08:12:16 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-8ed636b6f0cf0253.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-5b275d686d6d0049.arrow
  74. 2023-05-02 08:12:16 INFO [__main__] Train data size: 1020
  75. 2023-05-02 08:12:16 INFO [__main__] Test data size: 10
  76. 2023-05-02 08:12:17 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
  77. 2023-05-02 08:12:17 INFO [__main__] Shuffling dataset
  78. 2023-05-02 08:12:17 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ef5f0cbbbb9df9d.arrow
  79. 2023-05-02 08:12:17 INFO [__main__] Done preprocessing
  80. 2023-05-02 08:12:17 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-8ed636b6f0cf0253.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-5b275d686d6d0049.arrow
  81. 2023-05-02 08:12:17 INFO [__main__] Train data size: 1020
  82. 2023-05-02 08:12:17 INFO [__main__] Test data size: 10
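
Taken together, the datasets messages (1032 rows loaded, 1030 left after dropping truncated records, shuffle, 1020/10 train/test split) correspond to a pipeline roughly like the one below. This is a sketch under the assumption that records tokenizing to the full 2048-token limit are treated as truncated; the field name "text" and the seed are illustrative, the real preprocessing lives in training.trainer:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Databricks/dolly-v2-7b")
max_length = 2048  # from the "Found max lenth: 2048" lines above

dataset = load_dataset("opyate/mydataset-dolly15kFormat-noJSONSchema")["train"]  # 1032 rows

def tokenize(batch):
    # "text" is an assumed field; the real trainer builds a prompt string per record
    return tokenizer(batch["text"], truncation=True, max_length=max_length)

dataset = dataset.map(tokenize, batched=True)
# drop records that hit the limit, i.e. were truncated (1032 -> 1030 in this run)
dataset = dataset.filter(lambda rec: len(rec["input_ids"]) < max_length)
dataset = dataset.shuffle(seed=123)                     # seed value is illustrative
split = dataset.train_test_split(test_size=10)          # --test-size 10 -> 1020 train / 10 test
train_dataset, eval_dataset = split["train"], split["test"]
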
  83. [2023-05-02 08:12:17,247] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
  84. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 3
  85. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 2
  86. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 1
  87. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0
  88. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  89. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  90. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  91. 2023-05-02 08:12:18 INFO [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
  92. 2023-05-02 08:12:18 INFO [__main__] Instantiating Trainer
  93. 2023-05-02 08:12:18 INFO [__main__] Instantiating Trainer
  94. 2023-05-02 08:12:18 INFO [__main__] Training
  95. 2023-05-02 08:12:18 INFO [__main__] Instantiating Trainer
  96. 2023-05-02 08:12:18 INFO [__main__] Instantiating Trainer
  97. 2023-05-02 08:12:18 INFO [__main__] Training
  98. 2023-05-02 08:12:18 INFO [__main__] Training
  99. 2023-05-02 08:12:18 INFO [__main__] Training
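
"Instantiating Trainer" / "Training" together with the flags from the launch command map onto a Hugging Face Trainer setup along these lines (a sketch reusing the model, tokenizer and dataset objects from the earlier sketches; the actual TrainingArguments construction and data collator in training.trainer may differ):

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="/local_disk0/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T08:07:50",
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    learning_rate=5e-08,
    num_train_epochs=2,
    warmup_steps=50,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=200,
    save_total_limit=20,
    bf16=True,
    deepspeed="/Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # collator choice is an assumption
)
trainer.train()
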
  100. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 0
  101. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 1
  102. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 2
  103. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 3
  104. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  105. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  106. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  107. 2023-05-02 08:12:26 INFO [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
  108. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  109. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  110. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  111. Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
  112. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  113. Creating extension directory /root/.cache/torch_extensions/py39_cu117/cpu_adam...
  114. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  115. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  116. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  117. Detected CUDA files, patching ldflags
  118. Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
  119. Building extension module cpu_adam...
  120. Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  121. [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
  122. [2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
  123. [3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/databricks/python/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
  124. Loading extension module cpu_adam...
  125. Time to load cpu_adam op: 32.19707775115967 seconds
  126. Loading extension module cpu_adam...
  127. Time to load cpu_adam op: 32.255900621414185 seconds
  128. Loading extension module cpu_adam...
  129. Time to load cpu_adam op: 32.25137495994568 seconds
  130. Loading extension module cpu_adam...
  131. Time to load cpu_adam op: 32.25928974151611 seconds
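
The ~32 s "Time to load cpu_adam op" lines (and the CUDA 11.3 vs 11.7 notice above them) are DeepSpeed JIT-compiling its CPU Adam extension on first use, which is only needed when the optimizer is offloaded to CPU. If the per-run compile is a nuisance, the op cache can be warmed once up front; a sketch using DeepSpeed's op builder (the prebuilding idea is the suggestion here, not something the log shows):

from deepspeed.ops.op_builder import CPUAdamBuilder

# Triggers the same ninja build as the training run, populating
# /root/.cache/torch_extensions/py39_cu117/cpu_adam so later runs load it in milliseconds.
# Alternatively, installing deepspeed with DS_BUILD_CPU_ADAM=1 prebuilds the op at install time.
CPUAdamBuilder().load()
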
  132. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  133. Creating extension directory /root/.cache/torch_extensions/py39_cu117/utils...
  134. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  135. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  136. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  137. Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
  138. Building extension module utils...
  139. Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  140. [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
  141. [2/2] c++ flatten_unflatten.o -shared -L/databricks/python/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
  142. Loading extension module utils...
  143. Time to load utils op: 16.357932806015015 seconds
  144. Loading extension module utils...
  145. Time to load utils op: 16.22466206550598 seconds
  146. Loading extension module utils...
  147. Loading extension module utils...
  148. Time to load utils op: 16.424213886260986 seconds
  149. Time to load utils op: 16.423726320266724 seconds
  150. Parameter Offload: Total persistent parameters: 1712128 in 258 params
  151. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  152. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  153. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  154. No modifications detected for re-loaded extension module utils, skipping build step...
  155. No modifications detected for re-loaded extension module utils, skipping build step...
  156. No modifications detected for re-loaded extension module utils, skipping build step...
  157. Loading extension module utils...
  158. Loading extension module utils...
  159. Loading extension module utils...
  160. Time to load utils op: 0.0028858184814453125 seconds
  161. Time to load utils op: 0.002877473831176758 seconds
  162. Time to load utils op: 0.0028963088989257812 seconds
  163. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  164. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  165. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
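
The GPTNeoXTokenizerFast warning is benign: it only notes that batching through the tokenizer's __call__ is faster than encoding and padding in two separate steps. For example (GPT-NeoX has no pad token by default, so one is assigned here to make the snippet runnable):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Databricks/dolly-v2-7b")
tokenizer.pad_token = tokenizer.eos_token
texts = ["Example prompt one", "A second, longer example prompt"]

# two-step encode + pad (what the warning discourages)
encoded = [tokenizer.encode(t) for t in texts]
batch_slow = tokenizer.pad({"input_ids": encoded}, return_tensors="pt")

# single __call__ (what the warning recommends): tokenization and padding together
batch_fast = tokenizer(texts, padding=True, return_tensors="pt")
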
  166. Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
  167. No modifications detected for re-loaded extension module utils, skipping build step...
  168. Loading extension module utils...
  169. Time to load utils op: 0.0003731250762939453 seconds
  170. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  171. warnings.warn(
  172. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  173. warnings.warn(
  174. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  175. warnings.warn(
  176. You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  177. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  178. warnings.warn(
  179. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  180. warnings.warn(
  181. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  182. warnings.warn(
  183. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  184. warnings.warn(
  185. /databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  186. warnings.warn(
  187. {'loss': 1.6059, 'learning_rate': 2.942959550338895e-08, 'epoch': 0.12}
  188. {'loss': 1.6195, 'learning_rate': 3.828878651016684e-08, 'epoch': 0.24}
  189. {'loss': 1.6062, 'learning_rate': 4.3471081035858023e-08, 'epoch': 0.35}
  190. {'loss': 1.6287, 'learning_rate': 4.714797751694474e-08, 'epoch': 0.47}
  191. {'loss': 1.6036, 'learning_rate': 5e-08, 'epoch': 0.59}
  192. {'eval_loss': 1.7531249523162842, 'eval_runtime': 3.6032, 'eval_samples_per_second': 2.775, 'eval_steps_per_second': 0.278, 'epoch': 0.59}
  193. [2023-05-02 08:30:59,766] [WARNING] [stage3.py:1942:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  194. {'loss': 1.5223, 'learning_rate': 5e-08, 'epoch': 0.71}
  195. {'loss': 1.4069, 'learning_rate': 5e-08, 'epoch': 0.82}
  196. {'loss': 1.5647, 'learning_rate': 5e-08, 'epoch': 0.94}
  197. {'loss': 1.3912, 'learning_rate': 5e-08, 'epoch': 1.06}
  198. {'loss': 1.4701, 'learning_rate': 5e-08, 'epoch': 1.18}
  199. {'eval_loss': 1.545312523841858, 'eval_runtime': 3.1315, 'eval_samples_per_second': 3.193, 'eval_steps_per_second': 0.319, 'epoch': 1.18}
  200. {'loss': 1.2466, 'learning_rate': 5e-08, 'epoch': 1.29}
  201. {'loss': 1.2768, 'learning_rate': 5e-08, 'epoch': 1.41}
  202. {'loss': 1.2441, 'learning_rate': 5e-08, 'epoch': 1.53}
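
The learning rates reported over the first 50 steps above (2.94e-08 up to the --lr value of 5e-08 at --warmup-steps 50) follow a logarithmic ramp, lr(step) = 5e-08 * ln(step)/ln(50), which is consistent with a log-shaped warmup such as DeepSpeed's WarmupLR with warmup_type "log" rather than a linear warmup; the scheduler itself is not shown in this log. A quick check against the logged values:

import math

peak_lr, warmup_steps = 5e-08, 50
for step in (10, 20, 30, 40, 50):
    print(step, peak_lr * math.log(step) / math.log(warmup_steps))
# -> ~2.94e-08, 3.83e-08, 4.35e-08, 4.71e-08, 5.00e-08, matching the {'learning_rate': ...} entries above
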
  203. [2023-05-02 08:54:54,214] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2270
  204. [2023-05-02 08:54:57,391] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2271
  205. [2023-05-02 08:55:00,607] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2272
  206. [2023-05-02 08:55:00,608] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2273
  207. [2023-05-02 08:55:03,863] [ERROR] [launch.py:324:sigkill_handler] ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-1938b734-28d5-4be1-b905-9dd6942415b9/bin/python', '-u', '-m', 'training.trainer', '--local_rank=3', '--input-model', 'Databricks/dolly-v2-7b', '--deepspeed', '/Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json', '--epochs', '2', '--local-output-dir', '/local_disk0/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T08:07:50', '--dbfs-output-dir', '/dbfs/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T08:07:50', '--per-device-train-batch-size', '3', '--per-device-eval-batch-size', '3', '--logging-steps', '10', '--save-steps', '200', '--save-total-limit', '20', '--eval-steps', '50', '--warmup-steps', '50', '--test-size', '10', '--lr', '5e-08'] exits with return code = -9
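
Return code -9 means the rank was terminated with SIGKILL, which together with the earlier stage3.py warning about allocator cache flushes usually points to memory pressure (commonly the host OOM killer). A rough back-of-the-envelope, assuming ZeRO-3 with Adam state offloaded to CPU as the config name and the cpu_adam build suggest:

# rough estimate only: dolly-v2-7b (pythia-6.9b based) has ~6.9e9 parameters.
# With Adam offloaded to CPU, DeepSpeed keeps an fp32 master copy plus fp32 momentum,
# variance and staged gradients in host RAM -- roughly 16 bytes per parameter.
params = 6.9e9
bytes_per_param = 16          # 4 (fp32 copy) + 4 (momentum) + 4 (variance) + 4 (grads)
print(f"~{params * bytes_per_param / 2**30:.0f} GiB host RAM for offloaded optimizer state")
# -> ~103 GiB, before activations, the bf16 model shards and everything else on the node
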