[2023-05-02 09:20:39,837] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-02 09:20:39,864] [INFO] [runner.py:550:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --input-model Databricks/dolly-v2-7b --deepspeed /Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json --epochs 2 --local-output-dir /local_disk0/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T09:20:32 --dbfs-output-dir /dbfs/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T09:20:32 --per-device-train-batch-size 3 --per-device-eval-batch-size 3 --logging-steps 10 --save-steps 200 --save-total-limit 20 --eval-steps 50 --warmup-steps 50 --test-size 10 --lr 5e-07
[2023-05-02 09:20:43,216] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-02 09:20:43,216] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-02 09:20:43,216] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-02 09:20:43,217] [INFO] [launch.py:162:main] dist_world_size=4
[2023-05-02 09:20:43,217] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
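
The launch command points at ds_z3_bf16_config.json, which is not shown in this paste. A minimal ZeRO stage-3 + bf16 DeepSpeed config, written here as the Python dict that produces such a JSON file, might look like the sketch below; the keys are standard DeepSpeed options, but the specific values are assumptions (the cpu_adam build and "Parameter Offload" lines further down suggest CPU offload was enabled).

    import json

    # Hedged sketch of a ZeRO-3 + bf16 config; illustrative values only.
    ds_config = {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},  # assumed: implied by the cpu_adam build below
            "offload_param": {"device": "cpu"},      # assumed: implied by "Parameter Offload" below
            "overlap_comm": True,
        },
        # "auto" lets the HF Trainer integration fill these from its own args.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    with open("ds_z3_bf16_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)
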
2023-05-02 09:20:45.707091: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-02 09:20:45.735839: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-02 09:20:45.756864: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-02 09:20:45.778718: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-02 09:20:53 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
2023-05-02 09:20:53 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
2023-05-02 09:20:53 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
2023-05-02 09:20:53 INFO [__main__] Loading tokenizer for Databricks/dolly-v2-7b
2023-05-02 09:20:54 INFO [__main__] Loading model for Databricks/dolly-v2-7b
2023-05-02 09:20:54 INFO [__main__] Loading model for Databricks/dolly-v2-7b
2023-05-02 09:20:54 INFO [__main__] Loading model for Databricks/dolly-v2-7b
2023-05-02 09:20:54 INFO [__main__] Loading model for Databricks/dolly-v2-7b
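
Each of the four ranks loads the tokenizer and model independently, which is why every message appears four times. What those two steps amount to is the standard transformers pattern; a minimal sketch (any kwargs beyond the model id, such as dtype, are not shown in this log and would be assumptions):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "Databricks/dolly-v2-7b"

    # The "Loading tokenizer/model" steps above, in their simplest form.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
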
2023-05-02 09:24:48 INFO [__main__] Found max lenth: 2048
2023-05-02 09:24:48 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
2023-05-02 09:24:48 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
2023-05-02 09:24:49 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 200.42it/s]
2023-05-02 09:24:49 INFO [__main__] Found 1032 rows
2023-05-02 09:24:50 INFO [__main__] Preprocessing dataset
2023-05-02 09:24:50 INFO [__main__] Found max lenth: 2048
2023-05-02 09:24:50 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
2023-05-02 09:24:50 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
Map: 0%| | 0/1032 [00:00<?, ? examples/s]
2023-05-02 09:24:50 INFO [__main__] Found max lenth: 2048
2023-05-02 09:24:50 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
2023-05-02 09:24:50 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
2023-05-02 09:24:50 INFO [__main__] Found max lenth: 2048
2023-05-02 09:24:50 INFO [__main__] Checking if dataset is specific via env var DATASET_NAME
2023-05-02 09:24:50 INFO [__main__] Yes: opyate/mydataset-dolly15kFormat-noJSONSchema
2023-05-02 09:24:51 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 556.05it/s]
2023-05-02 09:24:51 INFO [__main__] Found 1032 rows
2023-05-02 09:24:51 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-373a08c6145e0816.arrow
2023-05-02 09:24:51 INFO [__main__] Preprocessing dataset
2023-05-02 09:24:51 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 505.09it/s]
2023-05-02 09:24:51 INFO [__main__] Found 1032 rows
2023-05-02 09:24:51 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-373a08c6145e0816.arrow
2023-05-02 09:24:51 INFO [__main__] Preprocessing dataset
Map: 0%| | 0/1032 [00:00<?, ? examples/s]
2023-05-02 09:24:51 WARNING [datasets.builder] Found cached dataset parquet (/root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 559.61it/s]
2023-05-02 09:24:51 INFO [__main__] Found 1032 rows
2023-05-02 09:24:51 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-373a08c6145e0816.arrow
2023-05-02 09:24:51 INFO [__main__] Preprocessing dataset
2023-05-02 09:24:51 INFO [__main__] Processed dataset has 1032 rows
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
2023-05-02 09:24:52 INFO [__main__] Shuffling dataset
2023-05-02 09:24:52 INFO [__main__] Done preprocessing
2023-05-02 09:24:52 INFO [__main__] Train data size: 1020
2023-05-02 09:24:52 INFO [__main__] Test data size: 10
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1032 rows
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-462aabd08dd4ec96.arrow
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
2023-05-02 09:24:52 INFO [__main__] Shuffling dataset
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ef5f0cbbbb9df9d.arrow
2023-05-02 09:24:52 INFO [__main__] Done preprocessing
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-8ed636b6f0cf0253.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-5b275d686d6d0049.arrow
2023-05-02 09:24:52 INFO [__main__] Train data size: 1020
2023-05-02 09:24:52 INFO [__main__] Test data size: 10
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1032 rows
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-462aabd08dd4ec96.arrow
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
2023-05-02 09:24:52 INFO [__main__] Shuffling dataset
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ef5f0cbbbb9df9d.arrow
2023-05-02 09:24:52 INFO [__main__] Done preprocessing
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-8ed636b6f0cf0253.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-5b275d686d6d0049.arrow
2023-05-02 09:24:52 INFO [__main__] Train data size: 1020
2023-05-02 09:24:52 INFO [__main__] Test data size: 10
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1032 rows
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached processed dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-462aabd08dd4ec96.arrow
2023-05-02 09:24:52 INFO [__main__] Processed dataset has 1030 rows after filtering for truncated records
2023-05-02 09:24:52 INFO [__main__] Shuffling dataset
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2ef5f0cbbbb9df9d.arrow
2023-05-02 09:24:52 INFO [__main__] Done preprocessing
2023-05-02 09:24:52 WARNING [datasets.arrow_dataset] Loading cached split indices for dataset at /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-8ed636b6f0cf0253.arrow and /root/.cache/huggingface/datasets/opyate___parquet/opyate--mydataset-dolly15kFormat-noJSONSchema-66b5aa1793cd296a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-5b275d686d6d0049.arrow
2023-05-02 09:24:52 INFO [__main__] Train data size: 1020
2023-05-02 09:24:52 INFO [__main__] Test data size: 10
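
The preprocessing each rank reports above (1032 rows in, 1030 after dropping records that would be truncated at the 2048-token max length, then a shuffle and a 10-row test split) follows the usual datasets pattern. A hedged sketch, where the "text" field name and the filter predicate are assumptions about how the trainer detects truncation:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    MAX_LENGTH = 2048  # the "Found max lenth: 2048" (sic) value logged above

    tokenizer = AutoTokenizer.from_pretrained("Databricks/dolly-v2-7b")
    dataset = load_dataset("opyate/mydataset-dolly15kFormat-noJSONSchema")["train"]

    # Tokenize each record; the "text" field is an assumed schema detail.
    dataset = dataset.map(
        lambda rec: tokenizer(rec["text"], truncation=True, max_length=MAX_LENGTH)
    )

    # Drop records that hit the cap ("filtering for truncated records" above);
    # the exact predicate the trainer uses is an assumption.
    dataset = dataset.filter(lambda rec: len(rec["input_ids"]) < MAX_LENGTH)

    dataset = dataset.shuffle(seed=42)              # seed is illustrative
    split = dataset.train_test_split(test_size=10)  # --test-size 10 from the launch cmd
    train_data, test_data = split["train"], split["test"]

This reproduces the arithmetic in the log: 1030 filtered rows minus a 10-row test set leaves the 1020 training rows reported by each rank.
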
[2023-05-02 09:24:52,838] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-05-02 09:24:52 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 3
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 2
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-02 09:24:53 INFO [torch.distributed.distributed_c10d] Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
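
The NCCL process group behind these store-based barrier messages is what DeepSpeed's init_distributed establishes before training; a minimal sketch:

    import deepspeed

    # Creates the torch.distributed NCCL process group (one rank per GPU);
    # the store-based barrier lines above come from this handshake.
    deepspeed.init_distributed(dist_backend="nccl")
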
2023-05-02 09:24:54 INFO [__main__] Instantiating Trainer
2023-05-02 09:24:54 INFO [__main__] Instantiating Trainer
2023-05-02 09:24:54 INFO [__main__] Instantiating Trainer
2023-05-02 09:24:54 INFO [__main__] Instantiating Trainer
2023-05-02 09:24:54 INFO [__main__] Training
2023-05-02 09:24:54 INFO [__main__] Training
2023-05-02 09:24:54 INFO [__main__] Training
2023-05-02 09:24:54 INFO [__main__] Training
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 1
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 0
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 2
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 3
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-02 09:25:02 INFO [torch.distributed.distributed_c10d] Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/cpu_adam...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/databricks/python/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 33.10828900337219 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 33.150460958480835 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 33.173842906951904 seconds
Time to load cpu_adam op: 33.16837549209595 seconds
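
The ~33 seconds each rank spends here is DeepSpeed JIT-compiling its CPUAdam op (used for CPU optimizer offload) with ninja; the result is cached under /root/.cache/torch_extensions, which is why the utils op below reloads in milliseconds on subsequent use. One way to pay this compile cost outside the training job is to force the build in a warm-up step; a sketch, where CPUAdamBuilder is DeepSpeed's real op builder:

    # Trigger the cpu_adam JIT build ahead of training so the ~33 s compile
    # above happens once, in a warm-up step rather than inside the job.
    from deepspeed.ops.op_builder import CPUAdamBuilder

    CPUAdamBuilder().load()  # compiles (if needed) and loads the extension
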
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/utils...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /databricks/python/lib/python3.9/site-packages/torch/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /databricks/python/lib/python3.9/site-packages/torch/include/TH -isystem /databricks/python/lib/python3.9/site-packages/torch/include/THC -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/databricks/python/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.778066158294678 seconds
Loading extension module utils...
Time to load utils op: 15.824144840240479 seconds
Loading extension module utils...
Time to load utils op: 15.824395895004272 seconds
Loading extension module utils...
Time to load utils op: 15.723725318908691 seconds
Parameter Offload: Total persistent parameters: 1712128 in 258 params
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.003479480743408203 seconds
Time to load utils op: 0.0035200119018554688 seconds
Time to load utils op: 0.0037059783935546875 seconds
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003936290740966797 seconds
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/databricks/python/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
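
The GPTNeoXTokenizerFast notices above are transformers suggesting that a fast tokenizer should encode and pad in a single __call__ rather than encoding and then padding separately. A minimal illustration, reusing the tokenizer and MAX_LENGTH from the sketches earlier:

    # What the warning recommends: one __call__ that encodes and pads together.
    batch = tokenizer(
        ["first example", "a second, longer example"],
        padding=True,           # pad to the longest sequence in the batch
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
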
{'loss': 1.6039, 'learning_rate': 2.942959550338895e-07, 'epoch': 0.12}
{'loss': 1.4676, 'learning_rate': 3.828878651016684e-07, 'epoch': 0.24}
{'loss': 1.0807, 'learning_rate': 4.347108103585802e-07, 'epoch': 0.35}
{'loss': 0.7331, 'learning_rate': 4.7147977516944737e-07, 'epoch': 0.47}
{'loss': 0.4184, 'learning_rate': 5e-07, 'epoch': 0.59}
{'eval_loss': 0.6129394769668579, 'eval_runtime': 3.6325, 'eval_samples_per_second': 2.753, 'eval_steps_per_second': 0.275, 'epoch': 0.59}
[2023-05-02 09:43:47,839] [WARNING] [stage3.py:1942:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
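
This stage3 warning reports allocator cache flushes under memory pressure and itself suggests synchronized get_accelerator().empty_cache() calls. With a transformers Trainer, one hedged way to do that is a callback; a sketch (whether it helps depends on where the pressure actually comes from):

    from deepspeed.accelerator import get_accelerator
    from transformers import TrainerCallback

    class EmptyCacheCallback(TrainerCallback):
        """Flush the allocator cache on every rank at the same point in the
        step, as the stage3 warning above suggests."""
        def on_step_end(self, args, state, control, **kwargs):
            get_accelerator().empty_cache()

    # trainer.add_callback(EmptyCacheCallback())  # attach to the Trainer built earlier
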
{'loss': 0.2071, 'learning_rate': 5e-07, 'epoch': 0.71}
{'loss': 0.1072, 'learning_rate': 5e-07, 'epoch': 0.82}
{'loss': 0.1105, 'learning_rate': 5e-07, 'epoch': 0.94}
{'loss': 0.077, 'learning_rate': 5e-07, 'epoch': 1.06}
{'loss': 0.1277, 'learning_rate': 5e-07, 'epoch': 1.18}
{'eval_loss': 0.398681640625, 'eval_runtime': 3.1007, 'eval_samples_per_second': 3.225, 'eval_steps_per_second': 0.323, 'epoch': 1.18}
{'loss': 0.0476, 'learning_rate': 5e-07, 'epoch': 1.29}
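
For a rough read on these numbers: eval_loss is mean token cross-entropy in nats, so its exponential gives perplexity. For the epoch-1.18 evaluation above:

    import math

    eval_loss = 0.398681640625  # from the eval line above
    print(math.exp(eval_loss))  # ~1.49 perplexity (down from ~1.85 at epoch 0.59)
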
[2023-05-02 10:02:12,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2762
[2023-05-02 10:02:15,511] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2763
[2023-05-02 10:02:18,768] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2764
[2023-05-02 10:02:18,768] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2765
[2023-05-02 10:02:22,024] [ERROR] [launch.py:324:sigkill_handler] ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-e1f29474-0171-420e-8b03-5f4b03ffc5cd/bin/python', '-u', '-m', 'training.trainer', '--local_rank=3', '--input-model', 'Databricks/dolly-v2-7b', '--deepspeed', '/Workspace/Repos/opyate@gmail.com/dolly/config/ds_z3_bf16_config.json', '--epochs', '2', '--local-output-dir', '/local_disk0/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T09:20:32', '--dbfs-output-dir', '/dbfs/dolly_training/dolly_mydataset-dolly15kFormat-noJSONSchema__2023-05-02T09:20:32', '--per-device-train-batch-size', '3', '--per-device-eval-batch-size', '3', '--logging-steps', '10', '--save-steps', '200', '--save-total-limit', '20', '--eval-steps', '50', '--warmup-steps', '50', '--test-size', '10', '--lr', '5e-07']  exits with return code = -9
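
The run ends mid-epoch with the launcher killing all four ranks and reporting return code = -9, meaning a worker received SIGKILL. On a setup like this (a 7B model with ZeRO-3 and CPU offload across four ranks, which keeps large optimizer state in host RAM), the most common cause is the host's OOM killer rather than a training error, though this log alone cannot confirm that. A hedged mitigation sketch: shrink the per-device batch and recover the effective batch size with gradient accumulation; these are real TrainingArguments fields, but the values and output path below are guesses, not a known-good configuration:

    from transformers import TrainingArguments

    # Trade per-step memory for more accumulation steps; illustrative values.
    args = TrainingArguments(
        output_dir="/local_disk0/dolly_training/retry",  # hypothetical path
        per_device_train_batch_size=1,   # was 3 in the failed run
        gradient_accumulation_steps=3,   # keeps the same effective batch size
        bf16=True,
        deepspeed="ds_z3_bf16_config.json",
        learning_rate=5e-7,
        num_train_epochs=2,
    )
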