import argparse
import logging

from pytorch_lightning.strategies import DeepSpeedStrategy

parser = argparse.ArgumentParser()
parser.add_argument("--strategy", default=DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    offload_parameters=False,
    logging_level=logging.INFO,  # documented type is an int logging level
))
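Presumably the parsed default is then forwarded straight to the Trainer; a minimal sketch of that wiring (the accelerator/devices flags are assumptions, not taken from the log):

import pytorch_lightning as pl

args = parser.parse_args()
trainer = pl.Trainer(accelerator="gpu", devices=1, strategy=args.strategy)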
Global seed set to 8653745
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/neil/.pyvenv/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:131: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
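That warning is only advisory: Lightning skips the val loop because the LightningModule defines no validation_step. A sketch of the hook that would re-enable it (the module name and batch layout are assumptions):

class LM(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        # causal-LM loss: labels are the inputs shifted internally by the model
        outputs = self.model(**batch, labels=batch["input_ids"])
        self.log("val_loss", outputs.loss, prog_bar=True)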
/home/neil/.pyvenv/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:412: LightningDeprecationWarning: `LightningDataModule.on_save_checkpoint` was deprecated in v1.6 and will be removed in v1.8. Use `state_dict` instead.
  rank_zero_deprecation(
/home/neil/.pyvenv/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:417: LightningDeprecationWarning: `LightningDataModule.on_load_checkpoint` was deprecated in v1.6 and will be removed in v1.8. Use `load_state_dict` instead.
  rank_zero_deprecation(
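Both deprecations point at the same migration on the LightningDataModule; the v1.6+ replacement hooks look roughly like this (what state is carried is an assumption):

class WikitextDataModule(pl.LightningDataModule):
    def state_dict(self):
        # replaces on_save_checkpoint, removed in v1.8
        return {"seed": self.seed}

    def load_state_dict(self, state_dict):
        # replaces on_load_checkpoint
        self.seed = state_dict["seed"]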
Global seed set to 8653745
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[2022-07-10 10:52:56,013] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
[2022-07-10 10:52:56,015] [WARNING] [deepspeed.py:647:_auto_select_batch_size] Tried to infer the batch size for internal deepspeed logging from the `train_dataloader()`. To ensure DeepSpeed logging remains correct, please manually pass the plugin with the batch size, `Trainer(strategy=DeepSpeedStrategy(logging_batch_size_per_gpu=batch_size))`.
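The warning spells out its own fix: pass the micro-batch size to the strategy explicitly so DeepSpeed's internal throughput logging is correct (batch_size here stands for whatever train_dataloader() uses):

strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    logging_batch_size_per_gpu=batch_size,  # silences the auto-inference warning
)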
Reusing dataset wikitext (/home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 648.34it/s]
Parameter 'function'=<function Dataset.map.<locals>.decorate.<locals>.decorated at 0x7f4be7a3b790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
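The fingerprinting warning fires because the function handed to Dataset.map is a closure that dill cannot hash, so caching falls back to a random fingerprint. The usual remedy is a module-level function with its context passed explicitly; a sketch (the tokenizer call is an assumption about what the script maps):

def tokenize(batch, tokenizer):
    return tokenizer(batch["text"])

dataset = dataset.map(tokenize, batched=True, fn_kwargs={"tokenizer": tokenizer})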
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-8d4c9428789cfa50.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-1a6d3236afea204a.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-ff14772e12a6fc92.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-3efdba240770b126.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-090d3d24d784f74e.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-e0650e6b0992b455.arrow
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 2651M total params, 128M largest layer params.
  per CPU  |  per GPU |   Options
   66.67GB |   0.48GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
   66.67GB |   0.48GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
   59.26GB |   5.42GB | offload_param=none, offload_optimizer=cpu , zero_init=1
   59.26GB |   5.42GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.72GB |  44.93GB | offload_param=none, offload_optimizer=none, zero_init=1
   14.82GB |  44.93GB | offload_param=none, offload_optimizer=none, zero_init=0
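The table is DeepSpeed's built-in memory estimator, and 2651M total params matches GPT-Neo 2.7B. It can be reproduced standalone (module path as in DeepSpeed 0.6.x, per the version logged below; the checkpoint name is an assumption):

from transformers import GPTNeoForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

Note that this run's settings (optimizer offloaded to CPU, parameters kept on GPU) land on the 59.26GB-per-CPU rows, right at the edge of the ~63 GB of host RAM implied by the percentage readings further down.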
[2022-07-10 10:52:58,683] [INFO] [utils.py:828:see_memory_usage] after setup
[2022-07-10 10:52:58,683] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-10 10:52:58,683] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 16.12 GB, percent = 25.7%
[2022-07-10 10:52:58,687] [INFO] [partition_parameters.py:463:__exit__] finished initializing model with 0.00B parameters
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using /home/neil/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/neil/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.200301170349121 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.001000, betas=(0.900000, 0.999000), weight_decay=0.000500, adam_w=1
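The printed Config corresponds to DeepSpeedCPUAdam's constructor arguments (alpha is the learning rate; adam_w=1 means AdamW-style decoupled weight decay). The equivalent explicit construction, as it might appear in configure_optimizers (a sketch; the hyperparameters are read off the log line above):

from deepspeed.ops.adam import DeepSpeedCPUAdam

class LM(pl.LightningModule):
    def configure_optimizers(self):
        return DeepSpeedCPUAdam(
            self.parameters(),
            lr=1e-3,             # alpha=0.001000
            betas=(0.9, 0.999),
            weight_decay=5e-4,
            adamw_mode=True,     # adam_w=1
        )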
[2022-07-10 10:53:03,333] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-10 10:53:04,308] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-07-10 10:53:04,308] [INFO] [engine.py:1086:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-07-10 10:53:04,308] [INFO] [engine.py:1092:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-07-10 10:53:04,329] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2022-07-10 10:53:04,329] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2022-07-10 10:53:04,329] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-07-10 10:53:04,329] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2022-07-10 10:53:04,331] [INFO] [stage3.py:275:__init__] Reduce bucket size 200000000
[2022-07-10 10:53:04,331] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
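Both bucket sizes are the strategy defaults. The reduce and allgather buckets are exposed as top-level kwargs on DeepSpeedStrategy if they need tuning; the prefetch size comes from the generated ZeRO config rather than a direct kwarg (a sketch, values read off the log):

strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    reduce_bucket_size=200_000_000,     # "Reduce bucket size 200000000"
    allgather_bucket_size=200_000_000,
)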
Using /home/neil/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Emitting ninja build file /home/neil/.cache/torch_extensions/py38_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3200857639312744 seconds
[2022-07-10 10:53:16,734] [INFO] [stage3.py:567:_setup_for_real_optimizer] optimizer state initialized
[2022-07-10 10:53:17,465] [INFO] [utils.py:828:see_memory_usage] After initializing ZeRO optimizer
[2022-07-10 10:53:17,466] [INFO] [utils.py:829:see_memory_usage] MA 10.75 GB Max_MA 11.71 GB CA 16.79 GB Max_CA 17 GB
[2022-07-10 10:53:17,466] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 57.62 GB, percent = 91.9%
[2022-07-10 10:53:17,466] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2022-07-10 10:53:17,466] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2022-07-10 10:53:17,466] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-07-10 10:53:17,466] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2022-07-10 10:53:17,467] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] amp_enabled .................. False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] amp_params ................... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": null,
    "exps_dir": null,
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] bfloat16_enabled ............. False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] checkpoint_tag_validation_enabled True
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] checkpoint_tag_validation_fail False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] communication_data_type ...... None
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] curriculum_enabled ........... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] curriculum_params ............ False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] dataloader_drop_last ......... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] disable_allgather ............ False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] dump_state ................... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] dynamic_loss_scale_args ...... None
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] eigenvalue_enabled ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_gas_boundary_resolution 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_layer_name ........ bert.encoder.layer
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_layer_num ......... 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_max_iter .......... 100
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_stability ......... 1e-06
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_tol ............... 0.01
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_verbose ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] elasticity_enabled ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] fp16_enabled ................. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] fp16_master_weights_and_gradients False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] fp16_mixed_quantize .......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] global_rank .................. 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] gradient_accumulation_steps .. 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] gradient_clipping ............ 0.0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] gradient_predivide_factor .... 1.0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] initial_dynamic_scale ........ 4294967296
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] loss_scale ................... 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] memory_breakdown ............. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] optimizer_legacy_fusion ...... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] optimizer_name ............... None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] optimizer_params ............. None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] pld_enabled .................. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] pld_params ................... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] prescale_gradients ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_change_rate ......... 0.001
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_groups .............. 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_offset .............. 1000
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_period .............. 1000
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_rounding ............ 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_start_bits .......... 16
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_target_bits ......... 8
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_training_enabled .... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_type ................ 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_verbose ............. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] scheduler_name ............... None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] scheduler_params ............. None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] sparse_attention ............. None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] sparse_gradients_enabled ..... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] steps_per_print .............. 10
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] tensorboard_enabled .......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] tensorboard_output_path ......
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] train_batch_size ............. 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] train_micro_batch_size_per_gpu 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] use_quantizer_kernel ......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] wall_clock_breakdown ......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] world_size ................... 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_allow_untested_optimizer True
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_config .................. {
    "stage": 3,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2.000000e+08,
    "allgather_partitions": true,
    "allgather_bucket_size": 2.000000e+08,
    "overlap_comm": true,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": false,
    "offload_param": null,
    "offload_optimizer": {
        "device": "cpu",
        "nvme_path": "/local_nvme",
        "buffer_count": 4,
        "pin_memory": false,
        "pipeline_read": false,
        "pipeline_write": false,
        "fast_init": false,
        "pipeline": false
    },
    "sub_group_size": 1.000000e+12,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_16bit_weights_on_model_save": false,
    "ignore_unused_parameters": true,
    "round_robin_gradients": false,
    "legacy_stage1": false
}
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_enabled ................. True
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_optimization_stage ...... 3
[2022-07-10 10:53:17,469] [INFO] [config.py:1065:print] json = {
    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 2.000000e+08,
        "reduce_bucket_size": 2.000000e+08,
        "sub_group_size": 1.000000e+12,
        "offload_optimizer": {
            "device": "cpu",
            "nvme_path": "/local_nvme",
            "buffer_count": 4,
            "pin_memory": false
        }
    },
    "activation_checkpointing": {
        "partition_activations": false,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": false,
        "synchronize_checkpoint_boundary": false
    },
    "aio": {
        "block_size": 1.048576e+06,
        "queue_depth": 8,
        "single_submit": false,
        "overlap_events": true,
        "thread_count": 1
    },
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 0.0
}
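The json block above is exactly what DeepSpeedStrategy generated from its keyword arguments; the same run could be pinned down by saving that block to a file and pointing the strategy at it instead (the file name is an assumption):

strategy = DeepSpeedStrategy(config="ds_config.json")  # file containing the json block printed above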
Using /home/neil/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00023055076599121094 seconds

  | Name  | Type              | Params
--------------------------------------------
0 | model | GPTNeoForCausalLM | 0
--------------------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)
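(The all-zero summary is expected under ZeRO stage 3, not a bug: parameters are partitioned/deferred at init, matching the earlier "finished initializing model with 0.00B parameters" line, so Lightning's summary has nothing local to count.)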
Epoch 0:   0%|          | 0/18667 [00:00<?, ?it/s]
[2022-07-10 10:53:21,611] [INFO] [utils.py:828:see_memory_usage] before forward
[2022-07-10 10:53:21,611] [INFO] [utils.py:829:see_memory_usage] MA 10.75 GB Max_MA 10.75 GB CA 10.76 GB Max_CA 17 GB
[2022-07-10 10:53:21,612] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 59.96 GB, percent = 95.6%
[2022-07-10 10:53:22,068] [INFO] [utils.py:828:see_memory_usage] before backward
[2022-07-10 10:53:22,069] [INFO] [utils.py:829:see_memory_usage] MA 11.93 GB Max_MA 12.39 GB CA 12.43 GB Max_CA 12 GB
[2022-07-10 10:53:22,069] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 59.98 GB, percent = 95.7%
[2022-07-10 10:53:22,177] [INFO] [utils.py:828:see_memory_usage] before optimizer
[2022-07-10 10:53:22,178] [INFO] [utils.py:829:see_memory_usage] MA 11.91 GB Max_MA 11.93 GB CA 12.43 GB Max_CA 12 GB
[2022-07-10 10:53:22,178] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 59.98 GB, percent = 95.7%
Killed
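(With CPU virtual memory already at 95.7% before the first optimizer step, and the estimator above predicting ~59 GB of host RAM for this offload configuration, the bare "Killed" is almost certainly the Linux OOM killer ending the process; GPU memory was fine at ~12 GB.)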