import argparse
import logging

from pytorch_lightning.strategies import DeepSpeedStrategy

parser = argparse.ArgumentParser()
parser.add_argument("--strategy", default=DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    offload_parameters=False,
    logging_level=logging.INFO,  # documented type is an int logging level
))
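Presumably the parsed default is then forwarded straight to the Trainer; a minimal sketch of that wiring (the accelerator/devices flags are assumptions, not taken from the log):

import pytorch_lightning as pl

args = parser.parse_args()
trainer = pl.Trainer(accelerator="gpu", devices=1, strategy=args.strategy)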
Global seed set to 8653745
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/neil/.pyvenv/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:131: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
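That warning is only advisory: Lightning skips the val loop because the LightningModule defines no validation_step. A sketch of the hook that would re-enable it (the module name and batch layout are assumptions):

class LM(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        # causal-LM loss: labels are the inputs shifted internally by the model
        outputs = self.model(**batch, labels=batch["input_ids"])
        self.log("val_loss", outputs.loss, prog_bar=True)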
/home/neil/.pyvenv/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:412: LightningDeprecationWarning: `LightningDataModule.on_save_checkpoint` was deprecated in v1.6 and will be removed in v1.8. Use `state_dict` instead.
  rank_zero_deprecation(
/home/neil/.pyvenv/ml/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:417: LightningDeprecationWarning: `LightningDataModule.on_load_checkpoint` was deprecated in v1.6 and will be removed in v1.8. Use `load_state_dict` instead.
  rank_zero_deprecation(
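Both deprecations point at the same migration on the LightningDataModule; the v1.6+ replacement hooks look roughly like this (what state is carried is an assumption):

class WikitextDataModule(pl.LightningDataModule):
    def state_dict(self):
        # replaces on_save_checkpoint, removed in v1.8
        return {"seed": self.seed}

    def load_state_dict(self, state_dict):
        # replaces on_load_checkpoint
        self.seed = state_dict["seed"]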
Global seed set to 8653745
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[2022-07-10 10:52:56,013] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
[2022-07-10 10:52:56,015] [WARNING] [deepspeed.py:647:_auto_select_batch_size] Tried to infer the batch size for internal deepspeed logging from the `train_dataloader()`. To ensure DeepSpeed logging remains correct, please manually pass the plugin with the batch size, `Trainer(strategy=DeepSpeedStrategy(logging_batch_size_per_gpu=batch_size))`.
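The warning spells out its own fix: pass the micro-batch size to the strategy explicitly so DeepSpeed's internal throughput logging is correct (batch_size here stands for whatever train_dataloader() uses):

strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    logging_batch_size_per_gpu=batch_size,  # silences the auto-inference warning
)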
Reusing dataset wikitext (/home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 648.34it/s]
Parameter 'function'=<function Dataset.map.<locals>.decorate.<locals>.decorated at 0x7f4be7a3b790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
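The fingerprinting warning fires because the function handed to Dataset.map is a closure that dill cannot hash, so caching falls back to a random fingerprint. The usual remedy is a module-level function with its context passed explicitly; a sketch (the tokenizer call is an assumption about what the script maps):

def tokenize(batch, tokenizer):
    return tokenizer(batch["text"])

dataset = dataset.map(tokenize, batched=True, fn_kwargs={"tokenizer": tokenizer})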
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-8d4c9428789cfa50.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-1a6d3236afea204a.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-ff14772e12a6fc92.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-3efdba240770b126.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-090d3d24d784f74e.arrow
Loading cached processed dataset at /home/neil/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-e0650e6b0992b455.arrow
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 2651M total params, 128M largest layer params.
  per CPU  |  per GPU |   Options
   66.67GB |   0.48GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
   66.67GB |   0.48GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
   59.26GB |   5.42GB | offload_param=none, offload_optimizer=cpu , zero_init=1
   59.26GB |   5.42GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.72GB |  44.93GB | offload_param=none, offload_optimizer=none, zero_init=1
   14.82GB |  44.93GB | offload_param=none, offload_optimizer=none, zero_init=0
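The table is DeepSpeed's built-in memory estimator, and 2651M total params matches GPT-Neo 2.7B. It can be reproduced standalone (module path as in DeepSpeed 0.6.x, per the version logged below; the checkpoint name is an assumption):

from transformers import GPTNeoForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

Note that this run's settings (optimizer offloaded to CPU, parameters kept on GPU) land on the 59.26GB-per-CPU rows, right at the edge of the ~63 GB of host RAM implied by the percentage readings further down.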
[2022-07-10 10:52:58,683] [INFO] [utils.py:828:see_memory_usage] after setup
[2022-07-10 10:52:58,683] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-10 10:52:58,683] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 16.12 GB, percent = 25.7%
[2022-07-10 10:52:58,687] [INFO] [partition_parameters.py:463:__exit__] finished initializing model with 0.00B parameters
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using /home/neil/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/neil/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.200301170349121 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.001000, betas=(0.900000, 0.999000), weight_decay=0.000500, adam_w=1
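The printed Config corresponds to DeepSpeedCPUAdam's constructor arguments (alpha is the learning rate; adam_w=1 means AdamW-style decoupled weight decay). The equivalent explicit construction, as it might appear in configure_optimizers (a sketch; the hyperparameters are read off the log line above):

from deepspeed.ops.adam import DeepSpeedCPUAdam

class LM(pl.LightningModule):
    def configure_optimizers(self):
        return DeepSpeedCPUAdam(
            self.parameters(),
            lr=1e-3,             # alpha=0.001000
            betas=(0.9, 0.999),
            weight_decay=5e-4,
            adamw_mode=True,     # adam_w=1
        )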
[2022-07-10 10:53:03,333] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-10 10:53:04,308] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-07-10 10:53:04,308] [INFO] [engine.py:1086:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-07-10 10:53:04,308] [INFO] [engine.py:1092:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-07-10 10:53:04,329] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2022-07-10 10:53:04,329] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2022-07-10 10:53:04,329] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-07-10 10:53:04,329] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2022-07-10 10:53:04,331] [INFO] [stage3.py:275:__init__] Reduce bucket size 200000000
[2022-07-10 10:53:04,331] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
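Both bucket sizes are the strategy defaults. The reduce and allgather buckets are exposed as top-level kwargs on DeepSpeedStrategy if they need tuning; the prefetch size comes from the generated ZeRO config rather than a direct kwarg (a sketch, values read off the log):

strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    reduce_bucket_size=200_000_000,     # "Reduce bucket size 200000000"
    allgather_bucket_size=200_000_000,
)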
Using /home/neil/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Emitting ninja build file /home/neil/.cache/torch_extensions/py38_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3200857639312744 seconds
[2022-07-10 10:53:16,734] [INFO] [stage3.py:567:_setup_for_real_optimizer] optimizer state initialized
[2022-07-10 10:53:17,465] [INFO] [utils.py:828:see_memory_usage] After initializing ZeRO optimizer
[2022-07-10 10:53:17,466] [INFO] [utils.py:829:see_memory_usage] MA 10.75 GB Max_MA 11.71 GB CA 16.79 GB Max_CA 17 GB
[2022-07-10 10:53:17,466] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 57.62 GB, percent = 91.9%
[2022-07-10 10:53:17,466] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2022-07-10 10:53:17,466] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2022-07-10 10:53:17,466] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-07-10 10:53:17,466] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2022-07-10 10:53:17,467] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] amp_enabled .................. False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] amp_params ................... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": null,
    "exps_dir": null,
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] bfloat16_enabled ............. False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] checkpoint_tag_validation_enabled True
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] checkpoint_tag_validation_fail False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] communication_data_type ...... None
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] curriculum_enabled ........... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] curriculum_params ............ False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] dataloader_drop_last ......... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] disable_allgather ............ False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] dump_state ................... False
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] dynamic_loss_scale_args ...... None
[2022-07-10 10:53:17,467] [INFO] [config.py:1063:print] eigenvalue_enabled ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_gas_boundary_resolution 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_layer_name ........ bert.encoder.layer
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_layer_num ......... 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_max_iter .......... 100
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_stability ......... 1e-06
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_tol ............... 0.01
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] eigenvalue_verbose ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] elasticity_enabled ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] fp16_enabled ................. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] fp16_master_weights_and_gradients False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] fp16_mixed_quantize .......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] global_rank .................. 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] gradient_accumulation_steps .. 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] gradient_clipping ............ 0.0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] gradient_predivide_factor .... 1.0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] initial_dynamic_scale ........ 4294967296
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] loss_scale ................... 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] memory_breakdown ............. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] optimizer_legacy_fusion ...... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] optimizer_name ............... None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] optimizer_params ............. None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] pld_enabled .................. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] pld_params ................... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] prescale_gradients ........... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_change_rate ......... 0.001
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_groups .............. 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_offset .............. 1000
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_period .............. 1000
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_rounding ............ 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_start_bits .......... 16
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_target_bits ......... 8
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_training_enabled .... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_type ................ 0
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] quantize_verbose ............. False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] scheduler_name ............... None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] scheduler_params ............. None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] sparse_attention ............. None
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] sparse_gradients_enabled ..... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] steps_per_print .............. 10
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] tensorboard_enabled .......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] tensorboard_output_path ......
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] train_batch_size ............. 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] train_micro_batch_size_per_gpu 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] use_quantizer_kernel ......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] wall_clock_breakdown ......... False
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] world_size ................... 1
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_allow_untested_optimizer True
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_config .................. {
    "stage": 3,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2.000000e+08,
    "allgather_partitions": true,
    "allgather_bucket_size": 2.000000e+08,
    "overlap_comm": true,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": false,
    "offload_param": null,
    "offload_optimizer": {
        "device": "cpu",
        "nvme_path": "/local_nvme",
        "buffer_count": 4,
        "pin_memory": false,
        "pipeline_read": false,
        "pipeline_write": false,
        "fast_init": false,
        "pipeline": false
    },
    "sub_group_size": 1.000000e+12,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_16bit_weights_on_model_save": false,
    "ignore_unused_parameters": true,
    "round_robin_gradients": false,
    "legacy_stage1": false
}
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_enabled ................. True
[2022-07-10 10:53:17,468] [INFO] [config.py:1063:print] zero_optimization_stage ...... 3
[2022-07-10 10:53:17,469] [INFO] [config.py:1065:print] json = {
    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 2.000000e+08,
        "reduce_bucket_size": 2.000000e+08,
        "sub_group_size": 1.000000e+12,
        "offload_optimizer": {
            "device": "cpu",
            "nvme_path": "/local_nvme",
            "buffer_count": 4,
            "pin_memory": false
        }
    },
    "activation_checkpointing": {
        "partition_activations": false,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": false,
        "synchronize_checkpoint_boundary": false
    },
    "aio": {
        "block_size": 1.048576e+06,
        "queue_depth": 8,
        "single_submit": false,
        "overlap_events": true,
        "thread_count": 1
    },
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 0.0
}
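The json block above is exactly what DeepSpeedStrategy generated from its keyword arguments; the same run could be pinned down by saving that block to a file and pointing the strategy at it instead (the file name is an assumption):

strategy = DeepSpeedStrategy(config="ds_config.json")  # file containing the json block printed above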
Using /home/neil/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00023055076599121094 seconds

  | Name  | Type              | Params
--------------------------------------------
0 | model | GPTNeoForCausalLM | 0
--------------------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)
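(The all-zero summary is expected under ZeRO stage 3, not a bug: parameters are partitioned/deferred at init, matching the earlier "finished initializing model with 0.00B parameters" line, so Lightning's summary has nothing local to count.)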
Epoch 0:   0%|          | 0/18667 [00:00<?, ?it/s]
[2022-07-10 10:53:21,611] [INFO] [utils.py:828:see_memory_usage] before forward
[2022-07-10 10:53:21,611] [INFO] [utils.py:829:see_memory_usage] MA 10.75 GB Max_MA 10.75 GB CA 10.76 GB Max_CA 17 GB
[2022-07-10 10:53:21,612] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 59.96 GB, percent = 95.6%
[2022-07-10 10:53:22,068] [INFO] [utils.py:828:see_memory_usage] before backward
[2022-07-10 10:53:22,069] [INFO] [utils.py:829:see_memory_usage] MA 11.93 GB Max_MA 12.39 GB CA 12.43 GB Max_CA 12 GB
[2022-07-10 10:53:22,069] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 59.98 GB, percent = 95.7%
[2022-07-10 10:53:22,177] [INFO] [utils.py:828:see_memory_usage] before optimizer
[2022-07-10 10:53:22,178] [INFO] [utils.py:829:see_memory_usage] MA 11.91 GB Max_MA 11.93 GB CA 12.43 GB Max_CA 12 GB
[2022-07-10 10:53:22,178] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 59.98 GB, percent = 95.7%
Killed
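(With CPU virtual memory already at 95.7% before the first optimizer step, and the estimator above predicting ~59 GB of host RAM for this offload configuration, the bare "Killed" is almost certainly the Linux OOM killer ending the process; GPU memory was fine at ~12 GB.)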