- CUDA_VISIBLE_DEVICES=9 bash scripts/mnist.sh
- [2021-08-09 15:11:30,513] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
- [2021-08-09 15:11:30,607] [INFO] [runner.py:360:main] cmd = /home/aumam/.conda/envs/setvae/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 train.py --kl_warmup_epochs 50 --input_dim 2 --max_outputs 400 --init_dim 32 --n_mixtures 4 --z_dim 16 --z_scales 2 4 8 16 32 --hidden_dim 64 --num_heads 4 --lr 1e-3 --beta 1e-2 --epochs 200 --dataset_type mnist --log_name gen/mnist/camera-ready --mnist_data_dir cache/mnist --resume_optimizer --save_freq 10 --viz_freq 10 --log_freq 10 --val_freq 1000 --scheduler linear --slot_att --ln --seed 42 --distributed --deepspeed_config batch_size.json
- [2021-08-09 15:11:31,350] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
- [2021-08-09 15:11:31,350] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=1, node_rank=0
- [2021-08-09 15:11:31,350] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
- [2021-08-09 15:11:31,350] [INFO] [launch.py:102:main] dist_world_size=1
- [2021-08-09 15:11:31,350] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=0
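- Note (editor's hedged aside, not part of the log): the shell exported CUDA_VISIBLE_DEVICES=9 before invoking the script, but the line above shows the DeepSpeed launcher overwriting it with 0 for the spawned worker. The variable is not cumulative; the last value set before CUDA context creation wins, so the training process targets physical GPU 0 rather than GPU 9. If GPU 0 is occupied or out of memory, the first cuBLAS call can fail exactly as in the traceback further down. A minimal stdlib sketch of the overwrite semantics:

```python
import os

# CUDA_VISIBLE_DEVICES does not nest or compose: each assignment replaces
# the previous mask. The CUDA driver reads whatever value is current when
# the process first creates a context.
os.environ["CUDA_VISIBLE_DEVICES"] = "9"   # user's intent (scripts/mnist.sh)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # launcher's override wins
print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 0 (physical GPU 0, not 9)
```

Whether this override is the actual cause of the crash here is an assumption; it is consistent with the `CUBLAS_STATUS_INTERNAL_ERROR` on the first `linear` call but is not confirmed by the log itself.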
- Arguments:
- Namespace(activation='relu', batch_size=32, beta=0.01, beta1=0.9, beta2=0.999, bn_mode='eval', cates=['airplane'], d_net='set_transformer', dataset_scale=1.0, dataset_type='mnist', dec_in_layers=0, dec_out_layers=0, deepscale=False, deepscale_config=None, deepspeed=False, deepspeed_config='batch_size.json', deepspeed_mpi=False, denormalized_loss=False, device='cuda', digits=None, dist_backend='nccl', dist_url='tcp://127.0.0.1:9991', distributed=True, dropout_p=0.0, enc_in_layers=0, epochs=200, eval=False, eval_with_train_offset=False, exp_decay=1.0, exp_decay_freq=1, fixed_gmm=False, gpu=None, hidden_dim=64, i_net='elem_mlp', i_net_layers=0, init_dim=32, input_dim=2, isab_inds=16, kl_warmup_epochs=50, ln=True, local_rank=0, log_freq=10, log_name='gen/mnist/camera-ready', lr=0.001, matcher='chamfer', max_grad_norm=5.0, max_grad_threshold=None, max_outputs=400, max_validate_shapes=None, mnist_cache=None, mnist_data_dir='cache/mnist', momentum=0.9, multimnist_cache=None, multimnist_data_dir='cache/multimnist', n_mixtures=4, no_eval_sampling=False, no_validation=False, normalize_per_shape=False, normalize_std_per_axis=False, num_heads=4, num_workers=4, optimizer='adam', rank=0, residual=False, resume=False, resume_checkpoint=None, resume_dataset_mean=None, resume_dataset_std=None, resume_non_strict=False, resume_optimizer=True, save_freq=10, save_val_results=False, scheduler='linear', seed=42, shapenet_data_dir='/data/shapenet/ShapeNetCore.v2.PC15k', slot_att=True, standardize_per_shape=False, te_max_sample_points=2048, threshold=0.0, tr_max_sample_points=2048, train_gmm=False, use_bn=False, val_freq=1000, val_recon_only=False, viz_freq=10, warmup_epochs=0, weight_decay=0.0, world_size=1, z_dim=16, z_scales=[2, 4, 8, 16, 32])
- [2021-08-09 15:11:33,027] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
- number of params: 538914
- number of generator params: 282594
- Total number of data:60000
- Max number of points: (train)342
- Total number of data:10000
- Max number of points: (test)290
- [2021-08-09 15:11:42,980] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.4, git-hash=unknown, git-branch=unknown
- [2021-08-09 15:11:42,997] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
- [2021-08-09 15:11:43,093] [INFO] [engine.py:180:__init__] DeepSpeed Flops Profiler Enabled: False
- [2021-08-09 15:11:43,093] [INFO] [engine.py:703:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
- [2021-08-09 15:11:43,094] [INFO] [engine.py:707:_configure_optimizer] Using client Optimizer as basic optimizer
- [2021-08-09 15:11:43,094] [INFO] [engine.py:717:_configure_optimizer] DeepSpeed Basic Optimizer = Adam
- [2021-08-09 15:11:43,094] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = Adam
- [2021-08-09 15:11:43,094] [INFO] [engine.py:519:_configure_lr_scheduler] DeepSpeed using client LR scheduler
- [2021-08-09 15:11:43,094] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f8b3d1e2b38>
- [2021-08-09 15:11:43,094] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
- [2021-08-09 15:11:43,094] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] activation_checkpointing_config {
- "partition_activations": false,
- "contiguous_memory_optimization": false,
- "cpu_checkpointing": false,
- "number_checkpoints": null,
- "synchronize_checkpoint_boundary": false,
- "profile": false
- }
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] allreduce_always_fp32 ........ False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] amp_enabled .................. False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] amp_params ................... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] checkpoint_tag_validation_enabled True
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] checkpoint_tag_validation_fail False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] disable_allgather ............ False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] dump_state ................... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] dynamic_loss_scale_args ...... None
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_enabled ........... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_gas_boundary_resolution 1
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_layer_name ........ bert.encoder.layer
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_layer_num ......... 0
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_max_iter .......... 100
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_stability ......... 1e-06
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_tol ............... 0.01
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] eigenvalue_verbose ........... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] elasticity_enabled ........... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] flops_profiler_config ........ {
- "enabled": false,
- "profile_step": 1,
- "module_depth": -1,
- "top_modules": 1,
- "detailed": true,
- "output_file": null
- }
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] fp16_enabled ................. False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] fp16_mixed_quantize .......... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] global_rank .................. 0
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] gradient_accumulation_steps .. 1
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] gradient_clipping ............ 0.0
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] gradient_predivide_factor .... 1.0
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] initial_dynamic_scale ........ 4294967296
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] loss_scale ................... 0
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] memory_breakdown ............. False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] optimizer_legacy_fusion ...... False
- [2021-08-09 15:11:43,095] [INFO] [config.py:904:print] optimizer_name ............... None
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] optimizer_params ............. None
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] pld_enabled .................. False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] pld_params ................... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] prescale_gradients ........... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_change_rate ......... 0.001
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_groups .............. 1
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_offset .............. 1000
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_period .............. 1000
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_rounding ............ 0
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_start_bits .......... 16
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_target_bits ......... 8
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_training_enabled .... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_type ................ 0
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] quantize_verbose ............. False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] scheduler_name ............... None
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] scheduler_params ............. None
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] sparse_attention ............. None
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] sparse_gradients_enabled ..... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] steps_per_print .............. 10
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] tensorboard_enabled .......... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] tensorboard_job_name ......... DeepSpeedJobName
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] tensorboard_output_path ......
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] train_batch_size ............. 1
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] train_micro_batch_size_per_gpu 1
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] use_quantizer_kernel ......... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] wall_clock_breakdown ......... False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] world_size ................... 1
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] zero_allow_untested_optimizer False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] zero_config .................. {
- "stage": 0,
- "contiguous_gradients": true,
- "reduce_scatter": true,
- "reduce_bucket_size": 5.000000e+08,
- "allgather_partitions": true,
- "allgather_bucket_size": 5.000000e+08,
- "overlap_comm": false,
- "load_from_fp32_weights": true,
- "elastic_checkpoint": true,
- "offload_param": null,
- "offload_optimizer": null,
- "sub_group_size": 1.000000e+09,
- "prefetch_bucket_size": 5.000000e+07,
- "param_persistence_threshold": 1.000000e+05,
- "max_live_parameters": 1.000000e+09,
- "max_reuse_distance": 1.000000e+09,
- "gather_fp16_weights_on_model_save": false,
- "ignore_unused_parameters": true,
- "round_robin_gradients": false,
- "legacy_stage1": false
- }
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] zero_enabled ................. False
- [2021-08-09 15:11:43,096] [INFO] [config.py:904:print] zero_optimization_stage ...... 0
- [2021-08-09 15:11:43,097] [INFO] [config.py:912:print] json = {
- "train_batch_size": 1
- }
- Using /home/aumam/.cache/torch_extensions as PyTorch extensions root...
- Emitting ninja build file /home/aumam/.cache/torch_extensions/utils/build.ninja...
- Building extension module utils...
- Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
- ninja: no work to do.
- Loading extension module utils...
- Time to load utils op: 0.5592496395111084 seconds
- Start epoch: 0 End epoch: 200
- Traceback (most recent call last):
- File "train.py", line 223, in <module>
- main()
- File "train.py", line 219, in main
- main_worker(save_dir, args)
- File "train.py", line 166, in main_worker
- train_one_epoch(epoch, model, criterion, optimizer, args, train_loader, avg_meters, logger)
- File "/home/aumam/dev/gan/new_setvae/setvae/engine.py", line 24, in train_one_epoch
- output = model(gt, gt_mask)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
- result = self.forward(*input, **kwargs)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1118, in forward
- loss = self.module(*inputs, **kwargs)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
- result = self.forward(*input, **kwargs)
- File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 221, in forward
- bup = self.bottom_up(x, x_mask)
- File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 183, in bottom_up
- x = self.input(x) # [B, N, D]
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
- result = self.forward(*input, **kwargs)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 94, in forward
- return F.linear(input, self.weight, self.bias)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/functional.py", line 1753, in linear
- return torch._C._nn.linear(input, weight, bias)
- RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
- Killing subprocess 43124
- Traceback (most recent call last):
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/runpy.py", line 193, in _run_module_as_main
- "__main__", mod_spec)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/runpy.py", line 85, in _run_code
- exec(code, run_globals)
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
- main()
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 161, in main
- sigkill_handler(signal.SIGTERM, None) # not coming back
- File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
- raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
- subprocess.CalledProcessError: Command '['/home/aumam/.conda/envs/setvae/bin/python', '-u', 'train.py', '--local_rank=0', '--kl_warmup_epochs', '50', '--input_dim', '2', '--max_outputs', '400', '--init_dim', '32', '--n_mixtures', '4', '--z_dim', '16', '--z_scales', '2', '4', '8', '16', '32', '--hidden_dim', '64', '--num_heads', '4', '--lr', '1e-3', '--beta', '1e-2', '--epochs', '200', '--dataset_type', 'mnist', '--log_name', 'gen/mnist/camera-ready', '--mnist_data_dir', 'cache/mnist', '--resume_optimizer', '--save_freq', '10', '--viz_freq', '10', '--log_freq', '10', '--val_freq', '1000', '--scheduler', 'linear', '--slot_att', '--ln', '--seed', '42', '--distributed', '--deepspeed_config', 'batch_size.json']' returned non-zero exit status 1.
- Done