SetVae_run-mnist (ArdianUmam, Aug 9th, 2021)
CUDA_VISIBLE_DEVICES=9 bash scripts/mnist.sh
[2021-08-09 15:11:30,513] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-08-09 15:11:30,607] [INFO] [runner.py:360:main] cmd = /home/aumam/.conda/envs/setvae/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 train.py --kl_warmup_epochs 50 --input_dim 2 --max_outputs 400 --init_dim 32 --n_mixtures 4 --z_dim 16 --z_scales 2 4 8 16 32 --hidden_dim 64 --num_heads 4 --lr 1e-3 --beta 1e-2 --epochs 200 --dataset_type mnist --log_name gen/mnist/camera-ready --mnist_data_dir cache/mnist --resume_optimizer --save_freq 10 --viz_freq 10 --log_freq 10 --val_freq 1000 --scheduler linear --slot_att --ln --seed 42 --distributed --deepspeed_config batch_size.json
[2021-08-09 15:11:31,350] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-08-09 15:11:31,350] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-08-09 15:11:31,350] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-08-09 15:11:31,350] [INFO] [launch.py:102:main] dist_world_size=1
[2021-08-09 15:11:31,350] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=0
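
Note: the job is launched with CUDA_VISIBLE_DEVICES=9, but the DeepSpeed launcher overrides this and sets CUDA_VISIBLE_DEVICES=0 for the worker (last line above), so the training process may end up on a different physical GPU than intended. A quick standalone check of what the worker actually sees (an illustrative sketch, not part of the repo):

# check_gpu.py (illustrative): print the device mapping the worker process sees.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print("device", i, ":", torch.cuda.get_device_name(i))
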
Arguments:
Namespace(activation='relu', batch_size=32, beta=0.01, beta1=0.9, beta2=0.999, bn_mode='eval', cates=['airplane'], d_net='set_transformer', dataset_scale=1.0, dataset_type='mnist', dec_in_layers=0, dec_out_layers=0, deepscale=False, deepscale_config=None, deepspeed=False, deepspeed_config='batch_size.json', deepspeed_mpi=False, denormalized_loss=False, device='cuda', digits=None, dist_backend='nccl', dist_url='tcp://127.0.0.1:9991', distributed=True, dropout_p=0.0, enc_in_layers=0, epochs=200, eval=False, eval_with_train_offset=False, exp_decay=1.0, exp_decay_freq=1, fixed_gmm=False, gpu=None, hidden_dim=64, i_net='elem_mlp', i_net_layers=0, init_dim=32, input_dim=2, isab_inds=16, kl_warmup_epochs=50, ln=True, local_rank=0, log_freq=10, log_name='gen/mnist/camera-ready', lr=0.001, matcher='chamfer', max_grad_norm=5.0, max_grad_threshold=None, max_outputs=400, max_validate_shapes=None, mnist_cache=None, mnist_data_dir='cache/mnist', momentum=0.9, multimnist_cache=None, multimnist_data_dir='cache/multimnist', n_mixtures=4, no_eval_sampling=False, no_validation=False, normalize_per_shape=False, normalize_std_per_axis=False, num_heads=4, num_workers=4, optimizer='adam', rank=0, residual=False, resume=False, resume_checkpoint=None, resume_dataset_mean=None, resume_dataset_std=None, resume_non_strict=False, resume_optimizer=True, save_freq=10, save_val_results=False, scheduler='linear', seed=42, shapenet_data_dir='/data/shapenet/ShapeNetCore.v2.PC15k', slot_att=True, standardize_per_shape=False, te_max_sample_points=2048, threshold=0.0, tr_max_sample_points=2048, train_gmm=False, use_bn=False, val_freq=1000, val_recon_only=False, viz_freq=10, warmup_epochs=0, weight_decay=0.0, world_size=1, z_dim=16, z_scales=[2, 4, 8, 16, 32])
[2021-08-09 15:11:33,027] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
number of params: 538914
number of generator params: 282594
Total number of data:60000
Max number of points: (train)342
Total number of data:10000
Max number of points: (test)290
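
Note: the dataset summary above shows variable-size point sets (at most 342 points per training digit and 290 per test digit), which is why the model is later called as model(gt, gt_mask) in the traceback: sets are padded to a common length and a mask marks the padding. A minimal sketch of that collate step, assuming the usual pad-and-mask convention (the names and the mask polarity here are illustrative, not the repo's actual code):

# Illustrative: pad a batch of variable-size 2D point sets and build a padding mask.
import torch

def pad_point_sets(point_sets):
    # point_sets: list of [n_i, 2] float tensors -> gt [B, N, 2], gt_mask [B, N]
    max_n = max(p.shape[0] for p in point_sets)
    gt = torch.zeros(len(point_sets), max_n, 2)
    gt_mask = torch.zeros(len(point_sets), max_n, dtype=torch.bool)
    for i, p in enumerate(point_sets):
        gt[i, : p.shape[0]] = p
        gt_mask[i, p.shape[0]:] = True  # True marks padded (invalid) positions in this sketch
    return gt, gt_mask
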
[2021-08-09 15:11:42,980] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.4, git-hash=unknown, git-branch=unknown
[2021-08-09 15:11:42,997] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-08-09 15:11:43,093] [INFO] [engine.py:180:__init__] DeepSpeed Flops Profiler Enabled: False
[2021-08-09 15:11:43,093] [INFO] [engine.py:703:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-08-09 15:11:43,094] [INFO] [engine.py:707:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-08-09 15:11:43,094] [INFO] [engine.py:717:_configure_optimizer] DeepSpeed Basic Optimizer = Adam
[2021-08-09 15:11:43,094] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = Adam
[2021-08-09 15:11:43,094] [INFO] [engine.py:519:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-08-09 15:11:43,094] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f8b3d1e2b38>
[2021-08-09 15:11:43,094] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2021-08-09 15:11:43,094] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   allreduce_always_fp32 ........ False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   amp_enabled .................. False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   amp_params ................... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   checkpoint_tag_validation_enabled  True
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   checkpoint_tag_validation_fail  False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   disable_allgather ............ False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   dump_state ................... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   dynamic_loss_scale_args ...... None
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_enabled ........... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_gas_boundary_resolution  1
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_layer_num ......... 0
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_max_iter .......... 100
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_stability ......... 1e-06
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_tol ............... 0.01
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   eigenvalue_verbose ........... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   elasticity_enabled ........... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   fp16_enabled ................. False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   fp16_mixed_quantize .......... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   global_rank .................. 0
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   gradient_accumulation_steps .. 1
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   gradient_clipping ............ 0.0
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   gradient_predivide_factor .... 1.0
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   initial_dynamic_scale ........ 4294967296
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   loss_scale ................... 0
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   memory_breakdown ............. False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   optimizer_legacy_fusion ...... False
[2021-08-09 15:11:43,095] [INFO] [config.py:904:print]   optimizer_name ............... None
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   optimizer_params ............. None
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   pld_enabled .................. False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   pld_params ................... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   prescale_gradients ........... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_change_rate ......... 0.001
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_groups .............. 1
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_offset .............. 1000
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_period .............. 1000
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_rounding ............ 0
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_start_bits .......... 16
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_target_bits ......... 8
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_training_enabled .... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_type ................ 0
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   quantize_verbose ............. False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   scheduler_name ............... None
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   scheduler_params ............. None
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   sparse_attention ............. None
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   sparse_gradients_enabled ..... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   steps_per_print .............. 10
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   tensorboard_enabled .......... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   tensorboard_output_path ......
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   train_batch_size ............. 1
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   train_micro_batch_size_per_gpu  1
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   use_quantizer_kernel ......... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   wall_clock_breakdown ......... False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   world_size ................... 1
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   zero_allow_untested_optimizer  False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   zero_config .................. {
    "stage": 0,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5.000000e+08,
    "allgather_partitions": true,
    "allgather_bucket_size": 5.000000e+08,
    "overlap_comm": false,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": true,
    "offload_param": null,
    "offload_optimizer": null,
    "sub_group_size": 1.000000e+09,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_fp16_weights_on_model_save": false,
    "ignore_unused_parameters": true,
    "round_robin_gradients": false,
    "legacy_stage1": false
}
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   zero_enabled ................. False
[2021-08-09 15:11:43,096] [INFO] [config.py:904:print]   zero_optimization_stage ...... 0
[2021-08-09 15:11:43,097] [INFO] [config.py:912:print]   json = {
    "train_batch_size": 1
}
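
Note: the effective DeepSpeed configuration is just the one-line batch_size.json echoed above ({"train_batch_size": 1}); there is no optimizer, scheduler, FP16, or ZeRO section, so DeepSpeed wraps the client Adam optimizer and LambdaLR scheduler reported earlier. A sketch of how such a minimal config is typically consumed via deepspeed.initialize (the dummy model and optimizer below are placeholders, not SetVAE's):

# Illustrative: wrap a model with DeepSpeed using only batch_size.json.
# Run under the launcher, e.g.: deepspeed demo.py --deepspeed_config batch_size.json
import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parser = deepspeed.add_config_arguments(parser)   # adds --deepspeed_config, --deepspeed, ...
args = parser.parse_args()

model = torch.nn.Linear(2, 64)                    # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)

# DeepSpeed keeps the client optimizer and scheduler, as the engine log above shows.
model_engine, optimizer, _, scheduler = deepspeed.initialize(
    args=args, model=model, optimizer=optimizer, lr_scheduler=scheduler
)
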
Using /home/aumam/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/aumam/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5592496395111084 seconds
Start epoch: 0 End epoch: 200
Traceback (most recent call last):
  File "train.py", line 223, in <module>
    main()
  File "train.py", line 219, in main
    main_worker(save_dir, args)
  File "train.py", line 166, in main_worker
    train_one_epoch(epoch, model, criterion, optimizer, args, train_loader, avg_meters, logger)
  File "/home/aumam/dev/gan/new_setvae/setvae/engine.py", line 24, in train_one_epoch
    output = model(gt, gt_mask)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1118, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 221, in forward
    bup = self.bottom_up(x, x_mask)
  File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 183, in bottom_up
    x = self.input(x)  # [B, N, D]
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
Killing subprocess 43124
Traceback (most recent call last):
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/aumam/.conda/envs/setvae/bin/python', '-u', 'train.py', '--local_rank=0', '--kl_warmup_epochs', '50', '--input_dim', '2', '--max_outputs', '400', '--init_dim', '32', '--n_mixtures', '4', '--z_dim', '16', '--z_scales', '2', '4', '8', '16', '32', '--hidden_dim', '64', '--num_heads', '4', '--lr', '1e-3', '--beta', '1e-2', '--epochs', '200', '--dataset_type', 'mnist', '--log_name', 'gen/mnist/camera-ready', '--mnist_data_dir', 'cache/mnist', '--resume_optimizer', '--save_freq', '10', '--viz_freq', '10', '--log_freq', '10', '--val_freq', '1000', '--scheduler', 'linear', '--slot_att', '--ln', '--seed', '42', '--distributed', '--deepspeed_config', 'batch_size.json']' returned non-zero exit status 1.
Done
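
Note: the run fails in the very first forward pass: F.linear on the model's input layer triggers cuBLAS handle creation, and cublasCreate(handle) returns CUBLAS_STATUS_INTERNAL_ERROR. This error usually points at the environment rather than train.py: a CUDA driver/runtime mismatch, a GPU that is out of memory or held by another process, or the worker landing on a different physical GPU than intended (recall the launcher reset CUDA_VISIBLE_DEVICES=0 near the top of the log). A minimal standalone check that exercises the same cuBLAS path (an illustrative sketch, not from the repo):

# Illustrative: force cuBLAS handle creation with a tiny linear op,
# the same code path that fails in the traceback above.
import torch

print(torch.__version__, "CUDA runtime", torch.version.cuda)
print("visible devices:", torch.cuda.device_count())

x = torch.randn(4, 2, device="cuda")
w = torch.randn(64, 2, device="cuda")
y = torch.nn.functional.linear(x, w)  # first matmul on the device creates the cuBLAS handle
torch.cuda.synchronize()
print("cuBLAS OK, output shape:", tuple(y.shape))
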