2020-01-14 23:40:32,422 - INFO - Starting epoch 0
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_176_1132937539>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    def handler(signum, frame):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 177) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NeMo/jasper.py", line 342, in <module>
    main()
  File "NeMo/jasper.py", line 338, in main
    stop_on_nan_loss=args.stop_on_nan_loss)
  File "/home/jovyan/libs/nemo/core/neural_factory.py", line 616, in train
    gradient_predivide=gradient_predivide)
  File "/home/jovyan/libs/nemo/backends/pytorch/actions.py", line 1405, in train
    for _, data in enumerate(train_dataloader, 0):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 771, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 737, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 177, 178, 180) exited unexpectedly
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'NeMo/jasper.py', '--local_rank=0', '--max_steps', '200000', '--model_config', 'NeMo/jasper10x5_ru.yaml']' returned non-zero exit status 1.
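Note on the failure: the root cause is the first RuntimeError. DataLoader workers hand batches back to the parent process through POSIX shared memory (/dev/shm), and the bus error means a worker ran out of it. Inside a Docker container the default /dev/shm is only 64 MB, so the usual fix is to relaunch the container with a larger limit (e.g. docker run --shm-size=8g ... or --ipc=host); the paste does not show how the container was started, so that part is an assumption. If the shm limit cannot be raised, the sharing can also be worked around from Python. A minimal sketch with a plain PyTorch DataLoader follows; the dataset is a placeholder, not the NeMo/Jasper one from the log:

    import torch
    import torch.multiprocessing
    from torch.utils.data import DataLoader, TensorDataset

    # Workaround 1: share tensors via temporary files instead of /dev/shm
    # file descriptors. Slower, but not bounded by the shm mount size.
    torch.multiprocessing.set_sharing_strategy('file_system')

    dataset = TensorDataset(torch.randn(1024, 64))  # placeholder data

    # Workaround 2: num_workers=0 loads batches in the main process, so no
    # cross-process tensor sharing (and no shm pressure) happens at all.
    loader = DataLoader(dataset, batch_size=32, num_workers=0)

    for (batch,) in loader:
        pass  # the training step would consume `batch` here

Either knob alone is usually enough; raising --shm-size is preferred in practice, since multi-worker loading is what keeps the GPUs fed.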