- (procgen) ➜ procgen git:(master) ✗ ./run.sh --train
- _____ _
- /\ |_ _| | |
- / \ | | ___ _ __ _____ ____| |
- / /\ \ | | / __| '__/ _ \ \ /\ / / _ |
- / ____ \ _| || (__| | | (_) \ V V / (_| |
- /_/ \_\_____\___|_| \___/ \_/\_/ \__,_|
- Executing: python train.py -f experiments/impala-stacked-2-cpus.yaml --ray-memory 55000000 --ray-num-cpus 2 --ray-object-store-memory 80000000
- {'contrib/RandomAgent': <function _import_random_agent at 0x7f4cd168f048>, 'contrib/MADDPG': <function _import_maddpg at 0x7f4cd168f0d0>, 'contrib/AlphaZero': <function _import_alphazero at 0x7f4cd168f158>, 'contrib/LinTS': <function _import_bandit_lints at 0x7f4cd168f1e0>, 'contrib/LinUCB': <function _import_bandit_linucb at 0x7f4cd168f268>}
- 2020-07-30 11:17:04,064 INFO resource_spec.py:212 -- Starting Ray with 0.05 GiB memory available for workers and up to 0.07 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
- 2020-07-30 11:17:04,433 INFO services.py:1170 -- View the Ray dashboard at localhost:8265
- == Status ==
- Memory usage on this node: 1.5/60.0 GiB
- Using FIFO scheduling algorithm.
- Resources requested: 2.0/2 CPUs, 0.7/1 GPUs, 0.0/0.05 GiB heap, 0.0/0.05 GiB objects
- Result logdir: /home/ubuntu/ray_results/stacked_and_batch_size4k
- Number of trials: 1 (1 RUNNING)
- +-------------------------------+----------+-------+
- | Trial name | status | loc |
- |-------------------------------+----------+-------|
- | PPO_stacked_procgen_env_00000 | RUNNING | |
- +-------------------------------+----------+-------+
- (pid=16904) 2020-07-30 11:17:07,477 INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
- (pid=16904) 2020-07-30 11:17:07,503 INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
- (pid=16904) 2020-07-30 11:17:12,894 INFO trainable.py:217 -- Getting current IP.
- (pid=16904) 2020-07-30 11:17:12,895 WARNING util.py:37 -- Install gputil for GPU system monitoring.
- 2020-07-30 11:17:13,356 WARNING worker.py:1090 -- WARNING: 6 PYTHON workers have been started. This could be a result of using a large number of actors, or it could be a consequence of using nested tasks (see https://github.com/ray-project/ray/issues/3644) for some a discussion of workarounds.
- (pid=16967) E0730 11:17:25.027956 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16967) Waiting 1000ms for space to free up...
- 2020-07-30 11:17:25,205 INFO (unknown file):0 -- gc.collect() freed 84 refs in 0.14950760600004287 seconds
- (pid=16904) 2020-07-30 11:17:25,178 INFO (unknown file):0 -- gc.collect() freed 3 refs in 0.12236754700006713 seconds
- (pid=16966) E0730 11:17:25.197851 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16966) Waiting 1000ms for space to free up...
- (pid=16903) E0730 11:17:25.531816 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16903) Waiting 1000ms for space to free up...
- (pid=16991) E0730 11:17:25.914304 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16991) Waiting 1000ms for space to free up...
- (pid=16967) E0730 11:17:26.028805 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16967) Waiting 2000ms for space to free up...
- (pid=16966) E0730 11:17:26.198580 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16966) Waiting 2000ms for space to free up...
- (pid=16903) E0730 11:17:26.532692 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16903) Waiting 2000ms for space to free up...
- (pid=16991) E0730 11:17:26.915047 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16991) Waiting 2000ms for space to free up...
- (pid=16967) E0730 11:17:28.029335 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16967) Waiting 4000ms for space to free up...
- (pid=16966) E0730 11:17:28.198992 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16966) Waiting 4000ms for space to free up...
- (pid=16903) E0730 11:17:28.534523 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16903) Waiting 4000ms for space to free up...
- (pid=16991) E0730 11:17:28.915459 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16991) Waiting 4000ms for space to free up...
- (pid=16967) E0730 11:17:32.029805 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16967) Waiting 8000ms for space to free up...
- (pid=16966) E0730 11:17:32.199393 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16966) Waiting 8000ms for space to free up...
- (pid=16903) E0730 11:17:32.535001 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16903) Waiting 8000ms for space to free up...
- (pid=16991) E0730 11:17:32.915891 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16991) Waiting 8000ms for space to free up...
- (pid=16967) E0730 11:17:40.030243 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16967) Waiting 16000ms for space to free up...
- (pid=16966) E0730 11:17:40.199820 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16966) Waiting 16000ms for space to free up...
- (pid=16903) E0730 11:17:40.535482 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16903) Waiting 16000ms for space to free up...
- (pid=16991) E0730 11:17:40.916352 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- (pid=16991) Waiting 16000ms for space to free up...
- 2020-07-30 11:17:56,036 ERROR trial_runner.py:519 -- Trial PPO_stacked_procgen_env_00000: Error processing event.
- Traceback (most recent call last):
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
- result = self.trial_executor.fetch_result(trial)
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
- result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/worker.py", line 1515, in get
- raise value.as_instanceof_cause()
- ray.exceptions.RayTaskError: ray::PPO.train() (pid=16904, ip=172.31.27.29)
- File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
- File "python/ray/_raylet.pyx", line 417, in ray._raylet.execute_task.function_executor
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 495, in train
- raise e
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 484, in train
- result = Trainable.train(self)
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/trainable.py", line 261, in train
- result = self._train()
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 151, in _train
- fetches = self.optimizer.step()
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/optimizers/sync_samples_optimizer.py", line 59, in step
- for e in self.workers.remote_workers()
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/utils/memory.py", line 32, in ray_get_and_free
- return ray.get(object_ids)
- ray.exceptions.RayTaskError: ray::RolloutWorker.sample() (pid=16967, ip=172.31.27.29)
- File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
- File "python/ray/_raylet.pyx", line 478, in ray._raylet.execute_task
- File "python/ray/_raylet.pyx", line 1151, in ray._raylet.CoreWorker.store_task_outputs
- File "python/ray/_raylet.pyx", line 136, in ray._raylet.check_status
- ray.exceptions.ObjectStoreFullError: Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
- The local object store is full of objects that are still in scope and cannot be evicted. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>). You can also try setting an option to fallback to LRU eviction when the object store is full by calling ray.init(lru_evict=True). See also: https://docs.ray.io/en/latest/memory-management.html.
- == Status ==
- Memory usage on this node: 15.8/60.0 GiB
- Using FIFO scheduling algorithm.
- Resources requested: 0.0/2 CPUs, 0.0/1 GPUs, 0.0/0.05 GiB heap, 0.0/0.05 GiB objects
- Result logdir: /home/ubuntu/ray_results/stacked_and_batch_size4k
- Number of trials: 1 (1 ERROR)
- +-------------------------------+----------+-------+
- | Trial name | status | loc |
- |-------------------------------+----------+-------|
- | PPO_stacked_procgen_env_00000 | ERROR | |
- +-------------------------------+----------+-------+
- Number of errored trials: 1
- +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
- | Trial name | # failures | error file |
- |-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------|
- | PPO_stacked_procgen_env_00000 | 1 | /home/ubuntu/ray_results/stacked_and_batch_size4k/PPO_stacked_procgen_env_0_2020-07-30_11-17-057h6nqsxp/error.txt |
- +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
- == Status ==
- Memory usage on this node: 15.8/60.0 GiB
- Using FIFO scheduling algorithm.
- Resources requested: 0.0/2 CPUs, 0.0/1 GPUs, 0.0/0.05 GiB heap, 0.0/0.05 GiB objects
- Result logdir: /home/ubuntu/ray_results/stacked_and_batch_size4k
- Number of trials: 1 (1 ERROR)
- +-------------------------------+----------+-------+
- | Trial name | status | loc |
- |-------------------------------+----------+-------|
- | PPO_stacked_procgen_env_00000 | ERROR | |
- +-------------------------------+----------+-------+
- Number of errored trials: 1
- +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
- | Trial name | # failures | error file |
- |-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------|
- | PPO_stacked_procgen_env_00000 | 1 | /home/ubuntu/ray_results/stacked_and_batch_size4k/PPO_stacked_procgen_env_0_2020-07-30_11-17-057h6nqsxp/error.txt |
- +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
- Traceback (most recent call last):
- File "train.py", line 238, in <module>
- run(args, parser)
- File "train.py", line 232, in run
- concurrent=True)
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/tune.py", line 411, in run_experiments
- return_trials=True)
- File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/tune.py", line 347, in run
- raise TuneError("Trials did not complete", incomplete_trials)
- ray.tune.error.TuneError: ('Trials did not complete', [PPO_stacked_procgen_env_00000])
- (pid=16967) E0730 11:17:56.030732 16967 plasma_store_provider.cc:118] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 after 6 attempts. Plasma store status:
- (pid=16967) num clients with quota: 0
- (pid=16967) quota map size: 0
- (pid=16967) pinned quota map size: 0
- (pid=16967) allocated bytes: 2513348
- (pid=16967) allocation limit: 80000000
- (pid=16967) pinned bytes: 2513348
- (pid=16967) (global lru) capacity: 80000000
- (pid=16967) (global lru) used: 0%
- (pid=16967) (global lru) num objects: 0
- (pid=16967) (global lru) num evictions: 0
- (pid=16967) (global lru) bytes evicted: 0
- (pid=16967) ---
- (pid=16967) --- Tip: Use the `ray memory` command to list active objects in the cluster.
- (pid=16967) ---
- (pid=16966) E0730 11:17:56.200342 16966 plasma_store_provider.cc:118] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 after 6 attempts. Plasma store status:
- (pid=16966) num clients with quota: 0
- (pid=16966) quota map size: 0
- (pid=16966) pinned quota map size: 0
- (pid=16966) allocated bytes: 2513348
- (pid=16966) allocation limit: 80000000
- (pid=16966) pinned bytes: 2513348
- (pid=16966) (global lru) capacity: 80000000
- (pid=16966) (global lru) used: 0%
- (pid=16966) (global lru) num objects: 0
- (pid=16966) (global lru) num evictions: 0
- (pid=16966) (global lru) bytes evicted: 0
- (pid=16966) ---
- (pid=16966) --- Tip: Use the `ray memory` command to list active objects in the cluster.
- (pid=16966) ---
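The failure above comes down to simple arithmetic: each rollout worker tries to store a 302591732-byte object, while the run was started with `--ray-object-store-memory 80000000`. The following is a minimal sketch of that check, using only the numbers printed in the log. The helper names `fits` and `min_store_bytes` are hypothetical, and the idea that all four failing workers' objects must sit in the store at the same time is an assumption for illustration, not something the log confirms.

```python
# Values copied from the log above.
OBJECT_SIZE_BYTES = 302_591_732   # "Object size is 302591732 bytes."
STORE_LIMIT_BYTES = 80_000_000    # "allocation limit: 80000000"
PINNED_BYTES = 2_513_348          # "pinned bytes: 2513348"
FAILING_WORKERS = 4               # pids 16967, 16966, 16903, 16991


def fits(object_size: int, limit: int, pinned: int) -> bool:
    """Return True if a single object fits in the space not already pinned."""
    return object_size <= limit - pinned


def min_store_bytes(object_size: int, pinned: int, concurrent_objects: int) -> int:
    """Rough lower bound on the store size needed to hold
    `concurrent_objects` objects of this size next to the pinned bytes.
    Assumes (hypothetically) that all of them are live at once."""
    return object_size * concurrent_objects + pinned


if __name__ == "__main__":
    # One object alone is already larger than the whole store.
    print("single object fits:", fits(OBJECT_SIZE_BYTES, STORE_LIMIT_BYTES, PINNED_BYTES))
    # Lower bound if every failing worker's object has to coexist.
    print("rough minimum store size (bytes):",
          min_store_bytes(OBJECT_SIZE_BYTES, PINNED_BYTES, FAILING_WORKERS))
```

Under these assumptions a single object is nearly four times the configured store, so no amount of waiting or eviction could make the put succeed. Raising the value passed to `--ray-object-store-memory` (or `ray.init(object_store_memory=<bytes>)`, as the error message suggests) well above the per-object size, or reducing the size of each sample batch, are the two directions the log points to.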