Untitled

a guest
Jul 30th, 2020
  1. (procgen) ➜ procgen git:(master) ✗ ./run.sh --train
  2. _____ _
  3. /\ |_ _| | |
  4. / \ | | ___ _ __ _____ ____| |
  5. / /\ \ | | / __| '__/ _ \ \ /\ / / _ |
  6. / ____ \ _| || (__| | | (_) \ V V / (_| |
  7. /_/ \_\_____\___|_| \___/ \_/\_/ \__,_|
  8.  
  9. Executing: python train.py -f experiments/impala-stacked-2-cpus.yaml --ray-memory 55000000 --ray-num-cpus 2 --ray-object-store-memory 80000000
  10. {'contrib/RandomAgent': <function _import_random_agent at 0x7f4cd168f048>, 'contrib/MADDPG': <function _import_maddpg at 0x7f4cd168f0d0>, 'contrib/AlphaZero': <function _import_alphazero at 0x7f4cd168f158>, 'contrib/LinTS': <function _import_bandit_lints at 0x7f4cd168f1e0>, 'contrib/LinUCB': <function _import_bandit_linucb at 0x7f4cd168f268>}
  11. 2020-07-30 11:17:04,064 INFO resource_spec.py:212 -- Starting Ray with 0.05 GiB memory available for workers and up to 0.07 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
  12. 2020-07-30 11:17:04,433 INFO services.py:1170 -- View the Ray dashboard at localhost:8265
  13. == Status ==
  14. Memory usage on this node: 1.5/60.0 GiB
  15. Using FIFO scheduling algorithm.
  16. Resources requested: 2.0/2 CPUs, 0.7/1 GPUs, 0.0/0.05 GiB heap, 0.0/0.05 GiB objects
  17. Result logdir: /home/ubuntu/ray_results/stacked_and_batch_size4k
  18. Number of trials: 1 (1 RUNNING)
  19. +-------------------------------+----------+-------+
  20. | Trial name                    | status   | loc   |
  21. |-------------------------------+----------+-------|
  22. | PPO_stacked_procgen_env_00000 | RUNNING  |       |
  23. +-------------------------------+----------+-------+
  24.  
  25.  
  26. (pid=16904) 2020-07-30 11:17:07,477 INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
  27. (pid=16904) 2020-07-30 11:17:07,503 INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
  28. (pid=16904) 2020-07-30 11:17:12,894 INFO trainable.py:217 -- Getting current IP.
  29. (pid=16904) 2020-07-30 11:17:12,895 WARNING util.py:37 -- Install gputil for GPU system monitoring.
  30. 2020-07-30 11:17:13,356 WARNING worker.py:1090 -- WARNING: 6 PYTHON workers have been started. This could be a result of using a large number of actors, or it could be a consequence of using nested tasks (see https://github.com/ray-project/ray/issues/3644) for some a discussion of workarounds.
  31. (pid=16967) E0730 11:17:25.027956 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  32. (pid=16967) Waiting 1000ms for space to free up...
  33. 2020-07-30 11:17:25,205 INFO (unknown file):0 -- gc.collect() freed 84 refs in 0.14950760600004287 seconds
  34. (pid=16904) 2020-07-30 11:17:25,178 INFO (unknown file):0 -- gc.collect() freed 3 refs in 0.12236754700006713 seconds
  35. (pid=16966) E0730 11:17:25.197851 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  36. (pid=16966) Waiting 1000ms for space to free up...
  37. (pid=16903) E0730 11:17:25.531816 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  38. (pid=16903) Waiting 1000ms for space to free up...
  39. (pid=16991) E0730 11:17:25.914304 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  40. (pid=16991) Waiting 1000ms for space to free up...
  41. (pid=16967) E0730 11:17:26.028805 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  42. (pid=16967) Waiting 2000ms for space to free up...
  43. (pid=16966) E0730 11:17:26.198580 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  44. (pid=16966) Waiting 2000ms for space to free up...
  45. (pid=16903) E0730 11:17:26.532692 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  46. (pid=16903) Waiting 2000ms for space to free up...
  47. (pid=16991) E0730 11:17:26.915047 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  48. (pid=16991) Waiting 2000ms for space to free up...
  49. (pid=16967) E0730 11:17:28.029335 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  50. (pid=16967) Waiting 4000ms for space to free up...
  51. (pid=16966) E0730 11:17:28.198992 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  52. (pid=16966) Waiting 4000ms for space to free up...
  53. (pid=16903) E0730 11:17:28.534523 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  54. (pid=16903) Waiting 4000ms for space to free up...
  55. (pid=16991) E0730 11:17:28.915459 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  56. (pid=16991) Waiting 4000ms for space to free up...
  57. (pid=16967) E0730 11:17:32.029805 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  58. (pid=16967) Waiting 8000ms for space to free up...
  59. (pid=16966) E0730 11:17:32.199393 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  60. (pid=16966) Waiting 8000ms for space to free up...
  61. (pid=16903) E0730 11:17:32.535001 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  62. (pid=16903) Waiting 8000ms for space to free up...
  63. (pid=16991) E0730 11:17:32.915891 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  64. (pid=16991) Waiting 8000ms for space to free up...
  65. (pid=16967) E0730 11:17:40.030243 16967 plasma_store_provider.cc:108] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  66. (pid=16967) Waiting 16000ms for space to free up...
  67. (pid=16966) E0730 11:17:40.199820 16966 plasma_store_provider.cc:108] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  68. (pid=16966) Waiting 16000ms for space to free up...
  69. (pid=16903) E0730 11:17:40.535482 16903 plasma_store_provider.cc:108] Failed to put object 51fd2849438db5632512146c010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  70. (pid=16903) Waiting 16000ms for space to free up...
  71. (pid=16991) E0730 11:17:40.916352 16991 plasma_store_provider.cc:108] Failed to put object e7692311122d9c277a78cec9010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  72. (pid=16991) Waiting 16000ms for space to free up...
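The retry waits above double each time (1000, 2000, 4000, 8000, 16000 ms). As a rough sketch of that pattern, and not Ray's actual implementation, the schedule can be reproduced as below. The function name `backoff_schedule` is hypothetical; the attempt count of 6 is taken from the later "after 6 attempts" message, and the start time is the first failure logged by pid 16967.

```python
from datetime import datetime, timedelta


def backoff_schedule(first_delay_ms=1000, attempts=6):
    """Delays (ms) between consecutive attempts, doubling after each failure.

    Hypothetical helper for illustration; 6 attempts means 5 waits.
    """
    return [first_delay_ms * 2 ** i for i in range(attempts - 1)]


delays = backoff_schedule()
total_s = sum(delays) / 1000

# First failed put by pid 16967, as logged: E0730 11:17:25.027956
first_attempt = datetime(2020, 7, 30, 11, 17, 25, 27956)
last_attempt = first_attempt + timedelta(milliseconds=sum(delays))

print(delays)               # [1000, 2000, 4000, 8000, 16000]
print(total_s)              # 31.0
print(last_attempt.time())  # 11:17:56.027956
```

The computed final attempt lands within a few milliseconds of the 11:17:56.030732 failure that pid 16967 reports further down, which suggests the whole 31 seconds were spent waiting for space that never became available.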
  73. 2020-07-30 11:17:56,036 ERROR trial_runner.py:519 -- Trial PPO_stacked_procgen_env_00000: Error processing event.
  74. Traceback (most recent call last):
  75. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
  76. result = self.trial_executor.fetch_result(trial)
  77. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
  78. result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  79. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/worker.py", line 1515, in get
  80. raise value.as_instanceof_cause()
  81. ray.exceptions.RayTaskError: ray::PPO.train() (pid=16904, ip=172.31.27.29)
  82. File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  83. File "python/ray/_raylet.pyx", line 417, in ray._raylet.execute_task.function_executor
  84. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 495, in train
  85. raise e
  86. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 484, in train
  87. result = Trainable.train(self)
  88. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/trainable.py", line 261, in train
  89. result = self._train()
  90. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 151, in _train
  91. fetches = self.optimizer.step()
  92. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/optimizers/sync_samples_optimizer.py", line 59, in step
  93. for e in self.workers.remote_workers()
  94. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/rllib/utils/memory.py", line 32, in ray_get_and_free
  95. return ray.get(object_ids)
  96. ray.exceptions.RayTaskError: ray::RolloutWorker.sample() (pid=16967, ip=172.31.27.29)
  97. File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  98. File "python/ray/_raylet.pyx", line 478, in ray._raylet.execute_task
  99. File "python/ray/_raylet.pyx", line 1151, in ray._raylet.CoreWorker.store_task_outputs
  100. File "python/ray/_raylet.pyx", line 136, in ray._raylet.check_status
  101. ray.exceptions.ObjectStoreFullError: Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 in object store because it is full. Object size is 302591732 bytes.
  102. The local object store is full of objects that are still in scope and cannot be evicted. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>). You can also try setting an option to fallback to LRU eviction when the object store is full by calling ray.init(lru_evict=True). See also: https://docs.ray.io/en/latest/memory-management.html.
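The error message suggests raising `object_store_memory`. A back-of-the-envelope check of the numbers in this log may help pick a value. The sketch below only does arithmetic on figures copied from the log; the assumption that all four sample objects must be resident at once, and the 1.5x headroom factor, are guesses rather than anything Ray documents.

```python
GIB = 1024 ** 3

object_size = 302_591_732  # bytes, from "Object size is ..." in the log
store_limit = 80_000_000   # bytes, from --ray-object-store-memory
num_writers = 4            # pids 16967, 16966, 16903, 16991 each put one object


def gib(n_bytes):
    """Convert bytes to GiB, rounded to two decimals as in Ray's status lines."""
    return round(n_bytes / GIB, 2)


# Assumption: every writer's object has to fit in the store simultaneously.
needed = object_size * num_writers
# Assumption: 1.5x headroom for other objects; integer math avoids float error.
suggested = needed * 3 // 2

print(gib(store_limit))           # 0.07, matching "up to 0.07 GiB for objects"
print(gib(object_size))           # 0.28
print(object_size > store_limit)  # True: a single object exceeds the whole store
print(suggested)                  # 1815550392

# The value could then be passed the way the log already does it, e.g.
#   python train.py ... --ray-object-store-memory 1815550392
# or, as the error message suggests, ray.init(object_store_memory=suggested).
```

Since one object is already about four times larger than the configured store, no amount of eviction can make room for it at the current setting; either the limit has to grow or the sampled batches have to shrink.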
  103. == Status ==
  104. Memory usage on this node: 15.8/60.0 GiB
  105. Using FIFO scheduling algorithm.
  106. Resources requested: 0.0/2 CPUs, 0.0/1 GPUs, 0.0/0.05 GiB heap, 0.0/0.05 GiB objects
  107. Result logdir: /home/ubuntu/ray_results/stacked_and_batch_size4k
  108. Number of trials: 1 (1 ERROR)
  109. +-------------------------------+----------+-------+
  110. | Trial name                    | status   | loc   |
  111. |-------------------------------+----------+-------|
  112. | PPO_stacked_procgen_env_00000 | ERROR    |       |
  113. +-------------------------------+----------+-------+
  114. Number of errored trials: 1
  115. +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
  116. | Trial name                    |   # failures | error file                                                                                                        |
  117. |-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------|
  118. | PPO_stacked_procgen_env_00000 |            1 | /home/ubuntu/ray_results/stacked_and_batch_size4k/PPO_stacked_procgen_env_0_2020-07-30_11-17-057h6nqsxp/error.txt |
  119. +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
  120.  
  121. == Status ==
  122. Memory usage on this node: 15.8/60.0 GiB
  123. Using FIFO scheduling algorithm.
  124. Resources requested: 0.0/2 CPUs, 0.0/1 GPUs, 0.0/0.05 GiB heap, 0.0/0.05 GiB objects
  125. Result logdir: /home/ubuntu/ray_results/stacked_and_batch_size4k
  126. Number of trials: 1 (1 ERROR)
  127. +-------------------------------+----------+-------+
  128. | Trial name                    | status   | loc   |
  129. |-------------------------------+----------+-------|
  130. | PPO_stacked_procgen_env_00000 | ERROR    |       |
  131. +-------------------------------+----------+-------+
  132. Number of errored trials: 1
  133. +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
  134. | Trial name                    |   # failures | error file                                                                                                        |
  135. |-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------|
  136. | PPO_stacked_procgen_env_00000 |            1 | /home/ubuntu/ray_results/stacked_and_batch_size4k/PPO_stacked_procgen_env_0_2020-07-30_11-17-057h6nqsxp/error.txt |
  137. +-------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------+
  138.  
  139. Traceback (most recent call last):
  140. File "train.py", line 238, in <module>
  141. run(args, parser)
  142. File "train.py", line 232, in run
  143. concurrent=True)
  144. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/tune.py", line 411, in run_experiments
  145. return_trials=True)
  146. File "/home/ubuntu/procgen/lib/python3.6/site-packages/ray/tune/tune.py", line 347, in run
  147. raise TuneError("Trials did not complete", incomplete_trials)
  148. ray.tune.error.TuneError: ('Trials did not complete', [PPO_stacked_procgen_env_00000])
  149. (pid=16967) E0730 11:17:56.030732 16967 plasma_store_provider.cc:118] Failed to put object 21a8a4446e604fa5d03b8d12010000c801000000 after 6 attempts. Plasma store status:
  150. (pid=16967) num clients with quota: 0
  151. (pid=16967) quota map size: 0
  152. (pid=16967) pinned quota map size: 0
  153. (pid=16967) allocated bytes: 2513348
  154. (pid=16967) allocation limit: 80000000
  155. (pid=16967) pinned bytes: 2513348
  156. (pid=16967) (global lru) capacity: 80000000
  157. (pid=16967) (global lru) used: 0%
  158. (pid=16967) (global lru) num objects: 0
  159. (pid=16967) (global lru) num evictions: 0
  160. (pid=16967) (global lru) bytes evicted: 0
  161. (pid=16967) ---
  162. (pid=16967) --- Tip: Use the `ray memory` command to list active objects in the cluster.
  163. (pid=16967) ---
  164. (pid=16966) E0730 11:17:56.200342 16966 plasma_store_provider.cc:118] Failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 after 6 attempts. Plasma store status:
  165. (pid=16966) num clients with quota: 0
  166. (pid=16966) quota map size: 0
  167. (pid=16966) pinned quota map size: 0
  168. (pid=16966) allocated bytes: 2513348
  169. (pid=16966) allocation limit: 80000000
  170. (pid=16966) pinned bytes: 2513348
  171. (pid=16966) (global lru) capacity: 80000000
  172. (pid=16966) (global lru) used: 0%
  173. (pid=16966) (global lru) num objects: 0
  174. (pid=16966) (global lru) num evictions: 0
  175. (pid=16966) (global lru) bytes evicted: 0
  176. (pid=16966) ---
  177. (pid=16966) --- Tip: Use the `ray memory` command to list active objects in the cluster.
  178. (pid=16966) ---
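The plasma store status blocks above can be read mechanically. The following is a minimal sketch, assuming the paste's leading line numbers have been stripped so each line starts with `(pid=...)`; the regex and the helper name `parse_status` are illustrative, not part of Ray.

```python
import re

# A few status lines copied from the pid 16967 block, paste numbering removed.
STATUS_LINES = """\
(pid=16967) allocated bytes: 2513348
(pid=16967) allocation limit: 80000000
(pid=16967) pinned bytes: 2513348
(pid=16967) (global lru) capacity: 80000000
(pid=16967) (global lru) num objects: 0
""".splitlines()

# "(pid=N) <key>: <integer>"; the key may itself contain parentheses.
LINE_RE = re.compile(r"^\(pid=(\d+)\)\s+(.+?):\s+(\d+)$")


def parse_status(lines):
    """Collect 'key: integer' pairs from plasma status lines into a dict."""
    status = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            status[match.group(2)] = int(match.group(3))
    return status


status = parse_status(STATUS_LINES)
free_bytes = status["allocation limit"] - status["allocated bytes"]
object_size = 302_591_732  # bytes, from the "Failed to put object" lines

print(free_bytes)                                 # 77486652
print(object_size <= status["allocation limit"])  # False
```

Read this way, the status block shows the store was nearly empty (about 2.5 MB allocated, all of it pinned) when the put failed, so the failure comes from the 302591732-byte object exceeding the 80000000-byte limit rather than from other objects filling the store.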