a guest
Nov 29th, 2022
  1. 1. This command now runs correctly:
  2. (py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime
  3.  
  4.  
  5. 2. But this command hangs. The MPI program itself appears to be what hangs.
  6. test.py:
  7. import mpi4py
  8. from mpi4py import MPI
  9.  
  10. (py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
  11. [computer01:47982] mca: base: component_find: searching NULL for plm components
  12. [computer01:47982] mca: base: find_dyn_components: checking NULL for plm components
  13. [computer01:47982] pmix:mca: base: components_register: registering framework plm components
  14. [computer01:47982] pmix:mca: base: components_register: found loaded component slurm
  15. [computer01:47982] pmix:mca: base: components_register: component slurm register function successful
  16. [computer01:47982] pmix:mca: base: components_register: found loaded component ssh
  17. [computer01:47982] pmix:mca: base: components_register: component ssh register function successful
  18. [computer01:47982] mca: base: components_open: opening plm components
  19. [computer01:47982] mca: base: components_open: found loaded component slurm
  20. [computer01:47982] mca: base: components_open: component slurm open function successful
  21. [computer01:47982] mca: base: components_open: found loaded component ssh
  22. [computer01:47982] mca: base: components_open: component ssh open function successful
  23. [computer01:47982] mca:base:select: Auto-selecting plm components
  24. [computer01:47982] mca:base:select:(  plm) Querying component [slurm]
  25. [computer01:47982] mca:base:select:(  plm) Querying component [ssh]
  26. [computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
  27. [computer01:47982] mca:base:select:(  plm) Query of component [ssh] set priority to 10
  28. [computer01:47982] mca:base:select:(  plm) Selected component [ssh]
  29. [computer01:47982] mca: base: close: component slurm closed
  30. [computer01:47982] mca: base: close: unloading component slurm
  31. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : rsh path NULL
  32. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
  33. [computer01:47982] mca: base: component_find: searching NULL for ras components
  34. [computer01:47982] mca: base: find_dyn_components: checking NULL for ras components
  35. [computer01:47982] pmix:mca: base: components_register: registering framework ras components
  36. [computer01:47982] pmix:mca: base: components_register: found loaded component simulator
  37. [computer01:47982] pmix:mca: base: components_register: component simulator register function successful
  38. [computer01:47982] pmix:mca: base: components_register: found loaded component pbs
  39. [computer01:47982] pmix:mca: base: components_register: component pbs register function successful
  40. [computer01:47982] pmix:mca: base: components_register: found loaded component slurm
  41. [computer01:47982] pmix:mca: base: components_register: component slurm register function successful
  42. [computer01:47982] mca: base: components_open: opening ras components
  43. [computer01:47982] mca: base: components_open: found loaded component simulator
  44. [computer01:47982] mca: base: components_open: found loaded component pbs
  45. [computer01:47982] mca: base: components_open: component pbs open function successful
  46. [computer01:47982] mca: base: components_open: found loaded component slurm
  47. [computer01:47982] mca: base: components_open: component slurm open function successful
  48. [computer01:47982] mca:base:select: Auto-selecting ras components
  49. [computer01:47982] mca:base:select:(  ras) Querying component [simulator]
  50. [computer01:47982] mca:base:select:(  ras) Querying component [pbs]
  51. [computer01:47982] mca:base:select:(  ras) Querying component [slurm]
  52. [computer01:47982] mca:base:select:(  ras) No component selected!
  53. [computer01:47982] mca: base: component_find: searching NULL for rmaps components
  54. [computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps components
  55. [computer01:47982] pmix:mca: base: components_register: registering framework rmaps components
  56. [computer01:47982] pmix:mca: base: components_register: found loaded component ppr
  57. [computer01:47982] pmix:mca: base: components_register: component ppr register function successful
  58. [computer01:47982] pmix:mca: base: components_register: found loaded component rank_file
  59. [computer01:47982] pmix:mca: base: components_register: component rank_file has no register or open function
  60. [computer01:47982] pmix:mca: base: components_register: found loaded component round_robin
  61. [computer01:47982] pmix:mca: base: components_register: component round_robin register function successful
  62. [computer01:47982] pmix:mca: base: components_register: found loaded component seq
  63. [computer01:47982] pmix:mca: base: components_register: component seq register function successful
  64. [computer01:47982] mca: base: components_open: opening rmaps components
  65. [computer01:47982] mca: base: components_open: found loaded component ppr
  66. [computer01:47982] mca: base: components_open: component ppr open function successful
  67. [computer01:47982] mca: base: components_open: found loaded component rank_file
  68. [computer01:47982] mca: base: components_open: found loaded component round_robin
  69. [computer01:47982] mca: base: components_open: component round_robin open function successful
  70. [computer01:47982] mca: base: components_open: found loaded component seq
  71. [computer01:47982] mca: base: components_open: component seq open function successful
  72. [computer01:47982] mca:rmaps:select: checking available component ppr
  73. [computer01:47982] mca:rmaps:select: Querying component [ppr]
  74. [computer01:47982] mca:rmaps:select: checking available component rank_file
  75. [computer01:47982] mca:rmaps:select: Querying component [rank_file]
  76. [computer01:47982] mca:rmaps:select: checking available component round_robin
  77. [computer01:47982] mca:rmaps:select: Querying component [round_robin]
  78. [computer01:47982] mca:rmaps:select: checking available component seq
  79. [computer01:47982] mca:rmaps:select: Querying component [seq]
  80. [computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
  81. [computer01:47982]     Mapper: rank_file Priority: 100
  82. [computer01:47982]     Mapper: ppr Priority: 90
  83. [computer01:47982]     Mapper: seq Priority: 60
  84. [computer01:47982]     Mapper: round_robin Priority: 10
  85. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
  86. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
  87. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate adding hostfile hosts
  88. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
  89. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
  90. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
  91. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
  92. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
  93. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert inserting 2 nodes
  94. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert updating HNP [192.168.180.48] info to 1 slots
  95. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert node 192.168.60.203 slots 1
  96.  
  97. ======================   ALLOCATED NODES   ======================
  98.     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
  99.     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
  100.     aliases: 192.168.180.48
  101.     192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
  102.     Flags: SLOTS_GIVEN
  103.     aliases: NONE
  104. =================================================================
  105. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
  106. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm creating map
  107. [computer01:47982] [prterun-computer01-47982@0,0] setup:vm: working unmanaged allocation
  108. [computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
  109. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
  110. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
  111. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
  112. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
  113. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
  114. [computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.180.48
  115. [computer01:47982] [prterun-computer01-47982@0,0] ignoring myself
  116. [computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.60.203
  117. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new daemon [prterun-computer01-47982@0,1]
  118. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm assigning new daemon [prterun-computer01-47982@0,1] to node 192.168.60.203
  119. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: launching vm
  120. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: local shell: 0 (bash)
  121. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same remote shell as local shell
  122. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0 (bash)
  123. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template argv:
  124.     /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
  125. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0 not a child of mine
  126. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: adding node 192.168.60.203 to launch list
  127. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: activating launch event
  128. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch of daemon [prterun-computer01-47982@0,1]
  129. [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"]
  130. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1]
  131. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1] on node computer02
  132. [computer01:47982] ALIASES FOR NODE computer02 (computer02)
  133. [computer01:47982]     ALIAS: 192.168.60.203
  134. [computer01:47982]     ALIAS: computer02
  135. [computer01:47982]     ALIAS: 172.17.180.203
  136. [computer01:47982]     ALIAS: 172.168.10.23
  137. [computer01:47982]     ALIAS: 172.168.10.143
  138. [computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
  139. [computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING SIGNATURE
  140. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch completed for daemon [prterun-computer01-47982@0,1] at contact prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
  141. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch job prterun-computer01-47982@0 recvd 2 of 2 reported daemons
  142. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
  143. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive job launch command from [prterun-computer01-47982@0,0]
  144. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive adding hosts
  145.  
  146. ======================   ALLOCATED NODES   ======================
  147.     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
  148.     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
  149.     aliases: 192.168.180.48
  150.     computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
  151.     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
  152.     aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
  153. =================================================================
  154. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling spawn
  155. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
  156. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_job
  157. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
  158. [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate allocation already read
  159. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
  160. [computer01:47982] [prterun-computer01-47982@0,0] plm_base:setup_vm NODE computer02 WAS NOT ADDED
  161. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm no new daemons required
  162. [computer01:47982] mca:rmaps: mapping job prterun-computer01-47982@1
  163. [computer01:47982] mca:rmaps: setting mapping policies for job prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
  164. [computer01:47982] mca:rmaps[355] mapping not given - using bycore
  165. [computer01:47982] setdefaultbinding[314] binding not given - using bycore
  166. [computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using rankfile policy
  167. [computer01:47982] mca:rmaps:ppr: job prterun-computer01-47982@1 not using ppr mapper PPR NULL policy PPR NOTSET
  168. [computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job prterun-computer01-47982@1
  169. [computer01:47982] mca:rmaps:seq: job prterun-computer01-47982@1 not using seq mapper
  170. [computer01:47982] mca:rmaps:rr: mapping job prterun-computer01-47982@1
  171. [computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
  172. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
  173. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
  174. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
  175. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
  176. [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
  177. [computer01:47982] NODE computer01 DOESNT MATCH NODE 192.168.60.203
  178. [computer01:47982] [prterun-computer01-47982@0,0] node computer01 has 1 slots available
  179. [computer01:47982] [prterun-computer01-47982@0,0] node computer02 has 1 slots available
  180. [computer01:47982] AVAILABLE NODES FOR MAPPING:
  181. [computer01:47982]     node: computer01 daemon: 0 slots_available: 1
  182. [computer01:47982]     node: computer02 daemon: 1 slots_available: 1
  183. [computer01:47982] mca:rmaps:rr: mapping by Core for job prterun-computer01-47982@1 slots 2 num_procs 2
  184. [computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
  185. [computer01:47982] mca:rmaps:rr: assigning nprocs 1
  186. [computer01:47982] mca:rmaps:rr: assigning proc to object 0
  187. [computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer01 has 0 procs on it
  188. [computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
  189. [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
  190. [computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
  191. [computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
  192. [computer01:47982] mca:rmaps:rr: assigning nprocs 1
  193. [computer01:47982] mca:rmaps:rr: assigning proc to object 0
  194. [computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer02 has 0 procs on it
  195. [computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
  196. [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
  197. [computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
  198. [computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job prterun-computer01-47982@1
  199. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch_apps for job prterun-computer01-47982@1
  200. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:send launch msg for job prterun-computer01-47982@1
  201. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
  202. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local launch complete command from [prterun-computer01-47982@0,1]
  203. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for job prterun-computer01-47982@1
  204. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1
  205. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
  206. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
  207. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch wiring up iof for job prterun-computer01-47982@1
  208. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
  209. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive registered command from [prterun-computer01-47982@0,1]
  210. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for job prterun-computer01-47982@1
  211. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for vpid 1
  212. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
  213. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch prterun-computer01-47982@1 registered
  214. [computer01:47982] [prterun-computer01-47982@0,0] plm:base:prted_cmd sending prted_exit commands  #### (pressed Ctrl+C here)
  215. Abort is in progress...hit ctrl-c again to forcibly terminate
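
The contact URI reported for the daemon in the log above lists several TCP addresses for computer01, which is easy to miss in such a long line. As a reading aid only, here is a minimal sketch that splits that string into its parts. It assumes the layout is `<name>;tcp://<comma-separated addresses>:<port>:<comma-separated trailing fields>`, exactly as printed in this log; the trailing numbers are assumed (not confirmed here) to be netmask prefix lengths. The function name `parse_hnp_uri` is hypothetical and is not part of Open MPI, PRRTE, or mpi4py.

```python
def parse_hnp_uri(uri):
    """Split a contact URI, as printed in the log, into its parts.

    Assumed layout (taken from this log only):
        <name>;tcp://<addr>,<addr>,...:<port>:<n>,<n>,...
    """
    name, _, transport = uri.partition(";")
    scheme, _, rest = transport.partition("://")
    addresses, port, trailing = rest.split(":")
    return {
        "name": name,
        "scheme": scheme,
        "addresses": addresses.split(","),
        "port": int(port),
        # Assumption: these look like per-address prefix lengths.
        "masks": [int(n) for n in trailing.split(",")],
    }


# Contact string copied from the orted_report_launch line in the log.
URI = (
    "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,"
    "172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1"
    ":59788:24,16,24,24,24,24"
)

info = parse_hnp_uri(URI)
for addr, mask in zip(info["addresses"], info["masks"]):
    print(f"{addr}/{mask}")
```

Printing one address per line makes it easier to compare the interfaces computer01 advertises with the aliases reported for computer02 earlier in the log.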