- 1. This command runs correctly:
- (py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime
- 2. But this command hangs; the MPI program itself appears to be what gets stuck.
- test.py:
- import mpi4py
- from mpi4py import MPI
- (py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
- [computer01:47982] mca: base: component_find: searching NULL for plm components
- [computer01:47982] mca: base: find_dyn_components: checking NULL for plm components
- [computer01:47982] pmix:mca: base: components_register: registering framework plm components
- [computer01:47982] pmix:mca: base: components_register: found loaded component slurm
- [computer01:47982] pmix:mca: base: components_register: component slurm register function successful
- [computer01:47982] pmix:mca: base: components_register: found loaded component ssh
- [computer01:47982] pmix:mca: base: components_register: component ssh register function successful
- [computer01:47982] mca: base: components_open: opening plm components
- [computer01:47982] mca: base: components_open: found loaded component slurm
- [computer01:47982] mca: base: components_open: component slurm open function successful
- [computer01:47982] mca: base: components_open: found loaded component ssh
- [computer01:47982] mca: base: components_open: component ssh open function successful
- [computer01:47982] mca:base:select: Auto-selecting plm components
- [computer01:47982] mca:base:select:( plm) Querying component [slurm]
- [computer01:47982] mca:base:select:( plm) Querying component [ssh]
- [computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
- [computer01:47982] mca:base:select:( plm) Query of component [ssh] set priority to 10
- [computer01:47982] mca:base:select:( plm) Selected component [ssh]
- [computer01:47982] mca: base: close: component slurm closed
- [computer01:47982] mca: base: close: unloading component slurm
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : rsh path NULL
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
- [computer01:47982] mca: base: component_find: searching NULL for ras components
- [computer01:47982] mca: base: find_dyn_components: checking NULL for ras components
- [computer01:47982] pmix:mca: base: components_register: registering framework ras components
- [computer01:47982] pmix:mca: base: components_register: found loaded component simulator
- [computer01:47982] pmix:mca: base: components_register: component simulator register function successful
- [computer01:47982] pmix:mca: base: components_register: found loaded component pbs
- [computer01:47982] pmix:mca: base: components_register: component pbs register function successful
- [computer01:47982] pmix:mca: base: components_register: found loaded component slurm
- [computer01:47982] pmix:mca: base: components_register: component slurm register function successful
- [computer01:47982] mca: base: components_open: opening ras components
- [computer01:47982] mca: base: components_open: found loaded component simulator
- [computer01:47982] mca: base: components_open: found loaded component pbs
- [computer01:47982] mca: base: components_open: component pbs open function successful
- [computer01:47982] mca: base: components_open: found loaded component slurm
- [computer01:47982] mca: base: components_open: component slurm open function successful
- [computer01:47982] mca:base:select: Auto-selecting ras components
- [computer01:47982] mca:base:select:( ras) Querying component [simulator]
- [computer01:47982] mca:base:select:( ras) Querying component [pbs]
- [computer01:47982] mca:base:select:( ras) Querying component [slurm]
- [computer01:47982] mca:base:select:( ras) No component selected!
- [computer01:47982] mca: base: component_find: searching NULL for rmaps components
- [computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps components
- [computer01:47982] pmix:mca: base: components_register: registering framework rmaps components
- [computer01:47982] pmix:mca: base: components_register: found loaded component ppr
- [computer01:47982] pmix:mca: base: components_register: component ppr register function successful
- [computer01:47982] pmix:mca: base: components_register: found loaded component rank_file
- [computer01:47982] pmix:mca: base: components_register: component rank_file has no register or open function
- [computer01:47982] pmix:mca: base: components_register: found loaded component round_robin
- [computer01:47982] pmix:mca: base: components_register: component round_robin register function successful
- [computer01:47982] pmix:mca: base: components_register: found loaded component seq
- [computer01:47982] pmix:mca: base: components_register: component seq register function successful
- [computer01:47982] mca: base: components_open: opening rmaps components
- [computer01:47982] mca: base: components_open: found loaded component ppr
- [computer01:47982] mca: base: components_open: component ppr open function successful
- [computer01:47982] mca: base: components_open: found loaded component rank_file
- [computer01:47982] mca: base: components_open: found loaded component round_robin
- [computer01:47982] mca: base: components_open: component round_robin open function successful
- [computer01:47982] mca: base: components_open: found loaded component seq
- [computer01:47982] mca: base: components_open: component seq open function successful
- [computer01:47982] mca:rmaps:select: checking available component ppr
- [computer01:47982] mca:rmaps:select: Querying component [ppr]
- [computer01:47982] mca:rmaps:select: checking available component rank_file
- [computer01:47982] mca:rmaps:select: Querying component [rank_file]
- [computer01:47982] mca:rmaps:select: checking available component round_robin
- [computer01:47982] mca:rmaps:select: Querying component [round_robin]
- [computer01:47982] mca:rmaps:select: checking available component seq
- [computer01:47982] mca:rmaps:select: Querying component [seq]
- [computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
- [computer01:47982] Mapper: rank_file Priority: 100
- [computer01:47982] Mapper: ppr Priority: 90
- [computer01:47982] Mapper: seq Priority: 60
- [computer01:47982] Mapper: round_robin Priority: 10
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate adding hostfile hosts
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert inserting 2 nodes
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert updating HNP [192.168.180.48] info to 1 slots
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert node 192.168.60.203 slots 1
- ====================== ALLOCATED NODES ======================
- computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
- Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
- aliases: 192.168.180.48
- 192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
- Flags: SLOTS_GIVEN
- aliases: NONE
- =================================================================
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm creating map
- [computer01:47982] [prterun-computer01-47982@0,0] setup:vm: working unmanaged allocation
- [computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
- [computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.180.48
- [computer01:47982] [prterun-computer01-47982@0,0] ignoring myself
- [computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.60.203
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new daemon [prterun-computer01-47982@0,1]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm assigning new daemon [prterun-computer01-47982@0,1] to node 192.168.60.203
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: launching vm
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: local shell: 0 (bash)
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same remote shell as local shell
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0 (bash)
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template argv:
- /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0 not a child of mine
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: adding node 192.168.60.203 to launch list
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: activating launch event
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch of daemon [prterun-computer01-47982@0,1]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1] on node computer02
- [computer01:47982] ALIASES FOR NODE computer02 (computer02)
- [computer01:47982] ALIAS: 192.168.60.203
- [computer01:47982] ALIAS: computer02
- [computer01:47982] ALIAS: 172.17.180.203
- [computer01:47982] ALIAS: 172.168.10.23
- [computer01:47982] ALIAS: 172.168.10.143
- [computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
- [computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING SIGNATURE
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch completed for daemon [prterun-computer01-47982@0,1] at contact prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch job prterun-computer01-47982@0 recvd 2 of 2 reported daemons
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive job launch command from [prterun-computer01-47982@0,0]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive adding hosts
- ====================== ALLOCATED NODES ======================
- computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
- Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
- aliases: 192.168.180.48
- computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
- Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
- aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
- =================================================================
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling spawn
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_job
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
- [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate allocation already read
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
- [computer01:47982] [prterun-computer01-47982@0,0] plm_base:setup_vm NODE computer02 WAS NOT ADDED
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm no new daemons required
- [computer01:47982] mca:rmaps: mapping job prterun-computer01-47982@1
- [computer01:47982] mca:rmaps: setting mapping policies for job prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
- [computer01:47982] mca:rmaps[355] mapping not given - using bycore
- [computer01:47982] setdefaultbinding[314] binding not given - using bycore
- [computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using rankfile policy
- [computer01:47982] mca:rmaps:ppr: job prterun-computer01-47982@1 not using ppr mapper PPR NULL policy PPR NOTSET
- [computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job prterun-computer01-47982@1
- [computer01:47982] mca:rmaps:seq: job prterun-computer01-47982@1 not using seq mapper
- [computer01:47982] mca:rmaps:rr: mapping job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
- [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
- [computer01:47982] NODE computer01 DOESNT MATCH NODE 192.168.60.203
- [computer01:47982] [prterun-computer01-47982@0,0] node computer01 has 1 slots available
- [computer01:47982] [prterun-computer01-47982@0,0] node computer02 has 1 slots available
- [computer01:47982] AVAILABLE NODES FOR MAPPING:
- [computer01:47982] node: computer01 daemon: 0 slots_available: 1
- [computer01:47982] node: computer02 daemon: 1 slots_available: 1
- [computer01:47982] mca:rmaps:rr: mapping by Core for job prterun-computer01-47982@1 slots 2 num_procs 2
- [computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
- [computer01:47982] mca:rmaps:rr: assigning nprocs 1
- [computer01:47982] mca:rmaps:rr: assigning proc to object 0
- [computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer01 has 0 procs on it
- [computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
- [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
- [computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
- [computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
- [computer01:47982] mca:rmaps:rr: assigning nprocs 1
- [computer01:47982] mca:rmaps:rr: assigning proc to object 0
- [computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer02 has 0 procs on it
- [computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
- [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
- [computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
- [computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch_apps for job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:send launch msg for job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local launch complete command from [prterun-computer01-47982@0,1]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch wiring up iof for job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive registered command from [prterun-computer01-47982@0,1]
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for job prterun-computer01-47982@1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for vpid 1
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch prterun-computer01-47982@1 registered
- [computer01:47982] [prterun-computer01-47982@0,0] plm:base:prted_cmd sending prted_exit commands   #### (pressed Ctrl-C here)
- Abort is in progress...hit ctrl-c again to forcibly terminate
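
To narrow down where the hang happens, here is a hedged variant of test.py (my addition, not from the original paste): it prints a flushed marker on each host before and after the mpi4py import. Since `from mpi4py import MPI` triggers MPI_Init, a stall between the two markers points at MPI_Init/wire-up between the ranks rather than at daemon launch, which the working `uptime` run already exercises.

```python
# test_mpi_init.py — diagnostic sketch (assumes mpi4py is installed on both hosts)
import socket


def marker(msg):
    # Flush immediately so the line survives a later hang or Ctrl-C.
    print(f"[{socket.gethostname()}] {msg}", flush=True)
    return msg


marker("before mpi4py import")
try:
    # Importing MPI calls MPI_Init; if the run hangs here on one or both
    # hosts, the ranks are blocking during MPI wire-up, not during launch.
    from mpi4py import MPI

    marker(f"rank {MPI.COMM_WORLD.Get_rank()} of {MPI.COMM_WORLD.Get_size()}")
except ImportError:
    marker("mpi4py not installed on this host")
```

Run it with the same mpirun line as above. If the "before" marker appears from both hosts but the rank line never does, the daemons launched fine and the processes are blocking inside MPI_Init; with multiple interfaces present (the log shows 192.168.x, 172.x, and a 192.168.122.1 address, which is typically a libvirt bridge), restricting Open MPI to the interconnect, e.g. `--mca btl_tcp_if_include 192.168.60.0/24`, is a common fix, though that is a guess from the log, not a confirmed diagnosis.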