- [root@slurm-master dmtcp-trunk]# dmtcp_coordinator
- dmtcp_coordinator (DMTCP) 2.3.1
- License LGPLv3+: GNU LGPL version 3 or later
- <http://gnu.org/licenses/lgpl.html>.
- This program comes with ABSOLUTELY NO WARRANTY.
- This is free software, and you are welcome to redistribute it
- under certain conditions; see COPYING file for details.
- (Use flag "-q" to hide this message.)
- [6364] TRACE at dmtcp_coordinator.cpp:1795 in main; REASON='New DMTCP coordinator starting.'
- dmtcp::UniquePid::ThisProcess() = 6db90f3d5a9dd200-6364-546a1de3
- dmtcp_coordinator starting...
- Host: slurm-master (192.168.122.11)
- Port: 7779
- Checkpoint Interval: disabled (checkpoint manually instead)
- Exit on last client: 0
- Type '?' for help.
- [6364] TRACE at dmtcp_coordinator.cpp:923 in onConnect; REASON='accepting new connection'
- remote.sockfd() = 5
- (strerror((*__errno_location ()))) = Success
- [6364] TRACE at dmtcp_coordinator.cpp:932 in onConnect; REASON='Reading from incoming connection...'
- [6364] TRACE at dmtcp_coordinator.cpp:1225 in validateNewWorkerProcess; REASON='First process connected. Creating new computation group'
- compId = 6db90f3d5a9dd200-40000-546a1de8
- [6364] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
- hello_remote.from = 6db90f3d5a9dd200-6365-546a1de8
- [6364] TRACE at dmtcp_coordinator.cpp:1045 in onConnect; REASON='END'
- clients.size() = 1
- [6364] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating process Information after exec()'
- progname = srun
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- client->identity() = 6db90f3d5a9dd200-6365-546a1de8
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::RUNNING
- oldState = WorkerState::RUNNING
- newState = WorkerState::RUNNING
- [6364] TRACE at dmtcp_coordinator.cpp:923 in onConnect; REASON='accepting new connection'
- remote.sockfd() = 6
- (strerror((*__errno_location ()))) = Success
- [6364] TRACE at dmtcp_coordinator.cpp:932 in onConnect; REASON='Reading from incoming connection...'
- [6364] TRACE at dmtcp_coordinator.cpp:1228 in validateNewWorkerProcess; REASON='New process connected'
- hello_remote.from = 6db90f3d5a9dd200-40000-546a1de8
- client->prefixDir() =
- client->virtualPid() = 41000
- [6364] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
- hello_remote.from = 6db90f3d5a9dd200-40000-546a1de8
- [6364] TRACE at dmtcp_coordinator.cpp:1045 in onConnect; REASON='END'
- clients.size() = 2
- [6364] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating process Information after fork()'
- client->hostname() = slurm-master
- client->progname() = srun_(forked)
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- client->identity() = 6db90f3d5a9dd200-40000-546a1de8
- (start checkpoint)
- c
- [6364] TRACE at dmtcp_coordinator.cpp:516 in handleUserCommand; REASON='checkpointing...'
- [6364] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
- s.numPeers = 2
- [6364] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint; REASON='Incremented Generation'
- compId.generation() = 1
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_SUSPEND
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::SUSPENDED
- oldState = WorkerState::RUNNING
- newState = WorkerState::RUNNING
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::SUSPENDED
- oldState = WorkerState::RUNNING
- newState = WorkerState::SUSPENDED
- [6364] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState; REASON='locking all nodes'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_FD_LEADER_ELECTION
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::FD_LEADER_ELECTION
- oldState = WorkerState::SUSPENDED
- newState = WorkerState::SUSPENDED
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::FD_LEADER_ELECTION
- oldState = WorkerState::SUSPENDED
- newState = WorkerState::FD_LEADER_ELECTION
- [6364] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState; REASON='draining all nodes'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_DRAIN
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::DRAINED
- oldState = WorkerState::FD_LEADER_ELECTION
- newState = WorkerState::FD_LEADER_ELECTION
- (it hangs here for a while)
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::DRAINED
- oldState = WorkerState::FD_LEADER_ELECTION
- newState = WorkerState::DRAINED
- [6364] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState; REASON='checkpointing all nodes'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_CHECKPOINT
- [6364] TRACE at dmtcp_coordinator.cpp:765 in onData; REASON='recording restart info'
- ckptFilename = /home/slurm/ckpt_srun_6db90f3d5a9dd200-41000-546a1de8.dmtcp
- hostname = slurm-master
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::CHECKPOINTED
- oldState = WorkerState::DRAINED
- newState = WorkerState::DRAINED
- [6364] TRACE at dmtcp_coordinator.cpp:765 in onData; REASON='recording restart info'
- ckptFilename = /home/slurm/ckpt_srun_6db90f3d5a9dd200-40000-546a1de8.dmtcp
- hostname = slurm-master
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::CHECKPOINTED
- oldState = WorkerState::DRAINED
- newState = WorkerState::CHECKPOINTED
- [6364] TRACE at dmtcp_coordinator.cpp:1370 in writeRestartScript; REASON='writing restart script'
- uniqueFilename = ./dmtcp_restart_script_6db90f3d5a9dd200-40000-546a1de8.sh
- [6364] TRACE at dmtcp_coordinator.cpp:1419 in writeRestartScript; REASON='Single HOST'
- [6364] TRACE at dmtcp_coordinator.cpp:1522 in writeRestartScript; REASON='linking "dmtcp_restart_script.sh" filename to uniqueFilename'
- filename = dmtcp_restart_script.sh
- dirname = .
- uniqueFilename = ./dmtcp_restart_script_6db90f3d5a9dd200-40000-546a1de8.sh
- [6364] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState; REASON='building name service database'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_REGISTER_NAME_SERVICE_DATA
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::NAME_SERVICE_DATA_REGISTERED
- oldState = WorkerState::CHECKPOINTED
- newState = WorkerState::CHECKPOINTED
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::NAME_SERVICE_DATA_REGISTERED
- oldState = WorkerState::CHECKPOINTED
- newState = WorkerState::NAME_SERVICE_DATA_REGISTERED
- [6364] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState; REASON='entertaining queries now'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_SEND_QUERIES
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::DONE_QUERYING
- oldState = WorkerState::NAME_SERVICE_DATA_REGISTERED
- newState = WorkerState::NAME_SERVICE_DATA_REGISTERED
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::DONE_QUERYING
- oldState = WorkerState::NAME_SERVICE_DATA_REGISTERED
- newState = WorkerState::DONE_QUERYING
- [6364] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState; REASON='refilling all nodes'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_REFILL
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::REFILLED
- oldState = WorkerState::DONE_QUERYING
- newState = WorkerState::DONE_QUERYING
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::REFILLED
- oldState = WorkerState::DONE_QUERYING
- newState = WorkerState::REFILLED
- [6364] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState; REASON='restarting all nodes'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_DO_RESUME
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::RUNNING
- oldState = WorkerState::REFILLED
- newState = WorkerState::REFILLED
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::RUNNING
- oldState = WorkerState::REFILLED
- newState = WorkerState::RUNNING
- [6364] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
- client->identity() = 6db90f3d5a9dd200-41000-546a1de8
- [6364] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
- client->identity() = 6db90f3d5a9dd200-40000-546a1de8
- [6364] TRACE at dmtcp_coordinator.cpp:850 in removeStaleSharedAreaFile; REASON='Removing sharedArea file.'
- o.str() = /tmp/dmtcp-root@slurm-master/dmtcpSharedArea.6db90f3d5a9dd200-40000-546a1de8.546a1de87
- ^C[6364] NOTE at dmtcp_coordinator.cpp:556 in handleUserCommand; REASON='killing all connected peers and quitting ...'
- [6364] TRACE at dmtcp_coordinator.cpp:1312 in broadcastMessage; REASON='sending message'
- type = DMT_KILL_PEER
- DMTCP coordinator exiting... (per request)
- [6364] TRACE at dmtcp_coordinator.cpp:850 in removeStaleSharedAreaFile; REASON='Removing sharedArea file.'
- o.str() = /tmp/dmtcp-root@slurm-master/dmtcpSharedArea.6db90f3d5a9dd200-40000-546a1de8.546a1de87
- [6364] TRACE at dmtcp_coordinator.cpp:857 in preExitCleanup; REASON='Removing port-file'
- thePortFile =
- [6364] TRACE at dmtcp_coordinator.cpp:564 in handleUserCommand; REASON='Exiting ...'
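
The log above has no timestamps, so the stall noted before the second DRAINED report can only be attributed by reading which worker acknowledged each phase last. As a rough aid, here is a minimal Python sketch that scans a coordinator log like this one and records, for each worker state, the worker that reported it last. The function name `last_reporter_per_state` and the regular expression are illustrative assumptions based only on the line layout visible in this paste; they are not part of DMTCP.

```python
import re

# Assumption: each DMT_OK block prints "msg.from = <id>" immediately followed
# by "msg.state = WorkerState::<STATE>", optionally with the paste's "- " prefix.
OK_BLOCK = re.compile(
    r"msg\.from = (\S+)\n(?:- )?msg\.state = WorkerState::(\w+)"
)

def last_reporter_per_state(log_text):
    """Map each worker state to the id of the worker that reported it last."""
    last = {}
    for worker_id, state in OK_BLOCK.findall(log_text):
        last[state] = worker_id  # later reports overwrite earlier ones
    return last

# Excerpt copied from the drain phase of the log above.
sample = """\
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-41000-546a1de8
- msg.state = WorkerState::DRAINED
- oldState = WorkerState::FD_LEADER_ELECTION
- newState = WorkerState::FD_LEADER_ELECTION
- [6364] TRACE at dmtcp_coordinator.cpp:747 in onData; REASON='got DMT_OK message'
- msg.from = 6db90f3d5a9dd200-40000-546a1de8
- msg.state = WorkerState::DRAINED
"""

for state, worker in last_reporter_per_state(sample).items():
    print(state, "last reported by", worker)
```

In this excerpt the last worker to reach DRAINED is 6db90f3d5a9dd200-40000-546a1de8, which matches the report that arrives after the pause.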