After the clusterwide stop and restart:

On dnds1-13:
-------------------
2013-03-14 00:11:37,993 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Atomically moving dnds1-4,60020,1363219866063's hlogs to my queue
2013-03-14 00:11:37,997 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-4%2C60020%2C1363219866063.1363219868868 with data
2013-03-14 00:11:37,997 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: The multi list size is: 5
2013-03-14 00:11:38,068 WARN org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1436)
    at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:590)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2013-03-14 00:11:38,070 INFO org.apache.hadoop.hbase.metrics: new MBeanInfo
2013-03-14 00:11:38,077 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 2 rs from peer cluster # 1
2013-03-14 00:11:38,077 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-6,60020,1359508078741
2013-03-14 00:11:38,077 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-3,60020,1359508079058
2013-03-14 00:11:39,080 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 4f3d5435-898c-47c2-8821-aeb01f9e87cc -> 74c750a5-4254-4a3b-ab12-063869759edd
2013-03-14 00:11:39,081 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication dnds1-4%2C60020%2C1363219866063.1363219868868 at 0
2013-03-14 00:11:39,090 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:3 and size: 0
2013-03-14 00:11:39,090 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #dnds1-4%2C60020%2C1363219866063.1363219868868 for position 1004 in hdfs://cluster/hbase/.oldlogs/dnds1-4%2C60020%2C1363219866063.1363219868868
2013-03-14 00:11:39,139 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server dnds1-13,60020,1363219887385: Writing replication status
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/dnds1-13,60020,1363219887385/1-dnds1-4,60020,1363219866063/dnds1-4%2C60020%2C1363219866063.1363219868868
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:900)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:894)
    at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:155)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:379)
2013-03-14 00:11:39,140 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
-------------------

Another one on dnds1-12:
-------------------
2013-03-14 00:11:35,905 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Atomically moving dnds1-8,60020,1363219865904's hlogs to my queue
2013-03-14 00:11:35,909 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-8%2C60020%2C1363219865904.1363219868852 with data
2013-03-14 00:11:35,937 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363217800957 with data 7470
2013-03-14 00:11:35,972 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363102598299 with data null
2013-03-14 00:11:35,973 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363109798542 with data null
2013-03-14 00:11:35,974 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363124199049 with data null
2013-03-14 00:11:35,975 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363116998811 with data null
2013-03-14 00:11:35,977 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363098998059 with data null
2013-03-14 00:11:35,978 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363026996488 with data null
2013-03-14 00:11:35,979 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363127799218 with data null
2013-03-14 00:11:35,991 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363034196762 with data null
2013-03-14 00:11:35,992 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1362533787045 with data null
2013-03-14 00:11:35,993 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-9%2C60020%2C1362533781275.1363023396288 with data null
2013-03-14 00:11:35,993 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: The multi list size is: 29
2013-03-14 00:11:36,018 WARN org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1436)
    at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:590)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2013-03-14 00:11:36,019 INFO org.apache.hadoop.hbase.metrics: new MBeanInfo
2013-03-14 00:11:36,025 INFO org.apache.hadoop.hbase.metrics: new MBeanInfo
2013-03-14 00:11:36,025 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 2 rs from peer cluster # 1
2013-03-14 00:11:36,025 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-3,60020,1359508079058
2013-03-14 00:11:36,025 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-11,60020,1359508114550
2013-03-14 00:11:36,033 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 2 rs from peer cluster # 1
2013-03-14 00:11:36,033 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-10,60020,1359508114634
2013-03-14 00:11:36,033 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-6,60020,1359508078741
2013-03-14 00:11:37,028 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 4f3d5435-898c-47c2-8821-aeb01f9e87cc -> 74c750a5-4254-4a3b-ab12-063869759edd
2013-03-14 00:11:37,030 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication dnds1-8%2C60020%2C1363219865904.1363219868852 at 0
2013-03-14 00:11:37,036 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 4f3d5435-898c-47c2-8821-aeb01f9e87cc -> 74c750a5-4254-4a3b-ab12-063869759edd
2013-03-14 00:11:37,038 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication dnds1-9%2C60020%2C1362533781275.1362533787045 at 0
2013-03-14 00:11:37,046 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:0 and size: 0
2013-03-14 00:11:37,046 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Done with the recovered queue 1-dnds1-8,60020,1363219865904
2013-03-14 00:11:37,048 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Finished recovering the queue
2013-03-14 00:11:37,048 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Source exiting 1
2013-03-14 00:11:37,055 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:1 and size: 0
2013-03-14 00:11:37,055 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #dnds1-9%2C60020%2C1362533781275.1362533787045 for position 372 in hdfs://cluster/hbase/.oldlogs/dnds1-9%2C60020%2C1362533781275.1362533787045
2013-03-14 00:11:37,074 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server dnds1-12,60020,1363219887328: Writing replication status
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/dnds1-12,60020,1363219887328/1-dnds1-9,60020,1362533781275-dnds1-11,60020,1362533806866-dnds1-8,60020,1363219865904/dnds1-9%2C60020%2C1362533781275.1362533787045
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:900)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:894)
    at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:155)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:379)
2013-03-14 00:11:37,074 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
-------------------------

We then restarted all RS a 2nd time.
This time they stay up, but now the log is filled with messages like these:
------------------------
2013-03-14 01:00:37,998 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication dnds1-12%2C60020%2C1363220608780.1363220609572 at 0
2013-03-14 01:00:38,001 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
2013-03-14 01:00:38,001 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
2013-03-14 01:00:38,001 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
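
For anyone comparing these aborts across several region servers, it may help to pull the missing znode out of each `KeeperErrorCode = NoNode for ...` line rather than reading the traces by eye. The following is only a rough sketch: the helper name `missing_znodes` is hypothetical (not part of HBase), and it assumes the `/hbase/replication/rs/<region server>/<queue id>/<hlog>` layout seen in the FATAL messages of the first two log excerpts.

```python
import re

# Matches the znode path reported in ZooKeeper's NoNodeException message,
# e.g. "KeeperErrorCode = NoNode for /hbase/replication/rs/<rs>/<queue>/<hlog>".
NO_NODE = re.compile(r"KeeperErrorCode = NoNode for (/\S+)")


def missing_znodes(log_text):
    """Return (region_server, queue_id, hlog) for every NoNode path under
    /hbase/replication/rs found in the given log text.

    Hypothetical helper for log triage; assumes the znode parent is /hbase,
    as it is in the logs quoted in this thread.
    """
    found = []
    for match in NO_NODE.finditer(log_text):
        parts = match.group(1).split("/")
        # Expected layout: ['', 'hbase', 'replication', 'rs', rs, queue, hlog]
        if len(parts) == 7 and parts[1:4] == ["hbase", "replication", "rs"]:
            found.append((parts[4], parts[5], parts[6]))
    return found


# Sample input copied from the dnds1-13 excerpt.
sample = (
    "2013-03-14 00:11:39,139 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: "
    "ABORTING region server dnds1-13,60020,1363219887385: Writing replication status\n"
    "org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for "
    "/hbase/replication/rs/dnds1-13,60020,1363219887385/1-dnds1-4,60020,1363219866063/"
    "dnds1-4%2C60020%2C1363219866063.1363219868868\n"
)

for rs, queue, hlog in missing_znodes(sample):
    print(rs, queue, hlog)
```

The WARN lines from `copyQueuesFromRSUsingMulti` carry no path, so they are skipped; only the `setData` failures that name a znode are reported.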