- ---------- Forwarded message ----------
- From: Tatsuya Kawano <[email protected]>
- Date: 2011/3/5
- Subject: A kernel panic makes small HBase cluster to crush?
- Hi,
- I got this question on the Hadoop User Group Japan mailing list, but I
- need some help from the experts here. It looks like an HDFS issue, maybe
- "append" related, but I'm not totally sure yet.
- The person who posted the original question is testing HA features in
- HBase 0.90.0 on ASF Hadoop 0.20.2 (with
- hadoop-core-0.20-append-r1056497.jar).
- His test cluster has only 3 nodes.
- Node 1: RS, DN, ZK plus HM, NN
- Node 2: RS, DN, ZK
- Node 3: RS, DN, ZK
- dfs.replication = 3
- He brought down Node 3 (which was handling Put requests from his test
- client) by triggering a kernel panic ("echo c > /proc/sysrq-trigger").
- But the Region Servers on Node 1 and Node 2 also went down, with the
- following message.
- ---------------------------------------------------------------------
- 2011-03-01 23:13:13,056 FATAL
- org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
- server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
- regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
- Forcing server shutdown
- org.apache.hadoop.hbase.DroppedSnapshotException: region:
- Object_Speed_Test,
- 5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
- ---------------------------------------------------------------------
- He can easily reproduce this issue on his cluster.
- Looking at the above message, I thought there was something wrong
- with HDFS, and that the RS was reading a corrupted HFile or similar
- from HDFS.
- Then we checked the HDFS NN and DN logs, and it seems the NN was
- confused and wasn't able to allocate a block for the write.
- ---------------------------------------------------------------------
- 2011-03-01 23:13:13,006 INFO
- org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
- ugi=hbase,hadoop ip=/XX.XX.XX.XX cmd=create src=/hbase/
- Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
- 1275904589980700621 dst=null perm=hbase:supergroup:rw-r--r--
- 2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
- handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
- 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
- DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
- from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
- Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
- 1275904589980700621 could only be replicated to 0 nodes, instead of 1
- java.io.IOException: File /hbase/Object_Speed_Test/
- 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
- be replicated to 0 nodes, instead of 1
- ---------------------------------------------------------------------
- It seems the kernel panic on Node 3 put HDFS into a bad state, so the
- Region Servers couldn't write to or read from HDFS and had to shut
- themselves down.
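
A minimal sketch for pulling the "could only be replicated to N nodes" errors out of a NameNode log, so the affected files and replica counts can be listed in one pass. It assumes each log entry sits on a single line (the excerpt above is wrapped by the mail client); the function and type names are hypothetical and not part of Hadoop.

```python
import re
from typing import Iterable, List, NamedTuple


class ReplicationFailure(NamedTuple):
    path: str       # HDFS file the NameNode tried to allocate a block for
    achieved: int   # number of DataNodes the NameNode could choose
    required: int   # minimum number of DataNodes it needed


# Matches the IOException text shown in the NameNode log above.
# Assumption: the message is on one line, so the path has no whitespace.
_PATTERN = re.compile(
    r"File (\S+) could only be replicated to (\d+) nodes, instead of (\d+)"
)


def find_replication_failures(lines: Iterable[str]) -> List[ReplicationFailure]:
    """Return one record per log line that reports a failed block allocation."""
    failures = []
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            failures.append(
                ReplicationFailure(
                    match.group(1), int(match.group(2)), int(match.group(3))
                )
            )
    return failures
```

Passing an open log file (for example `find_replication_failures(open(path))`, where `path` is whatever NameNode log is being inspected) yields the list of failed allocations in log order.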
- We couldn't find any more clues in the logs, but I pasted them here:
- http://pastebin.com/NYkNS1c1
- Since dfs.replication = 3, all Data Nodes were participating in the
- HLog write at the time Node 3 got the kernel panic. I think this
- somehow made the Name Node think those Data Nodes were all gone. But I
- couldn't find the root cause of this issue.
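
To illustrate why every write involved Node 3, here is a small sketch that counts how many possible replica sets include a failed node. It is a simplification: it treats any set of `dfs.replication` distinct nodes as a possible write pipeline and ignores rack awareness and the real HDFS placement policy. The node names and the function name are made up for the example.

```python
from itertools import combinations


def pipelines_containing(nodes, replication, failed):
    """Return (affected, total) counts of replica sets that include `failed`.

    Assumption: every set of `replication` distinct nodes is a possible
    pipeline; the actual HDFS block placement policy is not modelled.
    """
    all_sets = list(combinations(nodes, replication))
    affected = [s for s in all_sets if failed in s]
    return len(affected), len(all_sets)


# The 3-node test cluster with dfs.replication = 3: the only possible
# replica set is all three nodes, so it always contains Node 3.
print(pipelines_containing(["node1", "node2", "node3"], 3, "node3"))  # (1, 1)
```

With a fourth node the same call reports 3 of 4 replica sets affected, so some writes would not touch the failed node at all.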
- Also, he checked the network and disk space, and he believes there
- was no problem with either while he was testing.
- Thanks,
- --
- Tatsuya Kawano
- Tokyo, Japan