tatsuya6502

A kernel panic makes a small HBase cluster crash?

Mar 4th, 2011

---------- Forwarded message ----------
From: Tatsuya Kawano <[email protected]>
Date: 2011/3/5
Subject: A kernel panic makes a small HBase cluster crash?


Hi,

I got this question on the Hadoop User Group Japan mailing list, and I
need some help from the experts here. It looks like an HDFS issue, maybe
"append" related, but I'm not sure yet.

The person who posted the original question is testing the HA features
of HBase 0.90.0 and ASF Hadoop 0.20.2 (with
hadoop-core-0.20-append-r1056497.jar).

His test cluster has only 3 nodes.

Node 1: RS, DN, ZK   plus   HM, NN
Node 2: RS, DN, ZK
Node 3: RS, DN, ZK

dfs.replication = 3


He brought down Node 3 (which was handling Put requests from his test
client) by triggering a kernel panic ("echo c > /proc/sysrq-trigger").
However, the Region Servers on Node 1 and Node 2 also went down, with
the following message.

---------------------------------------------------------------------
2011-03-01 23:13:13,056 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region:
Object_Speed_Test,
5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
---------------------------------------------------------------------

He can easily reproduce this issue on his cluster.

Looking at the above message, I thought there was something wrong with
HDFS, and that the RS was reading a corrupted HFile or something
similar from HDFS.

We then checked the HDFS NN and DN logs, and it seems the NN was
confused and was not able to allocate a block for the write.

---------------------------------------------------------------------
2011-03-01 23:13:13,006 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hbase,hadoop        ip=/XX.XX.XX.XX   cmd=create      src=/hbase/
Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
1275904589980700621    dst=null        perm=hbase:supergroup:rw-r--r--
2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
1275904589980700621 could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /hbase/Object_Speed_Test/
1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
be replicated to 0 nodes, instead of 1
---------------------------------------------------------------------
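
For anyone scanning a full NameNode log for the same failure, a small
script can pull the file path and the two replica counts out of each
"could only be replicated to" message. This is only a sketch: the
regular expression is inferred from the log excerpt above, it assumes
the original unwrapped log lines (not the mail-wrapped form shown
here), and the names ReplicationError and parse_replication_errors are
hypothetical helpers, not part of Hadoop.

```python
import re
from typing import List, NamedTuple


class ReplicationError(NamedTuple):
    """One 'could only be replicated to N nodes' message (hypothetical type)."""
    path: str
    achieved: int   # replicas the NameNode managed to place
    required: int   # minimum replicas it needed


# Pattern inferred from the log excerpt; assumes each message is on one line.
PATTERN = re.compile(
    r"File (\S+) could only be replicated to (\d+) nodes, instead of (\d+)"
)


def parse_replication_errors(log_text: str) -> List[ReplicationError]:
    """Return every replication failure found in the given log text."""
    return [
        ReplicationError(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "java.io.IOException: File /hbase/Object_Speed_Test/"
        "1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 "
        "could only be replicated to 0 nodes, instead of 1\n"
    )
    for err in parse_replication_errors(sample):
        print(err.path, err.achieved, err.required)
```

An achieved count of 0 means the NameNode could not choose even one
target Data Node for the new block.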

It seems the kernel panic on Node 3 put HDFS into a bad state, so the
Region Servers couldn't write to or read from HDFS and had to shut
themselves down.

We couldn't find any more clues in the logs, but I pasted them here:

http://pastebin.com/NYkNS1c1


Since dfs.replication = 3, all Data Nodes were participating in the
HLog write at the time Node 3 got the kernel panic. I think this somehow
made the Name Node think those Data Nodes were all gone, but I
couldn't find the root cause of this issue.
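
To illustrate why a single node failure touches every in-flight write
on this cluster, here is a minimal sketch. It is not HDFS code and does
not model the real block placement policy; it only assumes that each
replica of a block goes to a distinct Data Node. The function names and
the five-node cluster used for comparison are made up for illustration.

```python
from itertools import combinations
from typing import List, Set


def possible_pipelines(datanodes: List[str], replication: int) -> List[Set[str]]:
    """All sets of distinct Data Nodes that could hold one block's replicas."""
    return [set(c) for c in combinations(datanodes, replication)]


def fraction_touching(datanodes: List[str], replication: int, failed: str) -> float:
    """Fraction of possible pipelines that include the failed node."""
    pipelines = possible_pipelines(datanodes, replication)
    hit = sum(1 for p in pipelines if failed in p)
    return hit / len(pipelines)


if __name__ == "__main__":
    # The cluster from this report: 3 Data Nodes, dfs.replication = 3.
    three = ["node1", "node2", "node3"]
    print(fraction_touching(three, 3, "node3"))  # 1.0

    # Hypothetical larger cluster, for comparison only.
    five = ["node1", "node2", "node3", "node4", "node5"]
    print(fraction_touching(five, 3, "node3"))   # 0.6
```

With three Data Nodes and a replication factor of 3 there is only one
possible pipeline, so every open file, including the HLogs of the
surviving Region Servers, had a replica on Node 3.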

He also checked the network and disk space, and he believes there was
no problem with either during his tests.

Thanks,

--
Tatsuya Kawano
Tokyo, Japan