drbd integrity issues

a guest
Mar 25th, 2013
  1. *************************************************************************************************************************
  2.  
  3. PHASE 1: VERIFICATION of HOST1
  4.  
  5. First verification after "data-integrity-alg" enabled: ALL GOOD
  6.  
  7. ...
  8. Mar 24 17:31:22 host1 kernel: block drbd0: conn( Connected -> VerifyS )
  9. Mar 24 17:31:22 host1 kernel: block drbd0: Starting Online Verify from sector 0
  10. Mar 24 19:33:31 host1 kernel: block drbd0: Online verify done (total 7328 sec; paused 0 sec; 119932 K/sec)
  11. Mar 24 19:33:31 host1 kernel: block drbd0: conn( VerifyS -> Connected )
  12. Mar 24 19:33:31 host1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
  13. Mar 24 19:33:31 host1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
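
As a sanity check on the "Online verify done" line above, the duration and the average rate can be multiplied to estimate how much data the verify pass covered. This is only a sketch: it assumes the "K" in the log means KiB (1024 bytes), and the variable names are illustrative.

```python
# Figures copied from the "Online verify done" log line.
total_sec = 7328
rate_kib_per_sec = 119932  # assumption: "K" in the log means KiB

# Amount of data covered by the verify run.
verified_kib = total_sec * rate_kib_per_sec
verified_gib = verified_kib / (1024 * 1024)

print(f"{verified_kib} KiB ~= {verified_gib:.1f} GiB")
```

Under that assumption the device is roughly 838 GiB, so the sector offsets that appear later in the logs fall well inside it.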
  14.  
  15. Second verification after "data-integrity-alg" enabled: BOTH SERVERS HUNG
  16.  
  17. Mar 25 00:42:01 host1 kernel: block drbd0: conn( Connected -> VerifyS )
  18. Mar 25 00:42:01 host1 kernel: block drbd0: Starting Online Verify from sector 0
  19. Mar 25 01:07:07 host1 kernel: block drbd0: [drbd0_worker/3644] sock_sendmsg time expired, ko = 4294967295
  20. Mar 25 01:07:13 host1 kernel: block drbd0: [drbd0_worker/3644] sock_sendmsg time expired, ko = 4294967294
  21. Mar 25 01:07:19 host1 kernel: block drbd0: [drbd0_worker/3644] sock_sendmsg time expired, ko = 4294967293
  22. ...
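
The very large "ko" values above are consistent with a 32-bit unsigned counter being decremented past zero, which is what one would expect if ko-count were left at 0 (disabled). That reading is an assumption, the log itself does not say so. A minimal sketch of the arithmetic:

```python
MASK32 = 0xFFFFFFFF  # largest value of a 32-bit unsigned integer


def decrement_u32(value: int) -> int:
    """Decrement with 32-bit unsigned wraparound."""
    return (value - 1) & MASK32


ko = 0  # assumption: the counter starts at 0
values = []
for _ in range(3):
    ko = decrement_u32(ko)
    values.append(ko)

print(values)  # [4294967295, 4294967294, 4294967293]
```

The three results match the three "ko =" values logged six seconds apart.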
  23.  
  24.  
  25.  
  26. PHASE 1: VERIFICATION of HOST2
  27.  
  28. First verification after "data-integrity-alg" enabled: ALL GOOD
  29.  
  30. ...
  31. Mar 24 17:31:22 host2 kernel: block drbd0: conn( Connected -> VerifyT )
  32. Mar 24 17:31:22 host2 kernel: block drbd0: Online Verify start sector: 0
  33. Mar 24 19:33:31 host2 kernel: block drbd0: Online verify done (total 7328 sec; paused 0 sec; 119932 K/sec)
  34. Mar 24 19:33:31 host2 kernel: block drbd0: conn( VerifyT -> Connected )
  35. Mar 24 19:33:31 host2 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
  36. Mar 24 19:33:31 host2 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
  37.  
  38. Second verification after "data-integrity-alg" enabled: BOTH SERVERS HUNG
  39.  
  40. Mar 25 00:42:01 host2 kernel: block drbd0: conn( Connected -> VerifyT )
  41. Mar 25 00:42:01 host2 kernel: block drbd0: Online Verify start sector: 0
  42. Mar 25 01:06:58 host2 kernel: block drbd0: kvm[172358] Concurrent local write detected! [DISCARD L] new: 989901215s +4096; pending: 989901215s +4096
  43. Mar 25 01:11:08 host2 kernel: block drbd0: [drbd0_worker/3754] sock_sendmsg time expired, ko = 4294967295
  44. Mar 25 01:11:14 host2 kernel: block drbd0: [drbd0_worker/3754] sock_sendmsg time expired, ko = 4294967294
  45. Mar 25 01:11:20 host2 kernel: block drbd0: [drbd0_worker/3754] sock_sendmsg time expired, ko = 4294967293
  46. ...
  47.  
  48. *************************************************************************************************************************
  49.  
  50. PHASE 2: HOST1
  51.  
  52. Resync after host2 rebooted: everything seems okay
  53.  
  54. Mar 25 09:01:00 host1 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
  55. Mar 25 09:01:00 host1 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
  56. Mar 25 09:01:00 host1 kernel: block drbd0: conn( WFConnection -> WFReportParams )
  57. Mar 25 09:01:00 host1 kernel: block drbd0: Starting asender thread (from drbd0_receiver [136758])
  58. Mar 25 09:01:00 host1 kernel: block drbd0: data-integrity-alg: crc32c
  59. Mar 25 09:01:00 host1 kernel: block drbd0: drbd_sync_handshake:
  60. Mar 25 09:01:00 host1 kernel: block drbd0: self 5114F9434B703F5D:9C28E5306F77E971:4ED5A202300A9955:4ED4A202300A9955 bits:17622 flags:0
  61. Mar 25 09:01:00 host1 kernel: block drbd0: peer 9C28E5306F77E970:0000000000000000:4ED5A202300A9954:4ED4A202300A9955 bits:130048 flags:2
  62. Mar 25 09:01:00 host1 kernel: block drbd0: uuid_compare()=1 by rule 70
  63. Mar 25 09:01:00 host1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
  64. Mar 25 09:01:00 host1 kernel: block drbd0: peer( Secondary -> Primary )
  65. Mar 25 09:01:00 host1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
  66. Mar 25 09:01:00 host1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
  67. Mar 25 09:01:00 host1 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
  68. Mar 25 09:01:00 host1 kernel: block drbd0: Began resync as SyncSource (will sync 590684 KB [147671 bits set]).
  69. Mar 25 09:01:00 host1 kernel: block drbd0: updated sync UUID 5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971:4ED5A202300A9955
  70. Mar 25 09:01:07 host1 kernel: block drbd0: Resync done (total 6 sec; paused 0 sec; 98444 K/sec)
  71. Mar 25 09:01:07 host1 kernel: block drbd0: updated UUIDs 5114F9434B703F5D:0000000000000000:9C29E5306F77E971:9C28E5306F77E971
  72. Mar 25 09:01:07 host1 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
  73. Mar 25 09:01:07 host1 kernel: block drbd0: bitmap WRITE of 6375 pages took 55 jiffies
  74. Mar 25 09:01:07 host1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
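
The "Began resync" line above reports both a size and a bit count. Assuming each bit in the bitmap tracks one 4 KiB block (an assumption, but one that fits these numbers exactly), the two figures can be cross-checked like this:

```python
# Figures copied from the "Began resync as SyncSource" log line.
bits_set = 147671
reported_kib = 590684

kib_per_bit = 4  # assumption: one bitmap bit covers a 4 KiB block

resync_kib = bits_set * kib_per_bit
print(resync_kib, resync_kib == reported_kib)  # 590684 True
```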
  75.  
  76. Split brain after about 10 minutes of normal operation
  77.  
  78. Mar 25 09:11:43 host1 kernel: block drbd0: Digest mismatch, buffer modified by upper layers during write: 1274046648s +4096
  79. Mar 25 09:11:43 host1 kernel: block drbd0: sock was shut down by peer
  80. Mar 25 09:11:43 host1 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
  81. Mar 25 09:11:43 host1 kernel: block drbd0: short read expecting header on sock: r=0
  82. Mar 25 09:11:43 host1 kernel: block drbd0: new current UUID 72BF71273D52849B:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971
  83. Mar 25 09:11:43 host1 kernel: block drbd0: meta connection shut down by peer.
  84. Mar 25 09:11:43 host1 kernel: block drbd0: asender terminated
  85. Mar 25 09:11:43 host1 kernel: block drbd0: Terminating asender thread
  86. Mar 25 09:11:43 host1 kernel: block drbd0: Connection closed
  87. Mar 25 09:11:43 host1 kernel: block drbd0: conn( BrokenPipe -> Unconnected )
  88. Mar 25 09:11:43 host1 kernel: block drbd0: receiver terminated
  89. Mar 25 09:11:43 host1 kernel: block drbd0: Restarting receiver thread
  90. Mar 25 09:11:43 host1 kernel: block drbd0: receiver (re)started
  91. Mar 25 09:11:43 host1 kernel: block drbd0: conn( Unconnected -> WFConnection )
  92. Mar 25 09:11:43 host1 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
  93. Mar 25 09:11:43 host1 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
  94. Mar 25 09:11:43 host1 kernel: block drbd0: conn( WFConnection -> WFReportParams )
  95. Mar 25 09:11:43 host1 kernel: block drbd0: Starting asender thread (from drbd0_receiver [136758])
  96. Mar 25 09:11:43 host1 kernel: block drbd0: data-integrity-alg: crc32c
  97. Mar 25 09:11:43 host1 kernel: block drbd0: drbd_sync_handshake:
  98. Mar 25 09:11:43 host1 kernel: block drbd0: self 72BF71273D52849B:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971 bits:175 flags:0
  99. Mar 25 09:11:43 host1 kernel: block drbd0: peer 0225AA4EFAB3BE37:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971 bits:0 flags:0
  100. Mar 25 09:11:43 host1 kernel: block drbd0: uuid_compare()=100 by rule 90
  101. Mar 25 09:11:43 host1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
  102. Mar 25 09:11:43 host1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
  103. Mar 25 09:11:43 host1 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
  104. Mar 25 09:11:43 host1 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
  105. Mar 25 09:11:43 host1 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
  106. Mar 25 09:11:43 host1 kernel: block drbd0: conn( WFReportParams -> Disconnecting )
  107. Mar 25 09:11:43 host1 kernel: block drbd0: error receiving ReportState, l: 4!
  108. Mar 25 09:11:43 host1 kernel: block drbd0: asender terminated
  109. Mar 25 09:11:43 host1 kernel: block drbd0: Terminating asender thread
  110. Mar 25 09:11:43 host1 kernel: block drbd0: Connection closed
  111. Mar 25 09:11:43 host1 kernel: block drbd0: conn( Disconnecting -> StandAlone )
  112. Mar 25 09:11:43 host1 kernel: block drbd0: receiver terminated
  113. Mar 25 09:11:43 host1 kernel: block drbd0: Terminating receiver thread
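
The "Digest mismatch, buffer modified by upper layers during write" message above describes a checksum that no longer matches because the data changed after it was checksummed. The sketch below only illustrates that effect. It uses `zlib.crc32` (plain CRC-32) as a stand-in for crc32c, and the buffer handling is hypothetical, not DRBD's actual code path.

```python
import zlib


def digest(buf: bytes) -> int:
    # Stand-in checksum: plain CRC-32, not the crc32c configured in the logs.
    return zlib.crc32(buf)


block = bytearray(4096)          # one 4 KiB write, like the "+4096" in the log
sent_digest = digest(block)      # the sender checksums the buffer
block[100] ^= 0xFF               # the buffer is modified before it goes out
received_digest = digest(block)  # the receiver checksums what actually arrived

mismatch = sent_digest != received_digest
print("digest mismatch:", mismatch)  # digest mismatch: True
```

If the peer treats such a mismatch as a protocol error and drops the connection while both nodes are Primary, each side keeps writing on its own, which fits the split brain that follows in the log.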
  114.  
  115. I then had to disable data-integrity-alg and manually resolve the split brain to get both servers working again
  116.  
  117.  
  118. PHASE 2: HOST2
  119.  
  120. Resync after host2 rebooted: everything seems okay
  121.  
  122. Mar 25 09:01:00 host2 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
  123. Mar 25 09:01:00 host2 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
  124. Mar 25 09:01:00 host2 kernel: block drbd0: conn( WFConnection -> WFReportParams )
  125. Mar 25 09:01:00 host2 kernel: block drbd0: Starting asender thread (from drbd0_receiver [2450])
  126. Mar 25 09:01:00 host2 kernel: block drbd0: data-integrity-alg: crc32c
  127. Mar 25 09:01:00 host2 kernel: block drbd0: drbd_sync_handshake:
  128. Mar 25 09:01:00 host2 kernel: block drbd0: self 9C28E5306F77E970:0000000000000000:4ED5A202300A9954:4ED4A202300A9955 bits:130048 flags:0
  129. Mar 25 09:01:00 host2 kernel: block drbd0: peer 5114F9434B703F5D:9C28E5306F77E971:4ED5A202300A9955:4ED4A202300A9955 bits:17622 flags:0
  130. Mar 25 09:01:00 host2 kernel: block drbd0: uuid_compare()=-1 by rule 50
  131. Mar 25 09:01:00 host2 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
  132. Mar 25 09:01:00 host2 kernel: block drbd0: role( Secondary -> Primary )
  133. Mar 25 09:01:00 host2 kernel: DLM (built Mar 18 2013 06:28:24) installed
  134. Mar 25 09:01:00 host2 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
  135. Mar 25 09:01:01 host2 kernel: block drbd0: updated sync uuid 9C29E5306F77E971:0000000000000000:4ED5A202300A9954:4ED4A202300A9955
  136. Mar 25 09:01:01 host2 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
  137. Mar 25 09:01:01 host2 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
  138. Mar 25 09:01:01 host2 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
  139. Mar 25 09:01:01 host2 kernel: block drbd0: Began resync as SyncTarget (will sync 590684 KB [147671 bits set]).
  140. Mar 25 09:01:07 host2 kernel: block drbd0: Resync done (total 6 sec; paused 0 sec; 98444 K/sec)
  141. Mar 25 09:01:07 host2 kernel: block drbd0: updated UUIDs 5114F9434B703F5D:0000000000000000:9C29E5306F77E971:9C28E5306F77E971
  142. Mar 25 09:01:07 host2 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
  143. Mar 25 09:01:07 host2 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
  144. Mar 25 09:01:07 host2 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
  145. Mar 25 09:01:07 host2 kernel: block drbd0: bitmap WRITE of 6375 pages took 22 jiffies
  146. Mar 25 09:01:07 host2 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
  147.  
  148. Split brain after about 10 minutes of normal operation
  149.  
  150. Mar 25 09:11:43 host2 kernel: block drbd0: Digest integrity check FAILED: 1274046648s +4096
  151. Mar 25 09:11:43 host2 kernel: block drbd0: error receiving Data, l: 4124!
  152. Mar 25 09:11:43 host2 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
  153. Mar 25 09:11:43 host2 kernel: block drbd0: new current UUID 0225AA4EFAB3BE37:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971
  154. Mar 25 09:11:43 host2 kernel: block drbd0: asender terminated
  155. Mar 25 09:11:43 host2 kernel: block drbd0: Terminating asender thread
  156. Mar 25 09:11:43 host2 kernel: block drbd0: Connection closed
  157. Mar 25 09:11:43 host2 kernel: block drbd0: conn( ProtocolError -> Unconnected )
  158. Mar 25 09:11:43 host2 kernel: block drbd0: receiver terminated
  159. Mar 25 09:11:43 host2 kernel: block drbd0: Restarting receiver thread
  160. Mar 25 09:11:43 host2 kernel: block drbd0: receiver (re)started
  161. Mar 25 09:11:43 host2 kernel: block drbd0: conn( Unconnected -> WFConnection )
  162. Mar 25 09:11:44 host2 kernel: block drbd0: Handshake successful: Agreed network protocol version 96
  163. Mar 25 09:11:44 host2 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
  164. Mar 25 09:11:44 host2 kernel: block drbd0: conn( WFConnection -> WFReportParams )
  165. Mar 25 09:11:44 host2 kernel: block drbd0: Starting asender thread (from drbd0_receiver [2450])
  166. Mar 25 09:11:44 host2 kernel: block drbd0: data-integrity-alg: crc32c
  167. Mar 25 09:11:44 host2 kernel: block drbd0: drbd_sync_handshake:
  168. Mar 25 09:11:44 host2 kernel: block drbd0: self 0225AA4EFAB3BE37:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971 bits:0 flags:0
  169. Mar 25 09:11:44 host2 kernel: block drbd0: peer 72BF71273D52849B:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971 bits:175 flags:0
  170. Mar 25 09:11:44 host2 kernel: block drbd0: uuid_compare()=100 by rule 90
  171. Mar 25 09:11:44 host2 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
  172. Mar 25 09:11:44 host2 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
  173. Mar 25 09:11:44 host2 kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
  174. Mar 25 09:11:44 host2 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
  175. Mar 25 09:11:44 host2 kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
  176. Mar 25 09:11:44 host2 kernel: block drbd0: conn( WFReportParams -> Disconnecting )
  177. Mar 25 09:11:44 host2 kernel: block drbd0: error receiving ReportState, l: 4!
  178. Mar 25 09:11:44 host2 kernel: block drbd0: asender terminated
  179. Mar 25 09:11:44 host2 kernel: block drbd0: Terminating asender thread
  180. Mar 25 09:11:44 host2 kernel: block drbd0: Connection closed
  181. Mar 25 09:11:44 host2 kernel: block drbd0: conn( Disconnecting -> StandAlone )
  182. Mar 25 09:11:44 host2 kernel: block drbd0: receiver terminated
  183. Mar 25 09:11:44 host2 kernel: block drbd0: Terminating receiver thread
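
The "self" and "peer" lines in the second handshake above show why the nodes refused to reconnect: each side has a different first UUID, while the second UUID is identical on both. The sketch below is a simplified illustration, not DRBD's real uuid_compare() logic. It assumes the four colon-separated fields are current, bitmap and two history UUIDs, and the helper names are hypothetical.

```python
ZERO = "0" * 16


def parse_uuids(line: str) -> dict:
    """Split a 'self ...' or 'peer ...' handshake line into its four UUID slots."""
    fields = line.split()
    current, bitmap, hist1, hist2 = fields[1].split(":")
    return {"current": current, "bitmap": bitmap, "history": (hist1, hist2)}


def looks_like_split_brain(me: dict, peer: dict) -> bool:
    # Simplified check (assumption): both sides diverged from the same data
    # generation, so the current UUIDs differ but the bitmap UUIDs match.
    return (
        me["current"] != peer["current"]
        and me["bitmap"] == peer["bitmap"]
        and me["bitmap"] != ZERO
    )


# Copied from the host2 handshake at 09:11:44.
self_line = "self 0225AA4EFAB3BE37:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971 bits:0 flags:0"
peer_line = "peer 72BF71273D52849B:5114F9434B703F5D:9C29E5306F77E971:9C28E5306F77E971 bits:175 flags:0"

result = looks_like_split_brain(parse_uuids(self_line), parse_uuids(peer_line))
print(result)  # True
```

Both nodes generated a new first UUID at 09:11:43 when the connection dropped, and the UUID they previously shared (5114F9434B703F5D) moved into the second slot on each side.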
  184.  
  185. I then had to disable data-integrity-alg and manually resolve the split brain to get both servers working again
  186. *************************************************************************************************************************