Kedrup

Untitled

Aug 20th, 2018
The incomplete PG is most likely your problem, since an incomplete PG does not serve client I/O.
pg 68.2ec is stuck inactive since forever, current state incomplete, last acting [423,56,488]
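To confirm what the PG is waiting for, you can query it directly. This is a generic sketch using standard Ceph commands; the exact fields in the output vary by release:

# List the unhealthy PGs and why they are unhealthy
ceph health detail | grep 68.2ec
# Dump the PG's full peering state; the recovery_state section usually shows
# which down OSDs the PG still wants to probe before it can go active
ceph pg 68.2ec query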
Looking at the events, it seems this was caused by osd.285 going down:
From ceph.log.1:
ceph.log.1:2018-08-19 17:19:32.310537 osd.285 192.168.23.15:6877/10155 2284 : cluster [INF] 68.2ec restarting backfill on osd.381 from (0'0,0'0] MAX to 192035'869529763
ceph.log.1:2018-08-19 17:19:32.312677 osd.285 192.168.23.15:6877/10155 2286 : cluster [INF] 68.2ec restarting backfill on osd.600 from (0'0,0'0] MAX to 192035'869529763
ceph.log.1:2018-08-20 01:04:05.452808 osd.285 192.168.23.15:6877/10155 2302 : cluster [INF] 68.2ec restarting backfill on osd.197 from (0'0,0'0] MAX to 192035'869529763
ceph.log.1:2018-08-20 01:06:16.183875 osd.285 192.168.23.15:6877/10155 2305 : cluster [INF] 68.2ec restarting backfill on osd.81 from (0'0,0'0] MAX to 192035'869529763
ceph.log.1:2018-08-20 04:01:09.184159 osd.285 192.168.23.15:6811/1752275 1 : cluster [INF] 68.2ec restarting backfill on osd.335 from (0'0,0'0] MAX to 192035'869529763
^^^^
osd.285 seems to be the last OSD holding the latest version of this PG's replica.
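These events can be found by grepping the cluster log on the monitor node for the PG id (a sketch, assuming the default log location):

grep 68.2ec /var/log/ceph/ceph.log /var/log/ceph/ceph.log.1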
From ceph.log:
2018-08-20 05:25:47.451416 mon.0 192.168.11.2:6789/0 64292404 : cluster [INF] osd.285 marked itself down <-----------
2018-08-20 05:25:55.284967 mon.0 192.168.11.2:6789/0 64292419 : cluster [INF] pgmap v34470916: 53256 pgs: 53219 active+clean, 1 incomplete, 5 active+undersized+degraded+remapped+backfilling, 15 stale+active+remapped+wait_backfill, 1 active+undersized+degraded+remapped+wait_backfill, 5 active+undersized+remapped, 9 active+clean+scrubbing+deep, 1 active+remapped+backfilling; 358 TB data, 1085 TB used, 3468 TB / 4554 TB avail; 39809/1681304954 objects degraded (0.002%); 667651/1681304954 objects misplaced (0.040%)
2018-08-20 05:27:24.673943 osd.81 192.168.23.18:6825/2340091 2 : cluster [WRN] map e194708 wrongly marked me down
2018-08-20 05:27:46.880544 osd.197 192.168.23.13:6811/1423972 1 : cluster [WRN] map e194710 wrongly marked me down
2018-08-20 05:32:37.461916 mon.0 192.168.11.2:6789/0 64304195 : cluster [INF] osd.81 out (down for 314.041178)
2018-08-20 05:33:22.548076 mon.0 192.168.11.2:6789/0 64305536 : cluster [INF] osd.197 out (down for 335.826280)
2018-08-20 05:52:02.560491 mon.0 192.168.11.2:6789/0 64310961 : cluster [INF] osd.335 192.168.23.16:6856/3932313 boot
2018-08-20 06:24:56.338344 mon.0 192.168.11.2:6789/0 64314421 : cluster [INF] osd.81 192.168.23.18:6825/3023521 boot
2018-08-20 07:55:29.089247 mon.0 192.168.11.2:6789/0 64324821 : cluster [INF] osd.335 marked itself down
2018-08-20 07:56:07.000446 mon.0 192.168.11.2:6789/0 64324901 : cluster [INF] osd.335 192.168.23.16:6865/371702 boot
2018-08-20 07:57:48.264070 mon.0 192.168.11.2:6789/0 64325087 : cluster [INF] osd.197 192.168.23.13:6811/2577388 boot
2018-08-20 08:04:24.824382 mon.0 192.168.11.2:6789/0 64325893 : cluster [INF] osd.335 marked itself down
2018-08-20 08:06:44.732073 mon.0 192.168.11.2:6789/0 64326195 : cluster [INF] osd.335 192.168.23.16:6856/431465 boot
2018-08-20 08:06:57.373369 mon.0 192.168.11.2:6789/0 64326243 : cluster [INF] pgmap v34478962: 53256 pgs: 52863 active+clean, 15 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+undersized+degraded+remapped+backfilling, 2 active+remapped, 2 active+undersized+degraded+remapped+wait_backfill, 7 active+clean+scrubbing+deep, 61 active+remapped+backfilling, 1 down+incomplete, 142 active+remapped+wait_backfill, 160 active+recovery_wait+degraded+remapped; 358 TB data, 1086 TB used, 3460 TB / 4547 TB avail; 8155 kB/s rd, 7101 kB/s wr, 1443 op/s; 170971/1689619257 objects degraded (0.010%); 17034782/1689619257 objects misplaced (1.008%); 4706 MB/s, 2074 objects/s recovering
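The same down/out/boot timeline can be reconstructed by grepping the cluster log for the OSDs involved (again a sketch, with the same assumption about the log path):

grep -E 'osd\.(285|81|197|335) ' /var/log/ceph/ceph.log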
Looking at the OSDs that are still down, osd.285 is among them:
-13 0 root single_host
-14 0 host STG_temp
-2 436.79993 host STG_L3_13
437 7.28000 osd.437 down 0 1.00000
40 7.28000 osd.40 down 0 1.00000
-3 436.79993 host STG-L2-21
-5 400.39993 host STG_L3_15
-6 429.51993 host STG_L3_16
-7 429.51993 host STG_L2_18
583 7.28000 osd.583 down 0 1.00000
600 7.28000 osd.600 down 0 1.00000
-8 414.95993 host STG-L2-20
-9 429.51993 host STG_L2_22
271 7.28000 osd.271 down 0 1.00000
279 7.28000 osd.279 down 0 1.00000
285 7.28000 osd.285 down 0 1.00000 <-----------
65 7.28000 osd.65 down 0 1.00000
-10 422.23993 host STG_L3_14
-11 436.79993 host STG_L3_10
317 7.28000 osd.317 down 0 1.00000
326 7.28000 osd.326 down 0 1.00000
-12 436.79993 host STG_L3_12
355 7.28000 osd.355 down 0 1.00000
377 7.28000 osd.377 down 0 1.00000
381 7.28000 osd.381 down 0 1.00000
-4 407.67993 host STG-L2-19
6 7.28000 osd.6 down 0 1.00000
457 7.28000 osd.457 down 0 1.00000
533 7.28000 osd.533 down 0 1.00000
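The listing above is trimmed to the hosts and the down entries; a quick way to get a similar view on the live cluster (a sketch) is:

ceph osd tree | grep -wE 'host|down'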
My recommendation is to check what caused osd.285 to go down, fix that, and, if possible, start osd.285 again.
Take a look at /var/log/ceph/ceph-osd.285.log on the STG_L2_22 node to identify the event.
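For example (a sketch, assuming a systemd-based deployment; unit names and log paths may differ in your environment):

# On the STG_L2_22 node: check the OSD log for the last errors before it went down
less /var/log/ceph/ceph-osd.285.log
# Once the underlying issue is fixed, try bringing the OSD back
systemctl start ceph-osd@285
# Then watch the incomplete PG peer and recover
ceph -w | grep 68.2ec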