- The incomplete PG is likely the source of your problem, as an incomplete PG does not serve client I/O.
- pg 68.2ec is stuck inactive since forever, current state incomplete, last acting [423,56,488]
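- To confirm which PG is blocking client I/O and why, the standard checks are below (a minimal sketch, assuming admin access from a monitor node; the exact field names in the query output vary slightly between Ceph releases):
- ceph health detail | grep 68.2ec    # the stuck PG and its acting set
- ceph pg dump_stuck inactive         # all PGs currently not serving I/O
- ceph pg 68.2ec query                # inspect "recovery_state", e.g. "down_osds_we_would_probe" / "peering_blocked_by"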
- Looking at the events, it seems this was caused by osd.285 going down:
- From ceph.log.1
- ceph.log.1:2018-08-19 17:19:32.310537 osd.285 192.168.23.15:6877/10155 2284 : cluster [INF] 68.2ec restarting backfill on osd.381 from (0'0,0'0] MAX to 192035'869529763
- ceph.log.1:2018-08-19 17:19:32.312677 osd.285 192.168.23.15:6877/10155 2286 : cluster [INF] 68.2ec restarting backfill on osd.600 from (0'0,0'0] MAX to 192035'869529763
- ceph.log.1:2018-08-20 01:04:05.452808 osd.285 192.168.23.15:6877/10155 2302 : cluster [INF] 68.2ec restarting backfill on osd.197 from (0'0,0'0] MAX to 192035'869529763
- ceph.log.1:2018-08-20 01:06:16.183875 osd.285 192.168.23.15:6877/10155 2305 : cluster [INF] 68.2ec restarting backfill on osd.81 from (0'0,0'0] MAX to 192035'869529763
- ceph.log.1:2018-08-20 04:01:09.184159 osd.285 192.168.23.15:6811/1752275 1 : cluster [INF] 68.2ec restarting backfill on osd.335 from (0'0,0'0] MAX to 192035'869529763
- ^^^^
- osd.285 seems to be the last OSD holding the most recent version of this PG.
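- These "restarting backfill" events were taken from the rotated cluster log on the monitor node; the same log can be used to cross-check which OSD still reports this PG (a sketch, assuming the default /var/log/ceph/ location on the mon host):
- grep '68.2ec' /var/log/ceph/ceph.log.1                   # all cluster-log events for this PG
- grep '68.2ec' /var/log/ceph/ceph.log.1 | grep 'osd\.285' # only the events reported by osd.285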
- From ceph.log
- 2018-08-20 05:25:47.451416 mon.0 192.168.11.2:6789/0 64292404 : cluster [INF] osd.285 marked itself down <-----------
- 2018-08-20 05:25:55.284967 mon.0 192.168.11.2:6789/0 64292419 : cluster [INF] pgmap v34470916: 53256 pgs: 53219 active+clean, 1 incomplete, 5 active+undersized+degraded+remapped+backfilling, 15 stale+active+remapped+wait_backfill, 1 active+undersized+degraded+remapped+wait_backfill, 5 active+undersized+remapped, 9 active+clean+scrubbing+deep, 1 active+remapped+backfilling; 358 TB data, 1085 TB used, 3468 TB / 4554 TB avail; 39809/1681304954 objects degraded (0.002%); 667651/1681304954 objects misplaced (0.040%)
- 2018-08-20 05:27:24.673943 osd.81 192.168.23.18:6825/2340091 2 : cluster [WRN] map e194708 wrongly marked me down
- 2018-08-20 05:27:46.880544 osd.197 192.168.23.13:6811/1423972 1 : cluster [WRN] map e194710 wrongly marked me down
- 2018-08-20 05:32:37.461916 mon.0 192.168.11.2:6789/0 64304195 : cluster [INF] osd.81 out (down for 314.041178)
- 2018-08-20 05:33:22.548076 mon.0 192.168.11.2:6789/0 64305536 : cluster [INF] osd.197 out (down for 335.826280)
- 2018-08-20 05:52:02.560491 mon.0 192.168.11.2:6789/0 64310961 : cluster [INF] osd.335 192.168.23.16:6856/3932313 boot
- 2018-08-20 06:24:56.338344 mon.0 192.168.11.2:6789/0 64314421 : cluster [INF] osd.81 192.168.23.18:6825/3023521 boot
- 2018-08-20 07:55:29.089247 mon.0 192.168.11.2:6789/0 64324821 : cluster [INF] osd.335 marked itself down
- 2018-08-20 07:56:07.000446 mon.0 192.168.11.2:6789/0 64324901 : cluster [INF] osd.335 192.168.23.16:6865/371702 boot
- 2018-08-20 07:57:48.264070 mon.0 192.168.11.2:6789/0 64325087 : cluster [INF] osd.197 192.168.23.13:6811/2577388 boot
- 2018-08-20 08:04:24.824382 mon.0 192.168.11.2:6789/0 64325893 : cluster [INF] osd.335 marked itself down
- 2018-08-20 08:06:44.732073 mon.0 192.168.11.2:6789/0 64326195 : cluster [INF] osd.335 192.168.23.16:6856/431465 boot
- 2018-08-20 08:06:57.373369 mon.0 192.168.11.2:6789/0 64326243 : cluster [INF] pgmap v34478962: 53256 pgs: 52863 active+clean, 15 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+undersized+degraded+remapped+backfilling, 2 active+remapped, 2 active+undersized+degraded+remapped+wait_backfill, 7 active+clean+scrubbing+deep, 61 active+remapped+backfilling, 1 down+incomplete, 142 active+remapped+wait_backfill, 160 active+recovery_wait+degraded+remapped; 358 TB data, 1086 TB used, 3460 TB / 4547 TB avail; 8155 kB/s rd, 7101 kB/s wr, 1443 op/s; 170971/1689619257 objects degraded (0.010%); 17034782/1689619257 objects misplaced (1.008%); 4706 MB/s, 2074 objects/s recovering
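- The entries above show several OSDs flapping ("wrongly marked me down", marked down, then booting again). A quick way to reconstruct the timeline for osd.285 and to keep the cluster from rebalancing while it is investigated (a sketch, assuming the default log path on the mon node; only set the flag if osd.285 will be brought back soon, and unset it afterwards):
- grep 'osd\.285 ' /var/log/ceph/ceph.log /var/log/ceph/ceph.log.1   # down / out / boot history of osd.285
- ceph osd set noout     # optional: pause further data movement during the repair
- ceph osd unset noout   # remove the flag once osd.285 is back up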
- Looking at the OSDs that are still down, osd.285 is among them:
- -13 0 root single_host
- -14 0 host STG_temp
- -2 436.79993 host STG_L3_13
- 437 7.28000 osd.437 down 0 1.00000
- 40 7.28000 osd.40 down 0 1.00000
- -3 436.79993 host STG-L2-21
- -5 400.39993 host STG_L3_15
- -6 429.51993 host STG_L3_16
- -7 429.51993 host STG_L2_18
- 583 7.28000 osd.583 down 0 1.00000
- 600 7.28000 osd.600 down 0 1.00000
- -8 414.95993 host STG-L2-20
- -9 429.51993 host STG_L2_22
- 271 7.28000 osd.271 down 0 1.00000
- 279 7.28000 osd.279 down 0 1.00000
- 285 7.28000 osd.285 down 0 1.00000 <-----------
- 65 7.28000 osd.65 down 0 1.00000
- -10 422.23993 host STG_L3_14
- -11 436.79993 host STG_L3_10
- 317 7.28000 osd.317 down 0 1.00000
- 326 7.28000 osd.326 down 0 1.00000
- -12 436.79993 host STG_L3_12
- 355 7.28000 osd.355 down 0 1.00000
- 377 7.28000 osd.377 down 0 1.00000
- 381 7.28000 osd.381 down 0 1.00000
- -4 407.67993 host STG-L2-19
- 6 7.28000 osd.6 down 0 1.00000
- 457 7.28000 osd.457 down 0 1.00000
- 533 7.28000 osd.533 down 0 1.00000
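- The listing above is an excerpt of "ceph osd tree"; to re-check which OSDs are still down and where osd.285 lives (a sketch; newer releases also accept a state filter such as "ceph osd tree down", older ones need the grep):
- ceph osd tree | grep -w down   # all OSDs currently reported down
- ceph osd find 285              # host and CRUSH location of osd.285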
- My recommendation is to check what caused osd.285 to go down, fix that, and, if possible, start osd.285 again.
- Take a look at /var/log/ceph/ceph-osd.285.log on the STG_L2_22 node to identify the event; a short sketch of the commands follows below.
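- A minimal sketch of those steps, run on the STG_L2_22 node (assuming a systemd deployment where the OSD unit is ceph-osd@285; on sysvinit hosts the start command differs):
- grep -iE 'error|fail|assert|abort' /var/log/ceph/ceph-osd.285.log | tail -n 50   # last failure messages before the OSD went down
- systemctl status ceph-osd@285   # did the daemon crash, or was it stopped?
- systemctl start ceph-osd@285    # once the root cause is fixed, start it; the incomplete PG should then be able to peer again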