a guest
Jul 19th, 2018
  1. Hey,
  2.  
  3. I've been looking at this bug for the last couple of days. Here is what I have so far:
  4.  
  5. 1. I could reproduce it on our system with Ubuntu 18.04 and the current upstream kernel (v4.18-rc4).
  6. (Logs will be attached next.)
  7.  
  8. 2. I wrote a simple patch that avoids the crash, but I'm concerned about another trace that shows up afterwards.
  9.  
  10. Explaining the patch: at some point the fcp_wq pointer (lpfc_queue) in phba->sli4_hba becomes NULL, probably during the recovery process triggered by the EEH injection.
  11. So I just check for that near the top of lpfc_sli_flush_fcp_rings() and return early:
  12.  
  13. ## Patch ##
  14.  
  15. ---------------------------------------------------------------------
  16. --- a/drivers/scsi/lpfc/lpfc_sli.c
  17. +++ b/drivers/scsi/lpfc/lpfc_sli.c
  18. @@ -3981,6 +3981,11 @@ lpfc_sli_flush_fcp_rings(struct lpfc_hba *phba)
  19.  	phba->hba_flag |= HBA_FCP_IOQ_FLUSH;
  20.  	spin_unlock_irq(&phba->hbalock);
  21.  
  22. +	if (unlikely(!phba->sli4_hba.fcp_wq)) {
  23. +		printk("lpfc_sli_flush_fcp_rings -- unlikely(!phba->sli4_hba.fcp_wq) -- rrg\n");
  24. +		return;
  25. +	}
  26. +
  27.  	/* Look on all the FCP Rings for the iotag */
  28.  	if (phba->sli_rev >= LPFC_SLI_REV4) {
  29.  		for (i = 0; i < phba->cfg_fcp_io_channel; i++) {
  30.  
  31. ----------------------------------------------------------------------
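
To make the idea behind the patch easier to see outside the driver, here is a minimal user-space sketch of the same guard. It is not lpfc code: `model_hba`, `model_queue` and `model_flush_fcp_rings` are hypothetical names, and the assumption that fcp_wq is an array of queue pointers that can be torn down as a whole is mine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a work queue; not the real lpfc_queue. */
struct model_queue {
	int pending;	/* number of outstanding I/Os on this queue */
};

/* Hypothetical stand-in for the HBA state touched by the flush path. */
struct model_hba {
	struct model_queue **fcp_wq;	/* NULL once the queues are torn down */
	int nr_wq;			/* number of entries in fcp_wq */
};

/*
 * Flush every queue. Returns the number of queues flushed, or -1 if the
 * queue array is already gone (the case the patch guards against).
 */
int model_flush_fcp_rings(struct model_hba *hba)
{
	int i, flushed = 0;

	if (!hba->fcp_wq)
		return -1;	/* nothing to walk; bail out instead of dereferencing NULL */

	for (i = 0; i < hba->nr_wq; i++) {
		if (!hba->fcp_wq[i])
			continue;	/* skip slots that were never set up */
		hba->fcp_wq[i]->pending = 0;
		flushed++;
	}
	return flushed;
}
```

Note that a guard like this only hides the symptom: it does not explain why the flush path runs after the queues were released, which is probably the more interesting question.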
  32.  
  33. 3. After applying the patch, the system no longer crashes, but I see the following traces:
  34.  
  35.  
  36. [ 707.437261] ------------[ cut here ]------------
  37. [ 707.437313] Trying to free already-free IRQ 89
  38. [ 707.437363] WARNING: CPU: 34 PID: 782 at kernel/irq/manage.c:1583 __free_irq+0xe0/0x420
  39. [ 707.437417] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter kvm_hv kvm input_leds joydev mac_hid at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash ipmi_powernv ipmi_devintf sch_fq_codel uio ipmi_msghandler ib_iser ibmpowernv mtd rdma_cm opal_prd iw_cm vmx_crypto ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ses enclosure scsi_transport_sas ast i2c_algo_bit
  40. [ 707.437916] ttm drm_kms_helper lpfc syscopyarea sysfillrect sysimgblt fb_sys_fops drm qla2xxx nvmet_fc nvmet nvme_fc nvme_fabrics i40e aacraid megaraid_sas crct10dif_vpmsum crc32c_vpmsum scsi_transport_fc drm_panel_orientation_quirks
  41. [ 707.438063] CPU: 34 PID: 782 Comm: eehd Not tainted 4.18.0-rc4-rosattig+ #2
  42. [ 707.438109] NIP: c00000000019b530 LR: c00000000019b52c CTR: 000000003003dfbc
  43. [ 707.438162] REGS: c0000007f30ef680 TRAP: 0700 Not tainted (4.18.0-rc4-rosattig+)
  44. [ 707.438215] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48044222 XER: 20040000
  45. [ 707.438290] CFAR: c0000000001125f0 IRQMASK: 1
  46. GPR00: c00000000019b52c c0000007f30ef900 c00000000178be00 0000000000000022
  47. GPR04: 0000000000000001 0000000000000dd5 9000000000009033 0000000031d90058
  48. GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001003
  49. GPR12: 0000000000004000 c0000007fffd8c00 c000000000143668 c0000007fb0eb800
  50. GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  51. GPR20: c0002006e7c743e8 c0002006e7c743c8 c0002006e7c743d8 0000000010624dd3
  52. GPR24: 0000000000000001 0000000000000000 0000000000000059 c0000007f34d0d8c
  53. GPR28: c0000007f34d0e40 c0002006e7c74000 c0000007f34d0c00 c0002006e7c74000
  54. [ 707.438746] NIP [c00000000019b530] __free_irq+0xe0/0x420
  55. [ 707.438784] LR [c00000000019b52c] __free_irq+0xdc/0x420
  56. [ 707.438819] Call Trace:
  57. [ 707.438840] [c0000007f30ef900] [c00000000019b52c] __free_irq+0xdc/0x420 (unreliable)
  58. [ 707.438895] [c0000007f30ef9a0] [c00000000019b978] free_irq+0x78/0xe0
  59. [ 707.438952] [c0000007f30ef9d0] [c008000018b9a4bc] lpfc_sli4_disable_intr+0xd4/0x170 [lpfc]
  60. [ 707.439017] [c0000007f30efa00] [c008000018baebc8] lpfc_pci_remove_one+0x780/0xa70 [lpfc]
  61. [ 707.439073] [c0000007f30efad0] [c00000000079a75c] pci_device_remove+0x6c/0x120
  62. [ 707.439129] [c0000007f30efb10] [c00000000089a204] device_release_driver_internal+0x294/0x380
  63. [ 707.439193] [c0000007f30efb60] [c00000000078ddd8] pci_stop_bus_device+0x98/0xe0
  64. [ 707.439248] [c0000007f30efba0] [c00000000078dfc8] pci_stop_and_remove_bus_device+0x28/0x40
  65. [ 707.439303] [c0000007f30efbd0] [c000000000061640] pci_hp_remove_devices+0x90/0x130
  66. [ 707.439359] [c0000007f30efc60] [c000000000041600] eeh_handle_normal_event+0x280/0x680
  67. [ 707.439414] [c0000007f30efd10] [c000000000042030] eeh_event_handler+0x130/0x1e0
  68. [ 707.439468] [c0000007f30efdc0] [c000000000143808] kthread+0x1a8/0x1b0
  69. [ 707.439515] [c0000007f30efe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
  70. [ 707.439568] Instruction dump:
  71. [ 707.439597] e93f0008 7fa9e840 419e0098 7feafb78 ebea0018 2fbf0000 409effe8 3c62ff86
  72. [ 707.439670] 7f44d378 3863d1d8 4bf77061 60000000 <0fe00000> 7f24cb78 7f63db78 48bca12d
  73. [ 707.439728] ---[ end trace e4a3a385a4cfeb7a ]---
  74. [ 707.439816] lpfc 0034:01:00.1: 1:3317 HBA not functional: IP Reset Failed try: echo fw_reset > board_mode
  75. [ 707.442211] ------------[ cut here ]------------
  76. [ 707.442251] lpfc 0034:01:00.1: disabling already-disabled device
  77. [ 707.442272] WARNING: CPU: 34 PID: 782 at drivers/pci/pci.c:1658 pci_disable_device+0x140/0x170
  78. [ 707.450750] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter kvm_hv kvm input_leds joydev mac_hid at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash ipmi_powernv ipmi_devintf sch_fq_codel uio ipmi_msghandler ib_iser ibmpowernv mtd rdma_cm opal_prd iw_cm vmx_crypto ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ses enclosure scsi_transport_sas ast i2c_algo_bit
  79. [ 707.520085] ttm drm_kms_helper lpfc syscopyarea sysfillrect sysimgblt fb_sys_fops drm qla2xxx nvmet_fc nvmet nvme_fc nvme_fabrics i40e aacraid megaraid_sas crct10dif_vpmsum crc32c_vpmsum scsi_transport_fc drm_panel_orientation_quirks
  80. [ 707.540883] CPU: 34 PID: 782 Comm: eehd Tainted: G W 4.18.0-rc4-rosattig+ #2
  81. [ 707.549207] NIP: c000000000794530 LR: c00000000079452c CTR: c000000000d5d6e0
  82. [ 707.557528] REGS: c0000007f30ef6e0 TRAP: 0700 Tainted: G W (4.18.0-rc4-rosattig+)
  83. [ 707.565851] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48044222 XER: 20040000
  84. [ 707.574170] CFAR: c0000000001125f0 IRQMASK: 0
  85. GPR00: c00000000079452c c0000007f30ef960 c00000000178be00 0000000000000034
  86. GPR04: 0000000000000001 0000000000000df3 65642064656c6261 0000000000000000
  87. GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001003
  88. GPR12: 0000000000004000 c0000007fffd8c00 c000000000143668 c0000007fb0eb800
  89. GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  90. GPR20: c0002006e7c743e8 c0002006e7c743c8 c0002006e7c743d8 0000000010624dd3
  91. GPR24: 0000000000000001 0000000000000001 c0000007fb06a000 c0002006e7c750e8
  92. GPR28: c0002006e7bf5000 c0002006e7c743a8 c0000007fb06a7e0 c0000007fb06a000
  93. [ 707.642132] NIP [c000000000794530] pci_disable_device+0x140/0x170
  94. [ 707.647682] LR [c00000000079452c] pci_disable_device+0x13c/0x170
  95. [ 707.654605] Call Trace:
  96. [ 707.657383] [c0000007f30ef960] [c00000000079452c] pci_disable_device+0x13c/0x170 (unreliable)
  97. [ 707.665713] [c0000007f30ef9d0] [c008000018b9bb1c] lpfc_disable_pci_dev+0x54/0x80 [lpfc]
  98. [ 707.672651] [c0000007f30efa00] [c008000018baecb4] lpfc_pci_remove_one+0x86c/0xa70 [lpfc]
  99. [ 707.680958] [c0000007f30efad0] [c00000000079a75c] pci_device_remove+0x6c/0x120
  100. [ 707.687901] [c0000007f30efb10] [c00000000089a204] device_release_driver_internal+0x294/0x380
  101. [ 707.696221] [c0000007f30efb60] [c00000000078ddd8] pci_stop_bus_device+0x98/0xe0
  102. [ 707.704533] [c0000007f30efba0] [c00000000078dfc8] pci_stop_and_remove_bus_device+0x28/0x40
  103. [ 707.712855] [c0000007f30efbd0] [c000000000061640] pci_hp_remove_devices+0x90/0x130
  104. [ 707.719798] [c0000007f30efc60] [c000000000041600] eeh_handle_normal_event+0x280/0x680
  105. [ 707.728110] [c0000007f30efd10] [c000000000042030] eeh_event_handler+0x130/0x1e0
  106. [ 707.735052] [c0000007f30efdc0] [c000000000143808] kthread+0x1a8/0x1b0
  107. [ 707.741979] [c0000007f30efe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
  108. [ 707.748918] Instruction dump:
  109. [ 707.751697] 992a8db4 f8010080 480fde61 60000000 e8bf00e8 7c641b78 2fa50000 419e0030
  110. [ 707.760009] 3c62ff8b 3863d798 4b97e061 60000000 <0fe00000> e8010080 7c0803a6 4bfffef0
  111. [ 707.766952] ---[ end trace e4a3a385a4cfeb7b ]---
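
Both warnings ("already-free IRQ" and "already-disabled device") are what a teardown path prints when a resource is released twice. The sketch below is a user-space model of that situation under my own assumptions: `model_dev` and both teardown functions are hypothetical, and I have not confirmed that this is what happens in lpfc_pci_remove_one() after the failed EEH recovery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; the flags record what is currently held. */
struct model_dev {
	bool irq_requested;
	bool pci_enabled;
	int warnings;	/* counts "already released" events */
};

/* Releases unconditionally, so a second call trips both warnings. */
void model_teardown_unguarded(struct model_dev *d)
{
	if (!d->irq_requested)
		d->warnings++;	/* models "Trying to free already-free IRQ" */
	d->irq_requested = false;

	if (!d->pci_enabled)
		d->warnings++;	/* models "disabling already-disabled device" */
	d->pci_enabled = false;
}

/* Releases only what is still held, so repeated calls are harmless. */
void model_teardown_guarded(struct model_dev *d)
{
	if (d->irq_requested)
		d->irq_requested = false;
	if (d->pci_enabled)
		d->pci_enabled = false;
}
```

If the driver already released the IRQs and disabled the PCI device while handling the error, the later remove call from the EEH hotplug path would behave like the second call to the unguarded version here.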
  112.  
  113.  
  114. 4. One last piece of information: in both scenarios (when the crash happens and when we see the traces above), the following messages show up in the log:
  115.  
  116. [ 703.562832] eeh_handle_normal_event: Cannot reset, err=-5
  117. [ 703.562871] EEH: Unable to recover from failure from PHB#34-PE#0.
  118. Please try reseating or replacing it
  119. [ 703.569765] pnv_ioda_unfreeze_pe: Failure -6 clear 1 on PHB#34-PE#0
  120. [ 703.569812] eeh_pci_enable: Unexpected state change 2 on PHB#34-PE#0, err=-5
  121.  
  122.  
  123. However, I don't know whether these messages are relevant to our debugging.
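
As a small aid for reading those messages, the helper below turns a kernel-style negative errno into text. `model_errno_text` is a hypothetical name. On Linux, err=-5 corresponds to EIO. The -6 printed by pnv_ioda_unfreeze_pe looks like a firmware (OPAL) return code rather than an errno; that is an assumption on my part, so it is not decoded here.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Kernel messages print errors as negative errno values (err=-5).
 * Convert such a value to the C library's description of that errno.
 */
const char *model_errno_text(int err)
{
	return strerror(err < 0 ? -err : err);
}
```

For example, `model_errno_text(-5)` returns the C library's description of EIO, which suggests that the reset itself failed with an I/O error before the driver's remove path ran.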
  124.  
  125. Any suggestions about how to continue debugging/investigating this bug are very welcome!
  126.  
  127.  
  128. Thanks,
  129. Rodrigo