- Hey,
- I've been looking at this bug for the last couple of days. Here is some info:
- 1. I could reproduce it on our system with Ubuntu 18.04 and the current upstream kernel (v4.18-rc4).
- (Logs will be attached next)
- 2. I wrote a simple patch that avoids the crash, but I'm concerned about another trace that shows up afterwards (see item 3).
- Explaining the patch: at some point, the fcp_wq pointer (lpfc_queue) becomes NULL, probably during the recovery process triggered by the EEH injection.
- So I just check for NULL near the beginning of the function and return early:
- ## Patch ##
- ---------------------------------------------------------------------
- --- a/drivers/scsi/lpfc/lpfc_sli.c
- +++ b/drivers/scsi/lpfc/lpfc_sli.c
- @@ -3981,6 +3981,11 @@ lpfc_sli_flush_fcp_rings(struct lpfc_hba *phba)
-  	phba->hba_flag |= HBA_FCP_IOQ_FLUSH;
-  	spin_unlock_irq(&phba->hbalock);
- +	if (unlikely(!phba->sli4_hba.fcp_wq)) {
- +		printk("lpfc_sli_flush_fcp_rings -- unlikely(!phba->sli4_hba.fcp_wq) -- rrg\n");
- +		return;
- +	}
- +
-  	/* Look on all the FCP Rings for the iotag */
-  	if (phba->sli_rev >= LPFC_SLI_REV4) {
-  		for (i = 0; i < phba->cfg_fcp_io_channel; i++) {
- ----------------------------------------------------------------------
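To make the intent of the check easier to discuss, here is a minimal userspace sketch of the same guard. It is only a model: `model_hba`, `model_queue` and `model_flush_fcp_rings` are hypothetical stand-ins, not the real lpfc structures, and the per-slot NULL check is an extra assumption that the patch above does not make.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real lpfc types. */
struct model_queue {
	int pending;                 /* number of outstanding entries */
};

struct model_hba {
	struct model_queue **fcp_wq; /* assumed NULL once recovery tore the queues down */
	unsigned int nr_channels;
};

/*
 * Flush every channel's queue.
 * Returns the number of queues flushed, or -1 if the queue array is gone.
 */
int model_flush_fcp_rings(struct model_hba *hba)
{
	unsigned int i;
	int flushed = 0;

	assert(hba != NULL);

	/* Same idea as the patch: bail out before dereferencing a NULL array. */
	if (!hba->fcp_wq)
		return -1;

	for (i = 0; i < hba->nr_channels; i++) {
		if (!hba->fcp_wq[i])
			continue;        /* assumption: single slots may be freed too */
		hba->fcp_wq[i]->pending = 0;
		flushed++;
	}
	return flushed;
}
```

Calling `model_flush_fcp_rings()` on a model whose `fcp_wq` is NULL returns -1 instead of faulting, which mirrors what the early `return` in the patch does.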
- 3. After applying the patch, the system doesn't crash, but I see the following two traces:
- [ 707.437261] ------------[ cut here ]------------
- [ 707.437313] Trying to free already-free IRQ 89
- [ 707.437363] WARNING: CPU: 34 PID: 782 at kernel/irq/manage.c:1583 __free_irq+0xe0/0x420
- [ 707.437417] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter kvm_hv kvm input_leds joydev mac_hid at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash ipmi_powernv ipmi_devintf sch_fq_codel uio ipmi_msghandler ib_iser ibmpowernv mtd rdma_cm opal_prd iw_cm vmx_crypto ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ses enclosure scsi_transport_sas ast i2c_algo_bit
- [ 707.437916] ttm drm_kms_helper lpfc syscopyarea sysfillrect sysimgblt fb_sys_fops drm qla2xxx nvmet_fc nvmet nvme_fc nvme_fabrics i40e aacraid megaraid_sas crct10dif_vpmsum crc32c_vpmsum scsi_transport_fc drm_panel_orientation_quirks
- [ 707.438063] CPU: 34 PID: 782 Comm: eehd Not tainted 4.18.0-rc4-rosattig+ #2
- [ 707.438109] NIP: c00000000019b530 LR: c00000000019b52c CTR: 000000003003dfbc
- [ 707.438162] REGS: c0000007f30ef680 TRAP: 0700 Not tainted (4.18.0-rc4-rosattig+)
- [ 707.438215] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48044222 XER: 20040000
- [ 707.438290] CFAR: c0000000001125f0 IRQMASK: 1
- GPR00: c00000000019b52c c0000007f30ef900 c00000000178be00 0000000000000022
- GPR04: 0000000000000001 0000000000000dd5 9000000000009033 0000000031d90058
- GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001003
- GPR12: 0000000000004000 c0000007fffd8c00 c000000000143668 c0000007fb0eb800
- GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
- GPR20: c0002006e7c743e8 c0002006e7c743c8 c0002006e7c743d8 0000000010624dd3
- GPR24: 0000000000000001 0000000000000000 0000000000000059 c0000007f34d0d8c
- GPR28: c0000007f34d0e40 c0002006e7c74000 c0000007f34d0c00 c0002006e7c74000
- [ 707.438746] NIP [c00000000019b530] __free_irq+0xe0/0x420
- [ 707.438784] LR [c00000000019b52c] __free_irq+0xdc/0x420
- [ 707.438819] Call Trace:
- [ 707.438840] [c0000007f30ef900] [c00000000019b52c] __free_irq+0xdc/0x420 (unreliable)
- [ 707.438895] [c0000007f30ef9a0] [c00000000019b978] free_irq+0x78/0xe0
- [ 707.438952] [c0000007f30ef9d0] [c008000018b9a4bc] lpfc_sli4_disable_intr+0xd4/0x170 [lpfc]
- [ 707.439017] [c0000007f30efa00] [c008000018baebc8] lpfc_pci_remove_one+0x780/0xa70 [lpfc]
- [ 707.439073] [c0000007f30efad0] [c00000000079a75c] pci_device_remove+0x6c/0x120
- [ 707.439129] [c0000007f30efb10] [c00000000089a204] device_release_driver_internal+0x294/0x380
- [ 707.439193] [c0000007f30efb60] [c00000000078ddd8] pci_stop_bus_device+0x98/0xe0
- [ 707.439248] [c0000007f30efba0] [c00000000078dfc8] pci_stop_and_remove_bus_device+0x28/0x40
- [ 707.439303] [c0000007f30efbd0] [c000000000061640] pci_hp_remove_devices+0x90/0x130
- [ 707.439359] [c0000007f30efc60] [c000000000041600] eeh_handle_normal_event+0x280/0x680
- [ 707.439414] [c0000007f30efd10] [c000000000042030] eeh_event_handler+0x130/0x1e0
- [ 707.439468] [c0000007f30efdc0] [c000000000143808] kthread+0x1a8/0x1b0
- [ 707.439515] [c0000007f30efe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
- [ 707.439568] Instruction dump:
- [ 707.439597] e93f0008 7fa9e840 419e0098 7feafb78 ebea0018 2fbf0000 409effe8 3c62ff86
- [ 707.439670] 7f44d378 3863d1d8 4bf77061 60000000 <0fe00000> 7f24cb78 7f63db78 48bca12d
- [ 707.439728] ---[ end trace e4a3a385a4cfeb7a ]---
- [ 707.439816] lpfc 0034:01:00.1: 1:3317 HBA not functional: IP Reset Failed try: echo fw_reset > board_mode
- [ 707.442211] ------------[ cut here ]------------
- [ 707.442251] lpfc 0034:01:00.1: disabling already-disabled device
- [ 707.442272] WARNING: CPU: 34 PID: 782 at drivers/pci/pci.c:1658 pci_disable_device+0x140/0x170
- [ 707.450750] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter kvm_hv kvm input_leds joydev mac_hid at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash ipmi_powernv ipmi_devintf sch_fq_codel uio ipmi_msghandler ib_iser ibmpowernv mtd rdma_cm opal_prd iw_cm vmx_crypto ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ses enclosure scsi_transport_sas ast i2c_algo_bit
- [ 707.520085] ttm drm_kms_helper lpfc syscopyarea sysfillrect sysimgblt fb_sys_fops drm qla2xxx nvmet_fc nvmet nvme_fc nvme_fabrics i40e aacraid megaraid_sas crct10dif_vpmsum crc32c_vpmsum scsi_transport_fc drm_panel_orientation_quirks
- [ 707.540883] CPU: 34 PID: 782 Comm: eehd Tainted: G W 4.18.0-rc4-rosattig+ #2
- [ 707.549207] NIP: c000000000794530 LR: c00000000079452c CTR: c000000000d5d6e0
- [ 707.557528] REGS: c0000007f30ef6e0 TRAP: 0700 Tainted: G W (4.18.0-rc4-rosattig+)
- [ 707.565851] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48044222 XER: 20040000
- [ 707.574170] CFAR: c0000000001125f0 IRQMASK: 0
- GPR00: c00000000079452c c0000007f30ef960 c00000000178be00 0000000000000034
- GPR04: 0000000000000001 0000000000000df3 65642064656c6261 0000000000000000
- GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001003
- GPR12: 0000000000004000 c0000007fffd8c00 c000000000143668 c0000007fb0eb800
- GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
- GPR20: c0002006e7c743e8 c0002006e7c743c8 c0002006e7c743d8 0000000010624dd3
- GPR24: 0000000000000001 0000000000000001 c0000007fb06a000 c0002006e7c750e8
- GPR28: c0002006e7bf5000 c0002006e7c743a8 c0000007fb06a7e0 c0000007fb06a000
- [ 707.642132] NIP [c000000000794530] pci_disable_device+0x140/0x170
- [ 707.647682] LR [c00000000079452c] pci_disable_device+0x13c/0x170
- [ 707.654605] Call Trace:
- [ 707.657383] [c0000007f30ef960] [c00000000079452c] pci_disable_device+0x13c/0x170 (unreliable)
- [ 707.665713] [c0000007f30ef9d0] [c008000018b9bb1c] lpfc_disable_pci_dev+0x54/0x80 [lpfc]
- [ 707.672651] [c0000007f30efa00] [c008000018baecb4] lpfc_pci_remove_one+0x86c/0xa70 [lpfc]
- [ 707.680958] [c0000007f30efad0] [c00000000079a75c] pci_device_remove+0x6c/0x120
- [ 707.687901] [c0000007f30efb10] [c00000000089a204] device_release_driver_internal+0x294/0x380
- [ 707.696221] [c0000007f30efb60] [c00000000078ddd8] pci_stop_bus_device+0x98/0xe0
- [ 707.704533] [c0000007f30efba0] [c00000000078dfc8] pci_stop_and_remove_bus_device+0x28/0x40
- [ 707.712855] [c0000007f30efbd0] [c000000000061640] pci_hp_remove_devices+0x90/0x130
- [ 707.719798] [c0000007f30efc60] [c000000000041600] eeh_handle_normal_event+0x280/0x680
- [ 707.728110] [c0000007f30efd10] [c000000000042030] eeh_event_handler+0x130/0x1e0
- [ 707.735052] [c0000007f30efdc0] [c000000000143808] kthread+0x1a8/0x1b0
- [ 707.741979] [c0000007f30efe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
- [ 707.748918] Instruction dump:
- [ 707.751697] 992a8db4 f8010080 480fde61 60000000 e8bf00e8 7c641b78 2fa50000 419e0030
- [ 707.760009] 3c62ff8b 3863d798 4b97e061 60000000 <0fe00000> e8010080 7c0803a6 4bfffef0
- [ 707.766952] ---[ end trace e4a3a385a4cfeb7b ]---
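Both warnings say that the remove path released something that was already released (IRQ 89, then the PCI device itself). One way to reason about that is a teardown that records what it has released, so running it twice is harmless. The sketch below is a userspace model under that assumption. All names (`model_dev`, `model_recovery_failed`, `model_remove_one`) are hypothetical, and I have not confirmed which path in lpfc releases the resources first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; the counters stand in for free_irq() and
 * pci_disable_device() so double calls can be observed. */
struct model_dev {
	bool irq_requested;
	bool pci_enabled;
	int free_irq_calls;
	int disable_calls;
};

/* Release the IRQ only if it is still held. */
void model_disable_intr(struct model_dev *dev)
{
	assert(dev != NULL);
	if (!dev->irq_requested)
		return;                  /* already freed: skip, no warning */
	dev->irq_requested = false;
	dev->free_irq_calls++;
}

/* Disable the PCI device only if it is still enabled. */
void model_disable_pci_dev(struct model_dev *dev)
{
	assert(dev != NULL);
	if (!dev->pci_enabled)
		return;                  /* already disabled: skip */
	dev->pci_enabled = false;
	dev->disable_calls++;
}

/* Assumed error path taken when recovery fails part-way. */
void model_recovery_failed(struct model_dev *dev)
{
	model_disable_intr(dev);
	model_disable_pci_dev(dev);
}

/* Later hot-unplug path, which repeats the same teardown. */
void model_remove_one(struct model_dev *dev)
{
	model_disable_intr(dev);
	model_disable_pci_dev(dev);
}
```

With this structure, running `model_recovery_failed()` followed by `model_remove_one()` releases each resource exactly once. Whether the driver should track this state itself, or whether the earlier release is the real bug, is still an open question.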
- 4. One last piece of information: in both scenarios (when the crash happens and when we see the traces above), the following messages show up in the log:
- [ 703.562832] eeh_handle_normal_event: Cannot reset, err=-5
- [ 703.562871] EEH: Unable to recover from failure from PHB#34-PE#0.
- Please try reseating or replacing it
- [ 703.569765] pnv_ioda_unfreeze_pe: Failure -6 clear 1 on PHB#34-PE#0
- [ 703.569812] eeh_pci_enable: Unexpected state change 2 on PHB#34-PE#0, err=-5
- However, I don't know whether these messages are relevant to our debugging.
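As a small aid for reading these lines: the `err=-5` values look like negative errno codes, and 5 is `EIO` on Linux. The helper below is a userspace sketch that names a few common codes; `model_errno_name` and `model_errno_text` are hypothetical helpers. The `-6` printed by `pnv_ioda_unfreeze_pe` is deliberately left out, on the assumption that it may be a firmware return code rather than an errno.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map a negative kernel-style error value to its symbolic errno name.
 * Only a handful of codes are covered; everything else is "unknown". */
const char *model_errno_name(int err)
{
	assert(err <= 0);            /* kernel convention: 0 or negative errno */

	switch (-err) {
	case 0:      return "success";
	case EIO:    return "EIO";
	case ENOMEM: return "ENOMEM";
	case ENODEV: return "ENODEV";
	case EINVAL: return "EINVAL";
	default:     return "unknown";
	}
}

/* Human-readable description from the C library for the same value. */
const char *model_errno_text(int err)
{
	assert(err <= 0);
	return strerror(-err);
}
```

For example, `model_errno_name(-5)` yields `"EIO"`, which suggests the reset attempt failed with an I/O error before the unrecoverable-failure message was printed.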
- Any suggestions about how to continue debugging/investigating this bug are very welcome!
- Thanks,
- Rodrigo