TheFluffyAdmin

EMC KB207382 VPLEX ATS Miscompare issue

Nov 6th, 2015
  1. VPLEX: Random temporary loss of connection to storage devices and/or performance degradation on ESXi hosts from version 5.5 u2
  2.  
  3. Article Number:000207382 Version:1
  4.  
  5. Key Information
  6.  
  7. Audience: Level 30 = Customers Article Type: Break Fix
  8. Last Published: Tue Oct 20 08:48:07 GMT 2015 Validation Status: Technically Approved
  9. Summary: Due to an ATS miscompare on a VMFS heartbeat slot, the ESXi host attempts to regain control of the device by issuing a SCSI device reset on the LUN holding the VMFS. All active IO on this LUN is aborted and the SCSI device is reset. A temporary loss of connectivity shows up in the VMkernel logs.
  10.  
  11.  
  12.  
  13. Article Content
  14. Impact The ESXi host(s) lose connection to the VMFS datastore for a short period of time. Any VMs on the datastore may crash or report IO errors during this period.
  15. Issue An ESXi host request to take a heartbeat lock using Compare & Write (SCSI Operation code 89) on a VMFS3 or VMFS5 datastore fails due to a "(ATS) Miscompare during verify operation".
  16.  
  17. Due to this ATS (Atomic Test & Set) miscompare on a VMFS heartbeat slot, the ESXi host attempts to regain control of the device. To do this, the host issues a SCSI device reset on the LUN holding the VMFS.
  18. All active IO on this LUN will be aborted and the SCSI device will be reset.
  19.  
  20. ATS Miscompare can happen both with NMP and PowerPath.
  21.  
  22.  
  23. You will see events similar to the following in the host VMKernel log:
  24.  
  25. 2015-09-30T22:13:55.516Z cpu1:33645)ScsiDeviceIO: 2338: Cmd(0x413686250680) 0x89, CmdSN 0x12b from world 32949 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
  26. For hosts with NMP you may also see:
  27.  
  28. 2015-09-30T22:13:55.516Z cpu1:33645)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x89 (0x413686250680, 32949) to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" on path "vmhba2:C0:T5:L13" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
  29.  
  30.  
  31. These events mean that the VPLEX returns sense data 0E/1D/00 to the host (MISCOMPARE DURING VERIFY OPERATION) in response to SCSI Operation Code 89 (COMPARE AND WRITE), which is used by VMware hardware-assisted locking mechanism (ATS).
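The signature described above (opcode 0x89 failing with valid sense data 0E/1D/00) can be searched for mechanically. The following is a minimal sketch, assuming the VMkernel lines follow the two layouts shown in this article; the function name and regular expressions are illustrative and are not part of any EMC or VMware tool.

```python
import re

# Opcode as logged by ScsiDeviceIO ("Cmd(0x...) 0x89,") or by NMP ("Cmd 0x89 (0x...,").
OPCODE_RE = re.compile(r"Cmd\(0x[0-9a-f]+\) 0x([0-9a-f]+),|Cmd 0x([0-9a-f]+) \(", re.I)
# Only "Valid sense data" is trusted; "Possible sense data" is ignored in this sketch.
SENSE_RE = re.compile(r"Valid sense data: 0x([0-9a-f]+) 0x([0-9a-f]+) 0x([0-9a-f]+)", re.I)

COMPARE_AND_WRITE = 0x89          # SCSI opcode used by ATS
MISCOMPARE = (0x0E, 0x1D, 0x00)   # MISCOMPARE DURING VERIFY OPERATION


def is_ats_miscompare(line):
    """Return True if a VMkernel log line shows 0x89 failing with 0E/1D/00."""
    op = OPCODE_RE.search(line)
    sense = SENSE_RE.search(line)
    if op is None or sense is None:
        return False
    opcode = int(op.group(1) or op.group(2), 16)
    triple = tuple(int(g, 16) for g in sense.groups())
    return opcode == COMPARE_AND_WRITE and triple == MISCOMPARE


# Sample line copied from the ScsiDeviceIO event shown above.
sample = ('2015-09-30T22:13:55.516Z cpu1:33645)ScsiDeviceIO: 2338: '
          'Cmd(0x413686250680) 0x89, CmdSN 0x12b from world 32949 to dev '
          '"naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x0 D:0x2 P:0x0 '
          'Valid sense data: 0xe 0x1d 0x0.')
print(is_ats_miscompare(sample))
```

Lines that report only "Possible sense data" (for example, commands aborted by the reset itself) are deliberately not counted, since this article's solution applies only to a miscompare returned by the VPLEX.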
  32.  
  33. Atomic Test & Set (ATS)
  34. This is a replacement lock mechanism for SCSI reservations on VMFS volumes when doing metadata updates. ATS locks can be considered a mechanism to modify a disk sector which, when successful, allows an ESXi host to do a metadata update on a VMFS. This includes allocating space to a VMDK during provisioning, as certain characteristics need to be updated in the metadata to reflect the new size of the file. In the initial VAAI release, the ATS primitives had to be implemented differently on each storage array, so the ATS opcode differed depending on the vendor. ATS is now a T10 standard and uses opcode 0x89 (COMPARE AND WRITE).
  35. For VMFS5 datastores formatted on a VAAI-enabled array, heartbeat locking is done using ATS. There should no longer be any SCSI reservations on VAAI-enabled VMFS5. ATS continues to be used even if there is contention. On non-VAAI arrays, SCSI reservations continue to be used for establishing critical sections in VMFS5.
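The compare-then-write behaviour behind ATS can be illustrated with a small in-memory model. This is a sketch only: the `disk` dictionary, the sector contents, and the `MiscompareError` name are hypothetical stand-ins, and a real array performs the compare and the write as one atomic operation on the LUN.

```python
class MiscompareError(Exception):
    """Stands in for sense data 0E/1D/00 (MISCOMPARE DURING VERIFY OPERATION)."""


def compare_and_write(disk, lba, expected, new):
    """Write `new` to sector `lba` only if the sector currently holds `expected`."""
    if disk[lba] != expected:
        # The sector changed since the caller last read it: reject the write.
        raise MiscompareError("sector %d does not match expected image" % lba)
    disk[lba] = new


# Two hosts read the same "free" lock sector, then both try to claim it.
disk = {0: b"free"}
seen_by_a = disk[0]
seen_by_b = disk[0]

compare_and_write(disk, 0, seen_by_a, b"host-A")      # host A wins the lock
try:
    compare_and_write(disk, 0, seen_by_b, b"host-B")  # host B's image is stale
    b_result = "locked"
except MiscompareError:
    b_result = "miscompare"

print(disk[0], b_result)
```

In this model the second caller gets a miscompare because its expected image is stale, which is the normal contention case. The issue in this article concerns a miscompare reported on a host's own heartbeat slot.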
  36.  
  37.  
  38.  
  39. The following events are seen on the VPLEX Firmware logs:
  40. Host HBAs logging out (tach/38, stdf/18) and logging back in (tach/37, stdf/17):
  41. 128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23768:<6>2015/09/18 06:53:24.45: tach/38 tach01 (A0-FC01): login with 0x1234567890123456 (nPortId 0x012345) type TGT is closing.
  42. 128.221.252.68/cpu0/log:5988:W/"0060165e9a38102140-2":46706:<6>2015/09/18 06:53:24.45: tach/38 tach01 (B0-FC01): login with 0x1234567890123456 (nPortId 0x012345) type TGT is closing.
  43. 128.221.252.68/cpu0/log:5988:W/"0060165e9a38102140-2":46707:<4>2015/09/18 06:53:24.45: stdf/18 FCP connection lost. IT: [EXAMPLESERVER_HBA1 (0x1234567890123456) B0-FC01 (0x50001442b0353d01)]
  44. 128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23769:<4>2015/09/18 06:53:24.45: stdf/18 FCP connection lost. IT: [EXAMPLESERVER_HBA1 (0x1234567890123456) A0-FC01 (0x50001442a0353d01)]
  45.  
  46. 128.221.252.68/cpu0/log:5988:W/"0060165e9a38102140-2":46708:<6>2015/09/18 06:53:24.45: tach/37 tach01 (B0-FC01): login with 0x1234567890123456 (nPortId 0x012345) type TGT is ready.
  47. 128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23770:<6>2015/09/18 06:53:24.45: tach/37 tach01 (A0-FC01): login with 0x1234567890123456 (nPortId 0x012345) type TGT is ready.
  48. 128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23771:<4>2015/09/18 06:53:24.45: stdf/17 FCP connection established. IT: [EXAMPLESERVER_HBA1 (0x1234567890123456) A0-FC01 (0x50001442a0353d01)]
  49. 128.221.252.68/cpu0/log:5988:W/"0060165e9a38102140-2":46709:<4>2015/09/18 06:53:24.45: stdf/17 FCP connection established. IT: [EXAMPLESERVER_HBA1 (0x1234567890123456) B0-FC01 (0x50001442b0353d01)]
  50.  
  51. Registered State Change Notification (RSCN) Received (tach/42), due to the Host HBA resets (logouts/logins), preceded by the string "TGT_LGN_FROM_UNKNOWN_NPID":
  52. 128.221.252.68/cpu0/log:5988:W/"0060165e9a38102140-2":46710:<6>2015/09/18 06:53:26.50: tach/42 tach00 (B0-FC00): finished discovery in 58.650 msec, reason to start: TGT_LGN_FROM_UNKNOWN_NPID:RSCN_RECEIVED, result: succeeded
  54. 128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23772:<6>2015/09/18 06:53:26.52: tach/42 tach00 (A0-FC00): finished discovery in 61.626 msec, reason to start: TGT_LGN_FROM_UNKNOWN_NPID:RSCN_RECEIVED, result: succeeded
  56.  
  57. Normal RSCN_RECEIVED messages (tach/42) are also expected:
  58. 128.221.252.68/cpu0/log:5988:W/"0060165e9a38102140-2":46711:<6>2015/09/18 06:53:26.64: tach/42 tach01 (B0-FC01): finished discovery in 63.309 msec, reason to start: RSCN_RECEIVED, result: succeeded
  59. 128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23773:<6>2015/09/18 06:53:26.66: tach/42 tach01 (A0-FC01): finished discovery in 62.665 msec, reason to start: RSCN_RECEIVED, result: succeeded
  60.  
  61. SCSI Operation Code 89 (Compare & Write) Host aborts (stdf/10 with status code starting with "89"):
  62. firmware.log_20150123073924.1:128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23588:<6>2015/09/17 11:35:43.15: stdf/10 Scsi Tmf [Abort Task] on fcp ITLQ: [EXAMPLESERVER_HBA1 (0x1234567890123456) A0-FC01 (0x50001442a0353d01) 0x7d000000000000 0x2ed] vol VIRTUAL_VOLUME_NAME_vol taskElapsedTime(usec) 7996921 dormantQCnt 5 enabledQCnt 1 status 8900000000000100:0
  63. firmware.log_20150123073924.1:128.221.252.67/cpu0/log:5988:W/"0060165f237510728-2":23589:<6>2015/09/17 11:35:44.15: stdf/10 Scsi Tmf [Abort Task] on fcp ITLQ: [EXAMPLESERVER_HBA1 (0x1234567890123456) A0-FC01 (0x50001442a0353d01) 0x7d000000000000 0x55d] vol VIRTUAL_VOLUME_NAME_vol taskElapsedTime(usec) 929473 dormantQCnt 6 enabledQCnt 1 status 8900000000000100:0
  64.  
  65.  
  66. Events explained:
  67. tach/38: An FC login is closing.
  68. stdf/18: This log message is generated whenever a FCP initiator port's connection to a target port is lost, due to logout or departure from the fabric.
  69. tach/37: An FC login is ready to serve IO.
  70. tach/42: Summary of an FC discovery completed recently.
  71. stdf/10: Host aborting an IO. Hosts will escalate to "Logical Unit Reset" and "Target Reset" TMFs if they remain unhappy.
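The event codes explained above can be tallied from a VPLEX firmware log with a short script. This is a minimal sketch, assuming each line carries its event code in the `tach/NN` or `stdf/NN` form shown in the samples; the function names are illustrative, and the `stdf/17` description is taken from the sample log text rather than from the list above.

```python
import re
from collections import Counter

# Descriptions paraphrased from this article.
EVENTS = {
    "tach/38": "FC login is closing",
    "stdf/18": "FCP initiator connection lost",
    "tach/37": "FC login is ready to serve IO",
    "stdf/17": "FCP connection established",
    "tach/42": "FC discovery summary",
    "stdf/10": "host aborted an IO",
}

EVENT_RE = re.compile(r"\b((?:tach|stdf)/\d+)\b")
STATUS_RE = re.compile(r"status ([0-9a-f]+):", re.I)


def event_code(line):
    """Return the first tach/NN or stdf/NN code in a firmware log line, or None."""
    match = EVENT_RE.search(line)
    return match.group(1) if match else None


def is_compare_and_write_abort(line):
    """True for an stdf/10 abort whose status code starts with 89 (opcode 0x89)."""
    status = STATUS_RE.search(line)
    return (event_code(line) == "stdf/10"
            and status is not None
            and status.group(1).startswith("89"))


def summarize(lines):
    """Count known event codes across an iterable of firmware log lines."""
    return Counter(code for code in map(event_code, lines) if code in EVENTS)
```

For example, `summarize(open("firmware.log"))` would return a `Counter` keyed by event code, where the file name is a placeholder for a collected firmware log.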
  72. Environment EMC Hardware: VPLEX Series
  73. EMC Hardware: VPLEX VS1
  74. EMC Hardware: VPLEX VS2
  75. EMC Hardware: VPLEX-Local
  76. EMC Hardware: VPLEX-Metro
  77. EMC Hardware: VPLEX-Geo
  78.  
  79. EMC Software: GeoSynchrony 5.0
  80. EMC Software: GeoSynchrony 5.0.1
  81. EMC Software: GeoSynchrony 5.1
  82. EMC Software: GeoSynchrony 5.1 Patch 1
  83. EMC Software: GeoSynchrony 5.1 Patch 2
  84. EMC Software: GeoSynchrony 5.1 Patch 3
  85. EMC Software: GeoSynchrony 5.1 Patch 4
  86. EMC Software: GeoSynchrony 5.2
  87. EMC Software: GeoSynchrony 5.2 Patch 1
  88. EMC Software: GeoSynchrony 5.2 Service Pack 1
  89. EMC Software: GeoSynchrony 5.2 Service Pack 1 Patch 1
  90. EMC Software: GeoSynchrony 5.2 Service Pack 1 Patch 2
  91. EMC Software: GeoSynchrony 5.2 Service Pack 1 Patch 3
  92. EMC Software: GeoSynchrony 5.3
  93. EMC Software: GeoSynchrony 5.3 Patch 1
  94. EMC Software: GeoSynchrony 5.3 Patch 2
  95. EMC Software: GeoSynchrony 5.3 Patch 3
  96. EMC Software: GeoSynchrony 5.3 Patch 4
  97. EMC Software: GeoSynchrony 5.4
  98. EMC Software: GeoSynchrony 5.4 Service Pack 1
  99. EMC Software: GeoSynchrony 5.4 Service Pack 1 Patch 1
  100.  
  101. Third Party Software: VMware ESXi 5.5 u2
  102. Third Party Software: VMware ESXi 6.0
  103. Cause VMware vSphere version 5.5.0 Update 2 (build 2068190) and vSphere 6.0 use Atomic Test & Set (ATS) for VMFS heartbeat locking. Prior to version 5.5.0 U2, SCSI-2 non-persistent reservations were used for this purpose.
  104.  
  105. A host indicates its liveness by periodically performing I/O to its heartbeat on a given volume. Therefore, if no activity is seen on the host's heartbeat slot for a period of time, then we can conclude that the host has lost connectivity to the volume.
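The liveness rule described above reduces to a simple age check on the last heartbeat update. The sketch below is illustrative only: the function name is hypothetical and the time-out is a required parameter, because this article does not state the value ESXi uses.

```python
def host_appears_live(last_heartbeat_s, now_s, timeout_s):
    """Return True if the heartbeat slot was updated within the time-out window.

    All arguments are in seconds. `timeout_s` must be supplied by the caller;
    no default is given because the real VMFS value is not stated here.
    """
    return (now_s - last_heartbeat_s) <= timeout_s


# Arbitrary example values, not real VMFS timings.
print(host_appears_live(last_heartbeat_s=100.0, now_s=103.0, timeout_s=5.0))
print(host_appears_live(last_heartbeat_s=100.0, now_s=120.0, timeout_s=5.0))
```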
  106.  
  107. ATS heartbeat I/O has a very low time-out value, which can lead to host disconnects and application outages. On the hosts this appears as loss of connection to disks and/or performance degradation.
  108.  
  109. The host then registers the miscompare on the heartbeat slot and aborts all active IO on the LUN as it issues the reset. All pending IO on this LUN will fail with host sense 8 (H:0x8 SCSI reset).
  110.  
  111. Example messages from ESXi host using NMP:
  112.  
  113. 2015-10-01T00:31:00.333Z cpu9:33645)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x89 (0x412e82aeed40, 32805) to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" on path "vmhba2:C0:T5:L10" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL
  114. 2015-10-01T00:31:00.333Z cpu9:33645)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60001440000000XXXXXXXXXXXXXXXXXX" state in doubt; requested fast path state update...
  115. 2015-10-01T00:31:00.333Z cpu9:33645)ScsiDeviceIO: 2338: Cmd(0x412e82aeed40) 0x89, CmdSN 0x72b97 from world 32805 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
  116. 2015-10-01T00:31:01.333Z cpu9:33645)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60001440000000XXXXXXXXXXXXXXXXXX" state in doubt; requested fast path state update...
  117. 2015-10-01T00:31:01.333Z cpu9:33645)ScsiDeviceIO: 2338: Cmd(0x413686ad0b80) 0x89, CmdSN 0x72b9a from world 32805 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
  118. 2015-10-01T00:31:01.406Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x41368008ee80) 0x2a, CmdSN 0x8000005d from world 1655292 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  119. 2015-10-01T00:31:01.406Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x413686778800) 0x2a, CmdSN 0x8000004d from world 1655292 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  120. 2015-10-01T00:31:01.406Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x4136838cc140) 0x2a, CmdSN 0x80000049 from world 1655292 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  121. 2015-10-01T00:31:01.608Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x4136848c5c00) 0x2a, CmdSN 0x80000065 from world 1655292 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  122. 2015-10-01T00:31:01.609Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x4136836fde80) 0x2a, CmdSN 0x8000002c from world 1655292 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  123. 2015-10-01T00:31:01.811Z cpu9:33645)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136804206c0, 1655292) to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" on path "vmhba2:C0:T5:L10" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
  124. 2015-10-01T00:31:02.014Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x4136848cb740) 0x28, CmdSN 0x72b98 from world 34950 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  125. 2015-10-01T00:31:02.014Z cpu18:34950)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 4161536 gen 297 stampUS 933180151199 uuid 551234ba-5123418f-0123-7123457d566e jrnl <FB 15038> drv 14.60] on vol 'VPLEX-VOLUME-NAME'
  126. 2015-10-01T00:31:02.015Z cpu9:33645)ScsiDeviceIO: 2307: Cmd(0x41368386e100) 0x2a, CmdSN 0x72b99 from world 32805 to dev "naa.60001440000000XXXXXXXXXXXXXXXXXX" failed H:0x8 D:0x0 P:0x0
  127. 2015-10-01T00:31:05.675Z cpu9:33039)VMW_SATP_INV: satp_inv_UpdatePath:754: Failed to update path "vmhba3:C0:T5:L10" state. Status Transient storage condition, suggest retry
  128.  
  129.  
  130. Note: The host reports connectivity issues to the device. This is not a physical connection issue; it is the result of a single LUN reset issued by the host. There is no path loss to the storage.
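To gauge how many in-flight commands a single LUN reset aborted, the H:0x8 failures in the VMkernel log can be counted per SCSI opcode. This is a rough sketch, assuming the two line layouts shown in the example messages; the function name is illustrative. Because the same command can be logged by both NMP and ScsiDeviceIO, each command address is counted once.

```python
import re
from collections import Counter

# ScsiDeviceIO layout: "Cmd(0xADDR) 0xOP,"   NMP layout: "Cmd 0xOP (0xADDR,"
CMD_RE = re.compile(
    r"Cmd\((?P<a1>0x[0-9a-f]+)\) 0x(?P<o1>[0-9a-f]+),"
    r"|Cmd 0x(?P<o2>[0-9a-f]+) \((?P<a2>0x[0-9a-f]+),",
    re.I,
)
HOST_STATUS_RE = re.compile(r"H:0x([0-9a-f]+)", re.I)

SCSI_RESET = 0x8  # host status reported for IO aborted by the reset
OPCODE_NAMES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x89: "COMPARE AND WRITE"}


def count_reset_aborts(lines):
    """Count commands that failed with H:0x8, keyed by opcode, one per command address."""
    seen = set()
    counts = Counter()
    for line in lines:
        cmd = CMD_RE.search(line)
        host = HOST_STATUS_RE.search(line)
        if cmd is None or host is None or int(host.group(1), 16) != SCSI_RESET:
            continue
        address = (cmd.group("a1") or cmd.group("a2")).lower()
        if address in seen:
            continue  # already counted via the other log source
        seen.add(address)
        counts[int(cmd.group("o1") or cmd.group("o2"), 16)] += 1
    return counts
```

Mapping the resulting keys through `OPCODE_NAMES` gives a readable summary of what the reset interrupted.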
  131. Change Host upgrade to ESXi version 5.5.0 Update 2 (build 2068190) or higher.
  132. Resolution There is no workaround at this time from an EMC array perspective. Customers can engage VMware for confirmation of the issue, or provide an ESXi EMCgrab with vmsupport for confirmation as per KB 15034 (https://support.emc.com/kb/15034). Currently, disabling the VAAI ATS HeartBeat functionality on the ESXi server is recommended for affected customers.
  133.  
  134. See VMware KB 2113956 for more information: http://kb.vmware.com/kb/2113956.
  135.  
  136. Disabling this will revert the host back to SCSI-2 reservations legacy mode.
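As a convenience, the host-side change can be scripted. The sketch below only builds the command line and does not run it. The `esxcli` invocation and the `/VMFS3/UseATSForHBOnVMFS5` option name are recalled from VMware KB 2113956 and are not quoted in this article, so treat them as assumptions and confirm against that KB before use; the function name is illustrative.

```python
def ats_heartbeat_command(enable):
    """Build the esxcli argument list to enable or disable ATS heartbeat on VMFS5.

    Assumption: option name and syntax as recalled from VMware KB 2113956;
    verify against the KB before running this on a host.
    """
    value = "1" if enable else "0"
    return ["esxcli", "system", "settings", "advanced", "set",
            "-i", value, "-o", "/VMFS3/UseATSForHBOnVMFS5"]


# Print the command that would disable ATS heartbeat (nothing is executed here).
print(" ".join(ats_heartbeat_command(enable=False)))
```

Building the argument list separately makes it possible to review or log the exact command before it is run on each affected host.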
  137. Notes Please note: This solution only applies if the host is getting a miscompare (sense data 0E/1D/00) returned from the VPLEX on a SCSI OpCode 0x89 command (Compare & Write), which is used by VAAI ATS HeartBeat. If the 0x89 (Compare & Write) command is failing for other reasons (such as timeouts or host aborts), this solution does not apply and EMC support should be engaged.
  138.  
  139. Product VPLEX Series, VPLEX Geo, VPLEX Local, VPLEX Metro, VPLEX VS1, VPLEX VS2, VPLEX Virtual Edition, VPLEX GeoSynchrony 5.1, VPLEX GeoSynchrony 5.1 Patch 1, VPLEX GeoSynchrony 5.1 Patch 2, VPLEX GeoSynchrony 5.1 Patch 3, VPLEX GeoSynchrony 5.1 Patch 4, VPLEX GeoSynchrony 5.2, VPLEX GeoSynchrony 5.2 Patch 1, VPLEX GeoSynchrony 5.2 Service Pack 1, VPLEX GeoSynchrony 5.2 Service Pack 1 Patch 1, VPLEX GeoSynchrony 5.2 Service Pack 1 Patch 2, VPLEX GeoSynchrony 5.2 Service Pack 1 Patch 3, VPLEX GeoSynchrony 5.3, VPLEX GeoSynchrony 5.3 Patch 1, VPLEX GeoSynchrony 5.3 Patch 2, VPLEX GeoSynchrony 5.3 Patch 3, VPLEX GeoSynchrony 5.3 Patch 4, VPLEX GeoSynchrony 5.4, VPLEX GeoSynchrony 5.4 Service Pack 1, VPLEX GeoSynchrony 5.4 Service Pack 1 Patch 1
  140. Requested Publish Date 9/14/2015 5:27 AM