LUN locking up on VMware
Posted: Sun May 10, 2020 11:33 am
Hi everyone,
I'm having some issues with VSAN serving up an iSCSI LUN to a VMware ESXi 6.7 server. The VSAN VM resides on the same physical server it is serving the LUN to. The VM has two NICs configured as storage NICs; these are connected through a Standard vSwitch to two physical NICs on the host (which are also dedicated storage NICs).
Specs of the setup:
Virtual SAN VM OS: Windows Server 2016 (Core) (have tried 2019, same result)
Virtual SAN version: Starwind v8.0.0 build 13481 (have tried 12767, same result)
ESX version: 6.7
Storage disks: 4x 1TB 7.2K Enterprise 2.5" disks in RAID-10 (served as a VMFS-5 volume -> 1.5TB VMDK attached to the VSAN VM)
Virtual SAN VM OS disk: stored in the same datastore as the Storage disks
Cache disk: Kingston HyperX PCI-E 256GB SSD (served as a VMFS-5 volume -> 200GB VMDK attached to the VSAN VM)
RAM cache: 3GB
Server: HP DL360e Gen8
RAID card: HP P420
Within the Starwind VM, I have three disks:
C: - NTFS, OS, located on the "Storage Disks" described above
D: - NTFS, The actual LUN storage location, located on the "Storage Disks" described above
E: - NTFS, The Cache SSD storage, located on the "Cache Disk" described above
Basically, when the storage is under significant load (e.g. if I start all the VMs at once, or one VM performs a lot of reads/writes), the LUN completely locks up and stops responding for a period of time. The CPU is fairly idle and the RAM has room to breathe as well.
When I was running Windows Server 2019, the D: volume that holds the VSAN LUN would go to 100% active time in Task Manager while showing zero read and write operations occurring (the "Frozen disk" screenshot attached to my post shows this). The Event Viewer -> System log within Windows would show stacks of disk-related errors when this happened as well (unfortunately, since I've rebuilt the VM on Server 2016, I no longer have these handy).
Running Server 2016, the disk does not show 100% active time, but reads and writes still fall right down to zero. I have confirmed the D: volume is unresponsive within the OS itself. Strangely, this does not affect the C: volume of the VM, which keeps operating completely fine, so I don't believe the problem is the physical disks or the RAID card. The event logs do not show anything useful in Server 2016.
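For reference, this is roughly the kind of timed-write probe I run inside the guest to confirm the stall (a quick sketch, not anything official; the path is a placeholder, and on the real VM I point it at a file on the D: volume):

```python
import os
import time

def probe_write(path, size=4096, warn_after=5.0):
    """Time one small synchronous write. A healthy volume completes in
    milliseconds; a stalled one takes seconds or hangs entirely."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(os.urandom(size))
        f.flush()
        os.fsync(f.fileno())  # push the write past the OS cache to the volume
    elapsed = time.monotonic() - start
    if elapsed > warn_after:
        print(f"STALLED: write took {elapsed:.1f}s")
    else:
        print(f"OK: write took {elapsed * 1000:.1f}ms")
    return elapsed

# On the affected VM this would be something like r"D:\probe.bin"
probe_write("probe.bin")
os.remove("probe.bin")
```

When the LUN locks up, this hangs on D: while the same probe on C: returns in milliseconds, which is how I ruled out the VM as a whole being frozen.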
If I restart the VSAN service, everything will start working again for a short amount of time, then lock up again.
In the Starwind Server logs, I can see the following message repeated over and over:
5/10 21:07:22.552335 b68 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
5/10 21:07:22.552343 b64 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
5/10 21:07:22.552356 b68 IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: D:\VSAN\LS-72K-Data1.img. Error code: (19).
5/10 21:07:22.552362 b64 IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: D:\VSAN\LS-72K-Data1.img. Error code: (19).
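In case it helps with diagnosis, here's my reading of those two codes (a sketch; I'm assuming the log's "Error code" field is a standard Win32 system error, which I haven't confirmed with StarWind):

```python
# Decoding the two codes from the log lines above.
# The ScsiStatus values come from the SCSI standard; the numeric Win32
# meanings are from winerror.h. Whether StarWind's "Error code" field is
# really a Win32 code is my assumption, not something the log states.
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",  # device reported an error; sense data has details
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
}
WIN32_ERROR = {
    19: "ERROR_WRITE_PROTECT: The media is write protected.",
    21: "ERROR_NOT_READY: The device is not ready.",
}

print(SCSI_STATUS[2])   # the "ScsiStatus = 2" in the log
print(WIN32_ERROR[19])  # the "Error code: (19)" in the log
```

If it really is ERROR_WRITE_PROTECT, that would suggest the D: volume (or the VMDK underneath it) briefly goes read-only under load, rather than the .img file itself being corrupt.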
Things I've tried:
- Virtual SAN version 12767 and 13481 (currently on 13481)
- Windows Server 2019 (this showed the Active Time as 100% and zero read/write activity when the LUN would lock up)
- Removing the SSD cache
- Removing the RAM cache
- Switching to the VMware Paravirtual storage adapter from the default LSI (currently using this)
I'm a little unsure as to what could possibly be wrong, as I've been running VSAN on Hyper-V for about four years now across multiple instances without any hassles.
I'm not really sure where to go from here, and since I'm on the free version I can't lean on support in this case. Any help is greatly appreciated; otherwise I may have to fall back to using the disks directly in VMware without VSAN.