LUN locking up on VMware

Software-based VM-centric and flash-friendly VM storage + free version

entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Sun May 10, 2020 11:33 am

Hi everyone,

I'm having some issues with VSAN serving up an iSCSI LUN to a VMware ESXi 6.7 server. The VSAN VM resides on the same physical server it is serving the LUN to. The VM has two NICs configured as storage NICs; these are passed through a Standard vSwitch to two physical NICs on the host (which are also dedicated storage NICs).
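For anyone wanting to reproduce this wiring from PowerCLI, a rough sketch follows. The host name, vmnic numbers, and port group name are placeholders, not my actual config:

Code: Select all

# Sketch only - placeholder names; adjust to your host.
$vmhost = Get-VMHost -Name 'esxi-host.example.local'
# Standard vSwitch backed by the two dedicated physical storage NICs
$vss = New-VirtualSwitch -VMHost $vmhost -Name 'vSwitch-Storage' -Nic vmnic2,vmnic3
# Port group the StarWind VM's two storage vNICs attach to
New-VirtualPortGroup -VirtualSwitch $vss -Name 'Storage-iSCSI'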

Specs of the setup:
Virtual SAN VM OS: Windows Server 2016 (Core) (have tried 2019, same result)
Virtual SAN version: StarWind v8.0.0 build 13481 (have tried 12767, same result)
ESXi version: 6.7
Storage disks: 4x 1TB 7.2K Enterprise 2.5" disks in RAID-10 (served as a VMFS-5 volume -> 1.5TB VMDK attached to the VSAN VM)
Virtual SAN VM OS disk: stored in the same datastore as the Storage disks
Cache disk: Kingston HyperX PCI-E 256GB SSD (served as a VMFS-5 volume -> 200GB VMDK attached to the VSAN VM)
RAM cache: 3GB
Server: HP DL360e Gen8
RAID card: HP P420

Within the StarWind VM, I have three disks:
C: - NTFS, OS, located on the "Storage Disks" described above
D: - NTFS, The actual LUN storage location, located on the "Storage Disks" described above
E: - NTFS, The Cache SSD storage, located on the "Cache Disk" described above

Basically, when the storage is under significant load (e.g. if I start all the VMs at once, or one VM performs a lot of reads/writes), the LUN completely locks up and stops responding for a period of time. The CPU is fairly idle and the RAM has room to breathe as well.

When I was running Windows Server 2019, the D: volume that holds the VSAN LUN would go to 100% active time in Task Manager while showing zero read and write operations occurring (the "Frozen disk" screenshot attached to my post shows this). The Event Viewer -> System log within Windows would show stacks of errors around the disks when this happened as well (unfortunately, since I've rebuilt the VM on Server 2016, I no longer have these handy).
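If you want to capture these errors before a rebuild, something like the snippet below should pull recent disk errors out of the System log. The provider name is a best guess and varies by storage driver:

Code: Select all

# Sketch: dump recent error-level disk events from the System log.
# Provider names vary by driver ('disk', 'storahci', 'stornvme', ...).
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'disk'
    Level        = 2      # 2 = Error
} -MaxEvents 50 | Format-Table TimeCreated, Id, Message -AutoSize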

Running Server 2016, the disk does not show 100% active time, but reads and writes still fall right down to zero. I have confirmed the D: volume is unresponsive within the OS itself; strangely, the C: volume on the same physical disks keeps operating completely fine, so I don't believe the problem is the physical disks or the RAID card. The event logs show nothing useful in Server 2016.
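For anyone reproducing that check: a crude probe is to time a small write to the volume. On the frozen D: volume a command like the one below never returns, while on C: it completes in milliseconds. The path is just an example:

Code: Select all

# Sketch: time a 1 MB write as a volume responsiveness probe (example path).
Measure-Command {
    [System.IO.File]::WriteAllBytes('D:\probe.tmp', (New-Object byte[] (1MB)))
}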

If I restart the VSAN service, everything will start working again for a short amount of time, then lock up again.

In the StarWind server logs, I can see the following message repeated over and over:
5/10 21:07:22.552335 b68 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
5/10 21:07:22.552343 b64 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
5/10 21:07:22.552356 b68 IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: D:\VSAN\LS-72K-Data1.img. Error code: (19).
5/10 21:07:22.552362 b64 IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: D:\VSAN\LS-72K-Data1.img. Error code: (19).

Things I've tried:
- Virtual SAN version 12767 and 13481 (currently on 13481)
- Windows Server 2019 (this showed the Active Time as 100% and zero read/write activity when the LUN would lock up)
- Removing the SSD cache
- Removing the RAM cache
- Switching to the VMware Paravirtual storage adapter from the default LSI (currently using this)

I'm a little unsure what could possibly be wrong, as I've been running VSAN on Hyper-V for about four years now across multiple instances without any hassles.

I'm not really sure where to go from here, and since I'm using the free version I can't lean on support in this case. Any help is greatly appreciated; otherwise I may have to fall back to using the disks directly in VMware without VSAN.
Attachment: "Frozen disk" (unknown.png, 64.13 KiB)
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Tue May 12, 2020 9:57 am

Hi, entr0p1

What I would recommend is increasing the amount of RAM to 11 GB for each StarWind VM: the baseline recommendation is 8 GB, and adding the 3 GB of RAM cache you currently have makes 11 GB (all reserved). Also, please reconfigure the vCPU layout: there should be 8 vCPUs (2 sockets with 4 vCPUs each). This is not related to the problem you have, just a general recommendation.
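If it helps, here is a PowerCLI sketch of those settings. The VM name is a placeholder, and the parameter names should be double-checked against your PowerCLI version:

Code: Select all

# Sketch only - placeholder VM name; verify parameters in your PowerCLI build.
$vm = Get-VM -Name 'StarWind-VSAN-VM'
Set-VM -VM $vm -NumCpu 8 -CoresPerSocket 4 -MemoryGB 11 -Confirm:$false
# Reserve all of the VM's memory so none of it can be reclaimed by the host
Get-VMResourceConfiguration -VM $vm | Set-VMResourceConfiguration -MemReservationGB 11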

Let's return to the problem at hand.
It looks like this problem is related to ESXi: there is a VMDK freeze. It can be caused by snapshots, underlying storage issues, or SCSI reset events.
Could you share the logs collected with the StarWind log collector (https://knowledgebase.starwindsoftware. ... collector/) with me? Please use Google Drive or any similar file-sharing service.

Questions:
1) Which SCSI controller do you use for the VM?
2) Was the disk eager zeroed before use?
3) Any snapshots?
entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Wed May 13, 2020 4:59 am

Thanks for your response; once I've got more resources added to the physical host, I will definitely make those changes.

I've gathered the logs and sent them to you in a PM.

The SCSI controller is VMware Paravirtual; I've tried the LSI controller as well and get the same result. VMware Tools are installed and up to date, and Windows is fully patched. The VMDK itself is thick provision lazy zeroed, there are no snapshots, and this VM never has them taken. I think you're onto something with the VMDK though: it would make sense, as other VMDKs on the same volume (even ones attached to the VSAN VM itself) are not affected when the problem occurs.
entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Thu May 14, 2020 1:29 pm

Just an update: I've switched back to Server 2019 and re-provisioned the underlying VMDK as a thick provision eager zeroed disk. Sadly, the same errors are still being generated in the logs. I'll wait to hear back.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Thu May 14, 2020 4:01 pm

entr0p1 wrote:The SCSI controller is VMware Paravirtual
We recommend using the LSI Logic SAS controller. Here is the related forum thread: https://communities.vmware.com/thread/408224. Please note that the disk has to be brought back online after you make the change. Please do not do that for the system disk.
entr0p1 wrote:thick provision lazy zeroed
We recommend using eager zeroed disks. I can see that you have re-provisioned everything as eager zeroed.
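For reference, new eager zeroed thick disks can also be created from PowerCLI; a sketch with placeholder names and sizes:

Code: Select all

# Sketch: add an eager zeroed thick VMDK (VM, size, and datastore are placeholders).
New-HardDisk -VM 'StarWind-VSAN-VM' -CapacityGB 1536 `
    -StorageFormat EagerZeroedThick -Datastore 'Datastore-72K'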

I will look into the logs, thank you for sharing them.
entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Fri May 15, 2020 1:39 am

yaroslav (staff) wrote:We recommend using the LSI Logic SAS controller.
Okay, no problem. I have now switched to the LSI SAS controller.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Mon May 18, 2020 6:30 pm

entr0p1

Sorry for the delayed response. I believe I have found the root cause.
5/10 20:09:45.834836 b68 IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: E:\VSAN\SS-00K-Cache1.img. Error code: (19).
5/10 20:09:45.838936 b68 Cache: *** CacheBase::performFlashCacheErrorOperation: Error: Flash Cache is malfunctioned. Reconfigurate Cache to non-active state.

These events were logged just before things went south at 20:12.
520 MEL2-STGSVS01.int.galaxieit.net 914 Error Disk operation failed. Disk path: D:\VSAN\LS-72K-Data1.img. Error code: (19). StarWindService 10.05.2020 20:12

Please note that removing the flash cache file is not enough; you also need to modify the StarWind device headers.
See this article: https://knowledgebase.starwindsoftware. ... dance/661/

Here is how to remove the RAM cache: https://knowledgebase.starwindsoftware. ... -l1-cache/
Perhaps the cache was not removed properly.
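Whatever you edit, stop the service first and keep a backup of the device header file. Roughly like this; the .swdsk path is an assumption based on your .img path, so verify it on your system:

Code: Select all

# Sketch: back up the device header before editing it per the KB article.
# Assumption: the .swdsk header sits next to the .img and shares its name.
Stop-Service -Name StarWindService
Copy-Item 'D:\VSAN\LS-72K-Data1.swdsk' 'D:\VSAN\LS-72K-Data1.swdsk.bak'
# ...edit the header as described in the article, then restart the service:
Start-Service -Name StarWindService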
entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Wed May 20, 2020 2:01 am

yaroslav,

I've discovered that, with all the mucking about with the virtual storage controllers, the disks were placed into read-only mode, which I missed; I'm guessing that's why you saw that error code in the logs. I have fixed the read-only disks and things have improved significantly, though the issue persists.
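That would line up, assuming StarWind is logging a plain Win32 system error code: 19 decodes to ERROR_WRITE_PROTECT, "The media is write protected", which is easy to check from PowerShell:

Code: Select all

# Decode a Win32 system error code (19 = ERROR_WRITE_PROTECT).
(New-Object System.ComponentModel.Win32Exception 19).Message
# -> The media is write protected.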

I've re-captured the logs just to make sure there's nothing misleading in there; the cache disk itself looks to be working perfectly (it doesn't lock up like the primary storage does). I've re-sent an updated set of logs to you in a PM. Let me know if you still think I should remove the cache, but I'm thinking that isn't the problem at this stage.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Wed May 20, 2020 9:34 am

Hi entr0p1,

Happy to know that the issue was resolved. Yes, the logs look fine now: the .img files can be accessed and I can see operations on them.
The cache is still there, but it can be accessed, as can the disk.
entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Tue May 26, 2020 4:21 am

yaroslav (staff) wrote:Hi entr0p1,

Happy to know that the issue was resolved. Yes, the logs look fine now: the .img files can be accessed and I can see operations on them.
The cache is still there, but it can be accessed, as can the disk.
Hi yaroslav,

Thanks for all your help so far. I've just sent you a PM because we're still not quite there yet: the disk is still locking up, although much less frequently. I've collected the logs after a series of failures and can send them through if that helps.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Wed May 27, 2020 2:08 pm

Hi entr0p1,

If there is anything I can assist you with, I will be happy to help.
Could you try recreating the disk from scratch? Is there any production data on it?
entr0p1
Posts: 13
Joined: Sun May 10, 2020 11:10 am

Sun Jun 14, 2020 7:10 am

For anyone who happens across this in the future: we got to the bottom of it. There were several contributing factors:

1. Re-provisioning the thick provision lazy zeroed VMDK that held the StarWind volume as thick provision eager zeroed
2. Switching the StarWind VM's virtual storage controller to LSI SAS
3. Applying this fix in the StarWind VM to resolve the virtual controller resets (a rough sketch of this kind of tweak follows the list): https://knowledgebase.starwindsoftware. ... ontroller/
4. Updating to the latest build of Virtual SAN
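The exact steps are in the linked article. As a rough illustration of the kind of guest-side tweak involved (an assumption on my part, not necessarily the article's literal content), raising the Windows disk I/O timeout in the registry looks like this:

Code: Select all

# Assumption: illustrative only - follow the linked KB article for the actual fix.
# Raises how long Windows waits on a disk request before declaring a timeout.
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Disk' `
    -Name 'TimeoutValue' -Value 60 -Type DWord
# A reboot is required for the change to take effect.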

Also, pay close attention if you modify the virtual storage controller in the VM config: Windows will very likely put the disks into read-only mode when you bring them back online. You can use the following PowerShell commands to bring a disk online and set it to read-write mode:

Code: Select all

Set-Disk -Number <DISK ID NUMBER> -IsOffline $False

Code: Select all

Set-Disk -Number <DISK ID NUMBER> -IsReadOnly $False
You can obtain the Disk ID number from:

Code: Select all

Get-Disk
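Or, to sweep every offline or read-only disk in one pass (a sketch; review the list first so you don't flip disks that are read-only on purpose):

Code: Select all

# Sketch: bring all offline disks online and clear the read-only flag.
# Review Get-Disk output first - don't flip disks that are read-only on purpose.
Get-Disk | Where-Object { $_.IsOffline -or $_.IsReadOnly } | ForEach-Object {
    Set-Disk -Number $_.Number -IsOffline $False
    Set-Disk -Number $_.Number -IsReadOnly $False
}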
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Mon Jun 15, 2020 3:24 am

entr0p1,

Thank you for the guide.
I am really happy that the issue was resolved.