I have a 2-node HA hyperconverged cluster which I have been running for close to a year now. While StarWind has worked well for most of this time, there have unfortunately been ~4 occasions over the past year where I experienced serious data corruption events. These were serious data loss events which required me to restore my VMs from cold backup. This is especially painful for us because we have about 6TB of Exchange mailboxes sitting on top of StarWind which take a few days to completely restore, and this is highly visible and disruptive for our users. StarWind support has not been able to determine the root cause of these events, and the recommended workaround has been to simply stop using LSFS. I'm not sure if LSFS is still considered a "beta" feature and not recommended in certain scenarios, but I never came across any recommendations against using it in all of my research prior to purchasing. In fact, I was partly sold on literature and knowledge base articles highlighting LSFS as a major benefit. I'm writing about my experience in case others have run into similar issues, and in the hope that this will ultimately help StarWind build a better product with a more stable LSFS.
Technical specs for my environment:
2-node StarWind HA and Microsoft Hyper-V 2016 hyperconverged setup
Each server has the following relevant specs (servers are identical configuration):
Dell PowerEdge R730xd
Windows 2016 Datacenter Core with Hyper-V role
2x Xeon E5-2667 v4 (16 cores total)
512GB RAM (16 x 32GB DIMMS)
Dell PERC H730P Mini RAID controller with 2GB BBWC
2x 120GB Intel SSD
16x 1.6TB Intel S3610 Series SSD (rated as "mixed-use" by Dell)
6x 10Gb Intel X520 network ports
2x 1Gb Intel I350 network ports
Disk configuration:
2x 120GB RAID 1 for OS (write-through cache)
16x 1.6TB RAID 5 virtual disk for Hyper-V, write-back cache policy (I've tried disabling cache, which made no difference to the issue being discussed)
Network configuration:
2x10Gb NICs for Starwind Sync (direct attached via twin-ax cables)
2x10Gb NICs for iSCSI traffic (tried both direct attached and through 10Gb switches)
2x10Gb+2x1Gb NICs for VM guest traffic
Starwind Virtual Disk configuration:
1GB/Flat/HA/0 L1 cache - used for Windows Failover Cluster Witness
1TB/Flat/HA/4GB L1 Write-Back cache - used for VM disks
4TB/Flat/HA/0 L1 cache - used for VM disks
4TB/LSFS/HA/4G L1 write-back cache - used for VM disks
2TB/LSFS/HA/Dedupe/16GB L1 write-back cache - used for VM OS disks
4TB/LSFS/HA/4GB L1 write-back cache - used for VM disks
When I initially deployed the solution, I was pretty aggressive with LSFS+dedupe on most of the volumes. I eventually backed off on these features after running into data corruption issues. To make a long story short, I was able to *reliably* trigger the failure sequence: system hangs, high CPU usage, then finally data corruption. All I had to do was delete an existing LSFS device from an HA pair. This operation would take a long time (~20-30 minutes and sometimes longer) and peg the CPU at 50%-60% usage, even though I had taken measures to make sure the volume was not in use and had disconnected all Windows iSCSI sessions. One would think this would be an "easy" delete, so I have no idea what StarWind is doing during this period. Also, during this load spike *all* of my other volumes would claim to lose sync with their partners and change to an unsynchronized state. Through trial and error, I eventually learned that having my VMs up and running during this event was suicide, as it would (always?? I obviously didn't test all conditions to find out for sure) lead to data corruption. After working with support, we came up with a workaround: I removed dedupe (i.e., rebuilt + migrated) from all volumes except one, and I came up with a procedure to shut down all VMs prior to making any changes to StarWind device configuration. While this was a disappointing experience, and the fact that I had to shut down my environment to make disk changes sort of negated the HA aspect of StarWind, I accepted it for the time being. This setup was stable for about 6 months without any further incidents...
...until mid-March, when Node #2 had a spontaneous vSAN process crash (vSAN process stops, errors logged, connections stop, memory dump created, etc.). This went unnoticed for several days, but after I discovered the issue I rebooted the node and began scanning all disks for errors. To my relief, most of the disks seemed to be error free, but I did have ~3 corrupt VMs and, critically, one corrupted Exchange database. I waited a couple of days for the StarWind devices to finish sync, but there was one particular device that would not sync. For some unknown reason, Node #2 was reporting that Node #1 was unavailable even though all other devices had finished syncing (all devices use the same network interfaces for heartbeat/sync). Also odd was that even though #2 reported it could not connect to #1, #1 was reporting that it could connect to #2 just fine. I removed and re-added sync/heartbeat interfaces on both nodes, but it appeared that #1 was not accepting StarWind iSCSI connections. I waited until the weekend maintenance window to address these issues, as I knew it would probably involve downtime...
This past weekend was the maintenance window, and the first issue I wanted to resolve was the Exchange database corruption. I wanted to restore a previous database from backup to a separate disk and scan it for errors prior to attempting a database rebuild with transaction logs. Since our stores are quite large (600-800GB each), I decided to add a temporary Hyper-V disk to hold the database. In hindsight I should have shut down most of my environment prior to attempting this, but previous incidents had been triggered by StarWind device changes, so I didn't think a Hyper-V change would be an issue. Unfortunately it seems I stumbled into a new StarWind failure mode: sometime after I finished creating my 1.5TB Hyper-V disk (thick provisioned, on top of an LSFS device), the StarWind process on Node #1 crashed. I'm not exactly sure if it crashed during the disk creation or shortly after (the operation took quite some time, so I was not watching when it happened), but after Node #1 crashed, all hell broke loose. After a couple of reboots on #1, I somehow ended up with some data corruption on *all* of my LSFS volumes. Now, admittedly, I did make an error that may have contributed to the problem: I mistakenly set the "StarWind Cluster Service" to manual start when I meant to set the "StarWind Virtual SAN" service to manual start (I've learned this trick to avoid issues from server updates that require several reboots to complete). I'm not sure if my error had anything to do with it, but in any case I basically lost my entire virtual environment, as every disk had serious errors and most of my VMs would no longer boot.
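For anyone using the same manual-start trick, the two similarly named StarWind services are easy to mix up in services.msc, so it may be safer to script the change. A minimal sketch using the built-in Windows `sc.exe` tool; note that "StarWindService" is an assumed short service name (I'm going from the display names), so verify the real name on your own nodes with `sc.exe query` before changing anything:

```shell
:: Check the current startup configuration first -- the short name
:: "StarWindService" is a guess; confirm it matches the "StarWind
:: Virtual SAN" display name on your nodes.
sc.exe qc "StarWindService"

:: Set the Virtual SAN service (NOT the Cluster Service) to manual
:: start before a patch cycle that needs several reboots...
sc.exe config "StarWindService" start= demand

:: ...and restore automatic start once patching is done and the
:: node is healthy again.
sc.exe config "StarWindService" start= auto
```

The space after `start=` is required by `sc.exe` syntax. Querying before and after the change is a cheap way to confirm you touched the service you intended.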
I'm currently waiting for another maintenance window (or several) to remove LSFS from my environment completely, but I'm really upset that there's not some sort of warning label on that feature saying it should not be used in certain scenarios. I'm also not 100% sure that even removing LSFS will avoid the issue, since my confidence in this product is very low. I'm not sure if StarWind is fully aware of these issues, and I have really worked with support in good faith, so this post is mostly in the hope that it will help someone at StarWind recreate the problem and fix the issue. I am really rooting for StarWind to succeed and want to like the product, but data corruption is really unacceptable and a non-starter for a data storage product.
PS - I've searched the forum for similar posts and I've only found this one - I'm not sure if no one else is experiencing this or if this forum is highly moderated...
https://forums.starwindsoftware.com/vie ... ess#p29281