I have a 2-node HA hyperconverged cluster which I have been running for close to a year now. While StarWind has worked well for most of this time, there have unfortunately been ~4 occasions over the past year where I experienced serious data corruption events. These were serious data loss events which required me to restore my VMs from cold backup. This is especially painful for us because we have about 6TB of Exchange mailboxes sitting on top of StarWind which take a few days to completely restore, and this is highly visible and disruptive for our users. StarWind support has not been able to determine the root cause of these events, and the recommended workaround has been to simply stop using LSFS. I'm not sure if LSFS is still considered a "beta" feature and not recommended in certain scenarios, but I never came across any recommendations against using it in all of my research prior to purchasing. In fact, I was partly sold on literature and knowledge base articles highlighting LSFS as a major benefit. I'm writing about my experience in case others have run into similar issues, and in the hope that this will ultimately help StarWind build a better product with a more stable LSFS.
Technical specs for my environment:
2-node StarWind HA and Microsoft Hyper-V 2016 hyperconverged setup
Each server has the following relevant specs (servers are identical configuration):
Dell PowerEdge R730xd
Windows 2016 Datacenter Core with Hyper-V role
2x Xeon E5-2667 v4 (16 cores total)
512GB RAM (16 x 32GB DIMMS)
Dell PERC H730P Mini RAID controller with 2GB BBWC
2x 120GB Intel SSD
16x 1.6TB Intel S3610 Series SSD (rated as "mixed-use" by Dell)
6x 10Gb Intel X520 network ports
2x 1Gb Intel I350 network ports
Disk configuration:
2x 120GB RAID 1 for OS (write-through cache)
16x 1.6TB RAID 5 virtual disk for Hyper-V, write-back cache policy (I've tried disabling cache, which made no difference to the issue being discussed)
Network configuration:
2x10Gb NICs for Starwind Sync (direct attached via twin-ax cables)
2x10Gb NICs for iSCSI traffic (tried both direct attached and through 10Gb switches)
2x10Gb+2x1Gb NICs for VM guest traffic
Starwind Virtual Disk configuration:
1GB/Flat/HA/0 L1 cache - used for Windows Failover Cluster Witness
1TB/Flat/HA/4GB L1 Write-Back cache - used for VM disks
4TB/Flat/HA/0 L1 cache - used for VM disks
4TB/LSFS/HA/4G L1 write-back cache - used for VM disks
2TB/LSFS/HA/Dedupe/16GB L1 write-back cache - used for VM OS disks
4TB/LSFS/HA/4GB L1 write-back cache - used for VM disks
When I initially deployed the solution, I was pretty aggressive with LSFS+dedupe on most of the volumes. I eventually backed off on these features after running into data corruption issues. To make a long story short, I was able to *reliably* trigger the failure sequence: system hangs, high CPU usage, then finally data corruption. All I had to do was delete an existing LSFS device from an HA pair. This operation would take a long time (~20-30 minutes and sometimes longer) and peg the CPU at 50%-60% usage, even though I had taken measures to make sure the volume was not in use and had disconnected all Windows iSCSI sessions. One would think this would be an "easy" delete, so I have no idea what StarWind is doing during this period. Also, during this load spike *all* of my other volumes would claim to lose sync with their partners and change to an unsynchronized state. Through trial and error, I eventually learned that having my VMs up and running during this event was suicide, as it would (always?? I obviously didn't test all conditions to find out for sure) lead to data corruption. After working with support, we came up with a workaround: I removed dedupe (i.e., rebuilt + migrated) from all volumes except one, and I came up with a procedure to shut down all VMs prior to making any changes to StarWind device configuration. While this was a disappointing experience, and the fact that I had to shut down my environment to make disk changes sort of negated the HA aspect of StarWind, I accepted it for the time being. This setup was stable for about 6 months without any further incidents...
...until mid-March, when Node #2 had a spontaneous vSAN process crash (vSAN process stops, errors logged, connections stop, memory dump created, etc.). This went unnoticed for several days, but after I discovered the issue I rebooted the node and began scanning all disks for errors. To my relief, most of the disks seemed to be error free, but I did have ~3 corrupt VMs and, critically, one corrupted Exchange database. I waited a couple of days for the StarWind devices to finish sync, but there was one particular device that would not sync. For some unknown reason, Node #2 was reporting that Node #1 was unavailable even though all other devices had finished syncing (all devices use the same network interfaces for heartbeat/sync). Also odd was that even though #2 reported it could not connect to #1, #1 was reporting that it could connect to #2 just fine. I removed and re-added sync/heartbeat interfaces on both nodes, but it appeared that #1 was not accepting StarWind iSCSI connections. I waited until the weekend maintenance window to address these issues, as I knew it would probably involve downtime...
This past weekend was the maintenance window, and the first issue I wanted to resolve was the Exchange database corruption. I wanted to restore a previous database from backup to a separate disk and scan it for errors prior to attempting a database rebuild with transaction logs. Since our stores are quite large (600-800GB each), I decided to add a temporary Hyper-V disk to hold the database. In hindsight I should have shut down most of my environment prior to attempting this, but previous incidents had been triggered by StarWind device changes, so I didn't think a Hyper-V change would be an issue. Unfortunately it seems I stumbled into a new StarWind failure mode: sometime after I finished creating my 1.5TB Hyper-V disk (thick provisioned, on top of an LSFS device), the StarWind process on Node #1 crashed. I'm not exactly sure if it crashed during the disk creation or shortly after (the operation took quite some time, so I was not watching when it happened), but after Node #1 crashed, all hell broke loose. After a couple of reboots on #1, I somehow ended up with some data corruption on *all* of my LSFS volumes. Now, admittedly, I did make an error that may have contributed to the problem: I mistakenly set the "StarWind Cluster Service" to manual start when I meant to set the "StarWind Virtual SAN" service to manual start (I've learned this trick to avoid issues from server updates that require several reboots to complete). I'm not sure if my error had anything to do with it, but in any case I basically lost my entire virtual environment, as every disk had serious errors and most of my VMs would no longer boot.
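For anyone using the same manual-start trick, the two similarly named StarWind services are easy to mix up in services.msc, so it may be safer to script the change. A minimal sketch using the built-in Windows `sc.exe` tool; note that "StarWindService" is an assumed short service name (I'm going from the display names), so verify the real name on your own nodes with `sc.exe query` before changing anything:

```shell
:: Check the current startup configuration first -- the short name
:: "StarWindService" is a guess; confirm it matches the "StarWind
:: Virtual SAN" display name on your nodes.
sc.exe qc "StarWindService"

:: Set the Virtual SAN service (NOT the Cluster Service) to manual
:: start before a patch cycle that needs several reboots...
sc.exe config "StarWindService" start= demand

:: ...and restore automatic start once patching is done and the
:: node is healthy again.
sc.exe config "StarWindService" start= auto
```

The space after `start=` is required by `sc.exe` syntax. Querying before and after the change is a cheap way to confirm you touched the service you intended.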
I'm currently waiting for another maintenance window (or several) to remove LSFS from my environment completely, but I'm really upset that there's not some sort of warning label on that feature saying it should not be used in certain scenarios. I'm also not 100% sure that even removing LSFS will avoid the issue, since my confidence in this product is very low. I'm not sure if StarWind is fully aware of these issues, and I have really worked with support in good faith, so this post is mostly in the hope that it will help someone at StarWind recreate the problem and fix the issue. I am really rooting for StarWind to succeed and want to like the product, but data corruption is really unacceptable and a non-starter for a data storage product.
PS - I've searched the forum for similar posts and I've only found this one - I'm not sure if no one else is experiencing this or if this forum is highly moderated...
https://forums.starwindsoftware.com/vie ... ess#p29281