Massive data corruption after SW reboot

Software-based VM-centric and flash-friendly VM storage + free version


craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Mon Jul 27, 2015 2:57 pm

We have had several occurrences of massive corruption after our StarWind servers were rebooted.

We have about 60TB of data served behind each SW box.
Each box has three iSCSI targets, each serving a 20TB volume to ESXi hosts.
Each 20TB LUN is a flat file (no LSFS) with an 8GB L1 cache and a 60GB SSD L2 cache in write-back (WB) mode.

The last time we rebooted both StarWind boxes, they each hung at the shutdown screen for over an hour.
We were forced to power them off manually. When they came back up and the VMs were brought back online, we had massive (and I mean MASSIVE) file corruption across all volumes.
It took many days of chkdsk repairs to get things back to normal, and several hundred GB of data was corrupted to the point where it was completely unrecoverable.

At a later stage we tried to update the SW software on one box in case the problem was a bug. While the SW service was being stopped, it hung for well over an hour again and the process eventually stopped with a timeout error. Once we restarted SW and brought the volumes back online, the VMs once again showed extensive corruption.

Our investigations have led us to believe that the L2 cache on the SSDs is the cause of the issue. When monitoring disk writes to the 20TB LUNs we see huge write volumes going to the SSD cache but very little going to the actual main datastores.
I think the cache flush algorithm is at fault: data is being written to the SSD cache but not to the main volume, or at least not fully.
My assumption is that when the SW process is stopped during a server shutdown, all of the data sitting in the SSD cache has to be destaged to disk. That takes so long that the SW process hangs at the stopping stage and eventually bombs out without finishing. Once the server is rebooted, the L2 cache is discarded by SW, the unwritten data is lost, and we end up with the corruption we saw.
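
For anyone who wants to reproduce the observation, a rough way to compare the write rates is with the perfmon counters below (PowerShell sketch; the drive letters E: for the L2 SSD and D: for the main array are placeholders for whichever volumes hold your cache and datastores):

    # Compare write throughput on the SSD cache volume vs. the main datastore volume.
    $counters = @(
        "\LogicalDisk(E:)\Disk Write Bytes/sec",   # SSD L2 cache volume (placeholder)
        "\LogicalDisk(D:)\Disk Write Bytes/sec"    # main datastore volume (placeholder)
    )

    # Sample every 5 seconds for one minute, then print the average per volume.
    $samples = Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12
    $samples | ForEach-Object { $_.CounterSamples } |
        Group-Object -Property Path |
        ForEach-Object {
            $avgMBps = ($_.Group | Measure-Object -Property CookedValue -Average).Average / 1MB
            "{0} : {1:N1} MB/s average writes" -f $_.Name, $avgMBps
        }

In our case the cache volume shows sustained heavy writes while the datastore volume stays almost idle.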

So I have some questions:
Why is the cache not flushed to the main datastore periodically, or at least as soon as the disk queue has decreased?
Why does the SW process take so long to stop?
Why is the data sitting in the L2 cache discarded on a server reboot? This is a serious flaw. No RAID controller would throw away dirty cache data waiting to be written after a reboot, so why does StarWind?
Is there any way to manually request that SW write all cached data to disk?

We are going to need to reboot our servers again someday. That is inevitable.
We need a way to sync the cache to disk while the service is still running, and to stop the SW service gracefully, without suffering more corruption.

I have been told elsewhere on this forum that the use of L2 cache in WB mode is not recommended.
I completely disagree, as we have been using this kind of caching on Nexenta and Solaris with ZFS for years to massively increase performance and have never seen an adverse effect from a reboot.

What can we do to fix this issue on Starwind?
lucki
Posts: 8
Joined: Sun May 17, 2015 4:35 am

Tue Jul 28, 2015 2:29 am

Buy a support/production licence; otherwise no one from StarWind will actually answer your questions.
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Tue Jul 28, 2015 10:44 am

Why would buying a production license make any difference here?
The codebase is identical whether it's free or paid for.

If this kind of major issue is happening on a free license, how could we ever contemplate paying for a production license when the software has flaws that make it unfit for production use in this scenario?
lucki
Posts: 8
Joined: Sun May 17, 2015 4:35 am

Tue Jul 28, 2015 12:57 pm

With a production key you get 24/7 support.
Their *engineers* and *R&D* departments are ready to help you out, provided you have one.

Don't get me wrong: if the software works, it doesn't matter whether it is production or free, other than the fact that you get fewer features.

My data was corrupted as well, and I got StarWind involved.
The only way I was going to get any help, they told me, was to buy a production key, and even then recovery was at best 60% and not guaranteed.
darklight
Posts: 185
Joined: Tue Jun 02, 2015 2:04 pm

Tue Jul 28, 2015 5:02 pm

As far as I know, the L2 cache is still causing issues in some specific environments or on some hardware. So consider disabling L2 cache until a newer build comes out.

As for me, I disabled it and am temporarily using my SSDs as a plain high-speed flat device for SQL temp DBs and other heavily loaded applications. Works fine for me so far, but I'm still looking forward to the L2 cache being fixed completely.
Tarass (Staff)

Tue Aug 11, 2015 10:48 am

Hi all, and thanks for reporting and investigating. We really appreciate your help and contribution.

New build is in testing phase, ETA 2 weeks.
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Wed Aug 19, 2015 3:28 pm

Hi

Yes, I would happily disable L2 cache at this stage, but the problem is that stopping the SW service for any reason (to upgrade SW, to reboot the server, or even just to disable L2 cache and bring the targets back online) will put us in the same situation as before, where the data in the L2 WB cache doesn't get written to the array and huge corruption occurs.

I can't afford to have this happen again or I will definitely lose customers over it.

So how do I flush L2 cache to disk without bringing down the targets?
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Mon Aug 24, 2015 10:27 am

Are any of the admins able to offer advice on how we can safely shut down SW so we can disable L2 cache?

Is there a command we can run somewhere to force a cache flush to disk?
darklight
Posts: 185
Joined: Tue Jun 02, 2015 2:04 pm

Tue Aug 25, 2015 9:43 am

Disconnecting the clients from the iSCSI targets, so that no new data could be put into the L2 cache, did the job for me. The service then went down properly and within a couple of minutes.
BTW, the recommended setting for L2 cache is write-through... maybe that is also part of the reason.
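
Roughly what I did, as a sketch rather than an official StarWind procedure (it assumes the default iSCSI port 3260 and that the Windows service is named "StarWindService", so verify both on your own box):

    # 1. Check that no initiators still hold an iSCSI connection to this target server.
    #    The output should be empty before going any further.
    #    (On older Windows you can use: netstat -an | findstr :3260)
    Get-NetTCPConnection -LocalPort 3260 -State Established -ErrorAction SilentlyContinue

    # 2. Only when nothing is connected, stop the StarWind service and give it time
    #    to finish whatever flushing it does on shutdown.
    Stop-Service -Name "StarWindService"
    Get-Service -Name "StarWindService"   # should report Stopped before you reboot or reconfigure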
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Tue Aug 25, 2015 1:55 pm

We have tried this. We shut down all the VMs running on the datastore, and then shut down all the hosts after that.
We then left it for about 15 minutes and tried to shut down the SW server, assuming that all pending writes in the L2 cache had been written out. The SW server sat there for an hour trying to stop the SW service before it eventually bombed out and the server shut down. When we brought everything back online, loads of data was missing; we assume it had been sitting in the L2 cache, had never been written to storage, and was dropped once the SW server rebooted.

We would happily disable L2 cache at this stage, but we can't do that without stopping the SW service, and we can't stop the service without running into the same data corruption issue because of the currently enabled L2 cache.
It's a time bomb waiting to go off until we can flush the cache and stop the SW service safely, so that we can disable the L2 cache.
darklight
Posts: 185
Joined: Tue Jun 02, 2015 2:04 pm

Mon Aug 31, 2015 6:04 pm

Did you reboot the whole server? I would try just stopping the StarWind service and seeing what happens. You can run Performance Monitor to see what is going on during the shutdown of the service, which files are in use and so on... you'd probably get more information about SW's behavior on shutdown :?
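
For example, something like this in a second PowerShell window while the service is stopping (the counter instance name "StarWindService" is just my guess at the process name; check the actual name in Task Manager first):

    # Watch the StarWind process's own I/O while the service is being stopped,
    # to see whether it is still destaging the L2 cache. Stop sampling with Ctrl+C.
    Get-Counter -Counter @(
        "\Process(StarWindService)\IO Write Bytes/sec",
        "\Process(StarWindService)\IO Data Operations/sec"
    ) -SampleInterval 5 -Continuous

If the write rate drops to zero long before the service actually finishes stopping, that would point at something other than cache destaging.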
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Wed Sep 09, 2015 8:57 am

We've tried both stopping the service and shutting down; the behavior is the same either way.

The problem is that we can't risk stopping the service or shutting down again until we can first force a cache flush to disk.
If we try to stop the service again, we will hit the same corruption issue as in my original post.

I need to find a way to flush the L2 WB cache to stable storage while the SW service is running. Then I think I could safely stop it and disable L2 cache.

Can any of the SW admins help with this please?
Vladislav (Staff)

Sat Sep 12, 2015 2:52 pm

Hello craggy,

We are almost finished preparing a PowerShell script for you which will force a cache flush to disk.

In order for us to finalize it, could you please clarify your StarWind configuration?

How many StarWind nodes do you have?
Do you have HA StarWind devices mirrored between the nodes, or standalone devices on each node?

Thank you.
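
Most likely it will be based on the StarWindX PowerShell module (the same one used by the sample scripts shipped with StarWind). As a rough skeleton only, with the flush call itself left as a placeholder until the script is finalized:

    # Skeleton only, not the final script. The connection pattern follows the StarWindX
    # sample scripts; the credentials below are the defaults and should be changed.
    Import-Module StarWindX

    $server = New-SWServer -host 127.0.0.1 -port 3261 -user root -password starwind
    try {
        $server.Connect()
        foreach ($device in $server.Devices) {
            Write-Host ("Device found: " + $device.Name)
            # $device.FlushCache()   # placeholder: the real flush method will be in the final script
        }
    }
    catch {
        Write-Host $_ -ForegroundColor Red
    }
    finally {
        $server.Disconnect()
    }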
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Sun Sep 13, 2015 12:47 am

Hi

We only have a single SW node. No HA in use.

Thanks
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Sep 14, 2015 9:34 am

Don't go with a write-back cache in such a case.
craggy wrote: Hi

We only have a single SW node. No HA in use.

Thanks
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
