DataOn-Starwind Issue

Software-based VM-centric and flash-friendly VM storage + free version


mphilli7823
Posts: 16
Joined: Wed Oct 31, 2012 7:41 pm

Sat Dec 13, 2014 3:52 pm

Hello my fellow starwinders!

We are currently running a 2-node HA cluster using two DataOn DNS-1660 60-bay JBODs. The storage is configured as a 40-drive RAID 50 with hot spares.

Losing a drive periodically is to be expected, which is why we use redundancy and hot spares. However, when I lose a drive, StarWind forces a full sync of all of my HA images, and I lose iSCSI access to that node for the duration of the full sync. I opened a ticket with StarWind support and they said: "This is a normal behavior, since after one drive failure storage array performance degradation is observed, and it cannot be compared to performance of healthy array on the other node, so synchronization is the only way to fix it".

OK, so I am curious whether other users out there see this same behavior when losing a drive in their DataON JBODs. Currently we only have about 8 TB of HA images, so a full sync only takes about an hour, but my fear is what the sync time will be once we have 20 TB, 40 TB, or 60 TB. Imagine if we had 40 TB and the full sync took 3-4 hours. That means 3-4 hours of running on ONE NODE, during which any further issue would bring my entire operation to a halt. This just seems totally asinine to me; I have used storage from other vendors such as Compellent, EMC, and NetApp, and with those, losing a drive is a minor event.
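To put rough numbers on that fear, here's a back-of-the-envelope sketch. It assumes full-sync time scales linearly with the amount of HA image data and that the effective sync rate stays at what we see today (roughly 8 TB per hour); both assumptions are mine and may not hold in practice.

```python
# Rough full-sync time estimate, scaled linearly from what we observe today
# (~8 TB of HA images syncing in ~1 hour). Illustrative numbers only.

observed_tb = 8          # TB of HA images today
observed_hours = 1.0     # observed full-sync duration for that data set

rate_tb_per_hour = observed_tb / observed_hours

for capacity_tb in (20, 40, 60):
    hours = capacity_tb / rate_tb_per_hour
    print(f"{capacity_tb} TB -> ~{hours:.1f} h running on a single node")
```

If the rate really does stay flat, 40 TB works out closer to 5 hours of single-node operation, so my 3-4 hour guess may even be optimistic.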

At this point I don't know whether the blame lies with DataOn or StarWind; I am just curious if other users out there share our pain. If so, have you found any good workarounds?

**I just read a forum post in which StarWind doesn't recommend using DataON JBODs, as they see the most issues with them. That post came about 18 months too late for us, as we are already invested in their product. Of course, we will keep that in mind for our next hardware refresh cycle.
http://www.starwindsoftware.com/forums/ ... aon#p22192
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Dec 13, 2014 5:34 pm

A few remarks:

1) When you lose a drive in a RAID50 set, the virtual LUN should not go off-line. RAID50 survives a single drive failure, and you can do an in-line rebuild at the RAID level without the whole node going off-line. Do you actually mean a DOUBLE drive failure here, and thus complete loss of the data store?

2) When something goes wrong, StarWind keeps a log on the alive node(s). When the off-line node wakes up and comes back on-line, we check how much time we would spend on a) a log-based rebuild (fast sync) or b) a complete "seed" (full sync), and go with the faster one. If the node was off-line for a long time and the log has grown very big, a fast sync can actually turn out slower than a full one... BUT. Many customers complain about this, and there were some issues in our implementation, so I'll ask the engineers to take a closer look at your case. Making a long story short: you SHOULD NOT see a full re-sync every time a node goes off-line. If that happens, it's a StarWind implementation issue rather than a "flaw-by-design" one (there's a sketch of this decision at the end of this post).

3) NetApp and EMC are very different from StarWind, and for a reason: they run dual controllers and a shared backplane for storage, while StarWind runs completely independent nodes (controllers) and completely independent storage pools. With EMC or NetApp (or others), there are controllers with cache memory in front of a SINGLE RAID50 (or whatever pool). StarWind has TWO (or more, depending on the configuration) pools. Making a long story short: we provide better fault tolerance but need more hardware for it.

4) There are three ways for you to go:

a) Allow StarWind engineers to jump on your config and see why the node goes off-line when it really should not. We'll fix this if it's our issue and not a double failure at the node, in which case the data on that partner would be lost completely.

b) You can do triplication, so with a single unit being AWOL you still run a fault-tolerant configuration. Drop me a line and I'll offer it to you as a free upgrade (it's Xmas time).

c) StarWind is moving to vVols and other things (object storage), so soon you're not going to see a single LUN; instead you'll manage storage on a per-VM basis (management would be similar to Tintri, but the implementations are ABSOLUTELY different).

5) In this particular case, DataOn (or should I say Quanta? I don't really know who is OEM-ing what in which direction...) has nothing to do with the issue. Every single point of hate belongs to StarWind :)
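To illustrate point 2 above, here is a minimal sketch of the fast-vs-full decision. The function name, rates, and numbers are illustrative assumptions, not the actual StarWind implementation.

```python
# Illustrative only: pick fast sync (replay the change log) or full sync
# (re-seed the whole image), whichever is estimated to finish sooner.

def choose_sync_mode(log_bytes, image_bytes, replay_rate, seed_rate):
    fast_sync_eta = log_bytes / replay_rate    # time to replay logged changes
    full_sync_eta = image_bytes / seed_rate    # time to copy the entire image
    return "fast" if fast_sync_eta <= full_sync_eta else "full"

# Short outage, small log -> fast sync wins.
print(choose_sync_mode(log_bytes=50e9, image_bytes=8e12, replay_rate=1e9, seed_rate=2e9))   # fast
# Very long outage, huge log -> a full re-seed is quicker.
print(choose_sync_mode(log_bytes=12e12, image_bytes=8e12, replay_rate=1e9, seed_rate=2e9))  # full
```

The point is that a healthy setup should land on "fast" for routine events like a drive swap or a planned reboot; a full sync every time is the exception, not the rule.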
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

mphilli7823
Posts: 16
Joined: Wed Oct 31, 2012 7:41 pm

Wed Dec 17, 2014 7:36 pm

1. From the hardware perspective, when we lose a drive in our RAID50 everything works as designed. Once a drive fails, the system replaces it with a hot spare, and we actually have two hot spares in each system. So we have never seen, and hopefully will never see, a simultaneous two-drive failure.

2. If I do planned maintenance on a node, say Windows patches, and I stop the StarWind service and reboot after patching, the system always does a fast sync. The only time I ever see a FULL SYNC is when we lose a drive.

4. I would like to proceed with having some of your engineers take a look at our system to see if they can figure out what's going on. Option 4b is not viable at this time, because introducing another node involves additional capital investment. I will open another ticket and reference this forum post.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Fri Dec 19, 2014 5:07 pm

Hi! Just wanted to let you know that we have received your request; we will schedule the remote session with you via email.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
barrysmoke
Posts: 86
Joined: Tue Oct 15, 2013 5:11 pm

Mon Mar 30, 2015 10:28 pm

I'm interested in what happened with this large storage issue...any updates?
Vladislav (Staff)
Staff
Posts: 180
Joined: Fri Feb 27, 2015 4:31 pm

Wed Apr 01, 2015 4:13 pm

The update is on the way, but we expect it in May.