Discovered HA issue in version 5.5 final release

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

sls
Posts: 17
Joined: Fri Jun 18, 2010 6:16 pm

Fri Jan 07, 2011 7:56 pm

We've discovered an issue in the 5.5 final release. The 5.5 release notes list the following improvement:

High Availability: If both nodes were down and then full sync has been initiated, device becomes available for client connections immediately. Requests are processed by the node which is selected as synchronization source.

According to this release note, when the HA target is in the full sync stage, the data should be accessible from the primary node. We tested this carefully in our test environment and found that this feature doesn't work.

Here is what we have done so far.

1. Created a new HA target.

2. Started a full sync of the partner target.

3. While the partner was being full synced, we attempted to create a new datastore on the HA target being synced and got an error. The datastore cannot be created while the HA target is undergoing a full sync. We tried this many times on different ESX 4.1 servers and got the same result.

4. After the HA target was fully synced, we could successfully create the new datastore as normal.

5. After we created the new datastore in ESX, we went back into the StarWind GUI and forced the HA target to do a full sync again, to simulate both nodes going down and coming back up into the full sync stage.

6. After we forced the HA target to do a full sync, we could browse the contents of the HA target datastore from any ESX server, so it seemed like the HA function was working. However, when we attempted to vMotion a 4GB virtual machine into the HA datastore, the process just hung and eventually failed with a timeout error.

7. After the HA target was fully synced again, we could successfully vMotion a virtual machine into the datastore.
Based on our test results, the HA target is not fully operational while it is in the syncing stage.

This is going to be a deadly threat in a production environment. Just imagine you have production data on an HA target and one node goes down because of a multiple-disk failure. When you rebuild the failed node, add it back online, and perform a full sync with the active node, the data on the active node becomes inaccessible. You have to wait until the target is fully in sync again.

I think we can accept that data access is slower during the full sync stage, but data being inaccessible is not acceptable in an HA setup.
StarWind's software engineers need to fix this limitation ASAP in order to make the HA feature truly High Availability. :idea:
Constantin (staff)

Mon Jan 10, 2011 2:17 pm

As written in the changelog, this mechanism only kicks in if both nodes were down, i.e. the HA device was not synchronized.
Constantin (staff)

Tue Jan 11, 2011 12:06 pm

Let me explain my previous post in more detail, because it was really short. :)
You created a new HA volume and chose full sync. What happens here: you create a new HA device that has not been initialized and has never been synchronized before, and it then performs a full synchronization. In this case the full sync fills both images with zeros. That's why you were not able to create a datastore on such a target. After the targets have finished being filled with zeros, everything works as it should. To avoid filling the images with zeros, choose the "not to synchronize" option. This creates a fully functional HA device without filling the images with zeros.
Also, when you simulate a failure by running a full sync, that is where this new function begins to work, and during the sync you have access to one node.
The situation where vMotion doesn't work can be caused by low overall performance of the environment. You can increase the timeout in the following way:
To modify fsr.maxSwitchoverSeconds:

Right-click the virtual machine and click Edit Settings.
Click Options > Advanced > General.
Click Configuration Parameters.
From the Configuration Parameters window, click Add Row.
In the Name field enter fsr.maxSwitchoverSeconds, and in the Value field enter the new timeout value.
Click OK twice to save.
Restart the virtual machine for the changes to take effect.

You can also modify the .vmx file manually and add the following line to the file:

fsr.maxSwitchoverSeconds = "<value>"
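
For anyone who prefers to script the manual edit rather than click through the GUI, here is a minimal Python sketch of the same change. The .vmx path and the 300-second value are placeholders only, and it assumes the VM is powered off and restarted afterwards so the new setting takes effect:

# Minimal sketch: add (or replace) fsr.maxSwitchoverSeconds in a .vmx file.
# The path and value below are examples only - adjust them for your environment.
VMX_PATH = "/vmfs/volumes/datastore1/TestVM/TestVM.vmx"  # placeholder path
KEY = "fsr.maxSwitchoverSeconds"
VALUE = "300"  # example timeout in seconds

with open(VMX_PATH) as f:
    # drop any existing entry so we don't end up with duplicates
    lines = [line for line in f.read().splitlines() if not line.startswith(KEY)]

lines.append(f'{KEY} = "{VALUE}"')

with open(VMX_PATH, "w") as f:
    f.write("\n".join(lines) + "\n")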
sls
Posts: 17
Joined: Fri Jun 18, 2010 6:16 pm

Thu Jan 13, 2011 5:49 am

It doesn't matter if both nodes are down or not. We've done the following further test.

1. Created a new VM on a datastore served by an HA target.
2. Ran IOmeter with 100% writes on the VM and let it run continuously.
3. Restarted the secondary node.
While the 2nd node was restarting, the IOmeter speed jumped from 30MB/s to 50MB/s.
4. The 2nd node came back online and started a full sync because too much data had changed and differed from the 1st node.
5. As soon as the full sync process started, the IOmeter speed dropped to 0MB/s (no more IO) and the VM froze.
6. If we attempted to force reboot the VM, it would not start and complained that the VMDK file was not found.
7. We powered off both nodes and restarted them.
8. The HA target would not start a full sync until we forced it to do so.
9. After we brought both nodes back and started the full sync process, we attempted to browse the HA datastore on the ESX server. We could see the contents, but when we created a new folder and attempted to copy a small 20MB ISO file into it, nothing would write.
10. We restarted the ESX server.
11. After the ESX server restarted, we checked the round robin paths on the iSCSI HA target. The path to the secondary node showed as dead, and the path to the primary node showed as active. This is what we expected.
12. When we attempted to browse the HA datastore, the volume was blank. That is scary.
13. We waited another 5 hours until the HA targets were fully synced and browsed the HA datastore again. All the content was back.


Based on our test results, it doesn't matter whether both nodes were down or not. As soon as the HA target is in the full sync stage, it blocks the iSCSI initiator connection from reading and writing to the target. In other words, when the HA target is being full synced, the primary node accepts the iSCSI initiator connection but BLOCKS THE IO operations. This behavior does not qualify as "HA".

We expect the HA setup to continuously serve data as long as one of the nodes holds the most up-to-date data. From a design perspective, we are looking for a setup where no condition should cause a data interruption; the only acceptable interruption is when both nodes are powered down.
When we set up HA storage, we expect the storage never to go down as long as one node is still running, just like RAID-1. However, based on our tests, when both nodes are running everything is happy, but when one of the nodes is down or rebooting it becomes a disaster, because the HA target blocks IO operations when it needs a full sync. The full sync takes 5 hours on a 2TB image file.

Please have the software engineers go back to the code to find out why the "HA" function is behaving like this, and please fix it. We still have the setup in our environment. Let us know if you need the logs from our servers.
Constantin (staff)

Thu Jan 13, 2011 4:01 pm

Well, running a high load in a VM with IOMeter on the storage while HA is healthy is fine. When you shut down one node, performance grows because we don't have to wait for a reply from the second node. When you begin a full sync while loading the VM with IOMeter, you create a load that is too high for your storage; that's why performance drops to zero. Keep in mind that a sync is a very I/O-heavy process, and during it you are also trying to run another heavy process, so your hardware can't handle it. I'm not sure the speed really was 0 MB/s; I think esxtop would give you more accurate statistics. The VM will also slow down here because of the reservation resets and the ESX paths to the target being rebuilt when the initiator fails to receive a reply from the target.
For point 9 we would need to check the vmkernel logs; ESX can be busy handling reservations.
Point 13: after all, the data was not corrupted, and that's good. :) StarWind is a stable solution, and we constantly perform hardening testing.
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Thu Jan 13, 2011 4:14 pm

Hi sls,

I don't know why you are getting the problem you are seeing, but it doesn't reflect my experience at all. I tested 5.5 (and betas of 5.5) extensively and am now in production with multiple Hyper-V VMs running off multiple CSVs on multiple StarWind HA targets. Part of my testing involved a continuous write/verify task that forced a fast sync twice an hour, with multiple TB written over several days at speeds often exceeding your IOmeter test. I only had to do full syncs after much more extended periods. Just today I've had to do four resyncs due to the latest patches from Microsoft (I run StarWind as a VM, so I have to patch host and guest, and each requires a shutdown of the StarWind VM). All of these resyncs were fast - a matter of a few seconds after they started.

Let's try to rule out some obvious things...

1) StarWind build - I am on 20101029.
2) Are you using separate networks for sync and initiators?
3) Is the heartbeat on a different network from sync?
4) Is there space on the drives where your IMGs are held for the FastSynchLog.dat files?

I can access HA targets during a full sync, but performance isn't good - this is only to be expected, as the sync demands are competing with the clients, and when you've got lots of VMs using an HA target for their boot drives, well, forget it. If you know a full sync is going to be necessary, and you have to do heavy I/O on the target at the same time as the sync, you really should move the data to another target first. I'm sure VMware can do this - in Hyper-V land, "quick storage migration" in SCVMM 2008 R2 can move the VHDs from one location to another while keeping the VM running.

From information provided by Alex on the beta forum, I came up with this ready reckoner for when you won't be able to do a fast sync:

http://www.starwindsoftware.com/forums/ ... tml#p12055

Your targets are 2TB, so your fast sync log needs to be about 395MB and should be good for about 32 million write ops before you need to do a full sync.
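
If it helps to generalise that, here is a rough back-of-the-envelope Python sketch that simply scales the figures above (2TB target, ~395MB log, ~32 million write ops) linearly to other target sizes. The linear scaling is my own assumption, not an official StarWind formula, so treat the output as an estimate only.

# Rough fast-sync log estimator, scaled linearly from the figures quoted above
# (2TB target -> ~395MB log, ~32 million write ops before a full sync is needed).
# Linear scaling is an assumption, not a documented StarWind formula.

REF_TARGET_TB = 2.0
REF_LOG_MB = 395.0
REF_WRITE_OPS = 32_000_000

def fast_sync_log_estimate(target_tb):
    """Return (approx. log size in MB, approx. write ops before a full sync)."""
    scale = target_tb / REF_TARGET_TB
    return REF_LOG_MB * scale, REF_WRITE_OPS * scale

for size_tb in (0.5, 1, 2, 4):
    log_mb, ops = fast_sync_log_estimate(size_tb)
    print(f"{size_tb} TB target: ~{log_mb:.0f} MB log, ~{ops / 1e6:.0f}M write ops")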

hope this helps,

Aitor
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Jan 13, 2011 4:53 pm

We're checking this particular scenario at this moment.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

megacc
Posts: 9
Joined: Tue Apr 14, 2009 8:17 pm

Fri Jan 14, 2011 3:04 pm

Hi sls,

Same here, as soon as we upgraded to SW 5.5 and started to use HA WB cache devices:
1 - can't access the healthy node during either a full or fast sync.
2 - the sync process is very, very slow compared to SW 5.3 (no more than 3% of the 10G NIC).
3 - very slow performance when both nodes are healthy and connected to the iSCSI initiator client (no more than 6% of the 10G NIC).
4 - the SW service sometimes restarts for no reason.

Compared to SW 5.3 (build 20100323):
1 - you can access the healthy node during a fast or full sync with good I/O performance.
2 - the sync process is very fast (30%-35% of the 10G NIC).
3 - fast and stable performance when iSCSI clients connect to a healthy HA device (we were able to hit 42% of the 10G NIC).
4 - SW 5.3, although it doesn't have the WB cache option (which is why we were testing SW 5.5 to gain more I/O performance), is far more stable, fast, and reliable than SW 5.5.

We are now using SW 5.3 and are very happy with it, so I recommend SW 5.3 to you, as it is the most stable and reliable version we have tested so far.
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Jan 15, 2011 7:38 pm

1) You can have LIMITED access to the storage node while the whole storage cluster is not healthy (as there's a huge load on the I/O subsystem, taking tons of CPU cycles and network bandwidth). But if you have NO access, that's a failure you should report to support (and we'll fix it).

2) We've at least doubled sync speed compared to 5.3/5.4, so if you're seeing a different picture, again please report it to support so we can jump on a remote session and see what's wrong.

3) Again, with the new write-back cache, 5.5 is at least 1.5 times faster than the old versions. So something is going terribly wrong on your side, and I'd like to know WHAT exactly, if you don't mind.

4) And here I'm confused. The StarWind service has NO "restart" feature. Even if we imagine the service had crashed, there's nothing to make it start again. So what exactly happens? Can you describe it?
megacc wrote: Hi sls,

Same here, as soon as we upgraded to SW 5.5 and started to use HA WB cache devices:
1 - can't access the healthy node during either a full or fast sync.
2 - the sync process is very, very slow compared to SW 5.3 (no more than 3% of the 10G NIC).
3 - very slow performance when both nodes are healthy and connected to the iSCSI initiator client (no more than 6% of the 10G NIC).
4 - the SW service sometimes restarts for no reason.

Compared to SW 5.3 (build 20100323):
1 - you can access the healthy node during a fast or full sync with good I/O performance.
2 - the sync process is very fast (30%-35% of the 10G NIC).
3 - fast and stable performance when iSCSI clients connect to a healthy HA device (we were able to hit 42% of the 10G NIC).
4 - SW 5.3, although it doesn't have the WB cache option (which is why we were testing SW 5.5 to gain more I/O performance), is far more stable, fast, and reliable than SW 5.5.

We are now using SW 5.3 and are very happy with it, so I recommend SW 5.3 to you, as it is the most stable and reliable version we have tested so far.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
