What am I missing regarding 2 Node HA

goldserve · Fri Mar 02, 2018 6:21 pm

I currently have two HA nodes set up, with Hyper-V #1 as first priority and Hyper-V #2 as second. Everything was synchronized and working well. I simulated a failure by rebooting Hyper-V #2, and everything survived. However, I came home to a complete failure of the ESXi host housing Hyper-V #1 and found my StarWind datastore down as well. I had to mark Hyper-V #2 as synchronized (manually; luckily the GUI was still working), and then the datastore came back, but all the VMs running off it had of course halted.

Question: Isn't StarWind supposed to be resilient to this kind of failure, and why do I have to manually mark hosts as synchronized? I am still evaluating the solution and performance is great, but this is the second time the HA part of the equation has been less than ideal.

Thanks!

Mon Mar 05, 2018 3:49 pm

Hi goldserve,
Could you please clarify: is this nested virtualization?
How is it that, after rebooting only the second node, it was available for "Mark as Synchronized"? StarWind does not allow marking either side as synchronized while one of them is still synchronized. We need a more detailed explanation.
Most probably the datastore just lost its path. After the reboot you need to manually rescan all HBAs, or configure a rescan script that does it automatically so you can skip this step.
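
For the automatic rescan, here is a minimal sketch of what such a script could look like. It is not an official StarWind script. It assumes a Python 3 interpreter is available wherever `esxcli` can be invoked (for example the ESXi shell), and the function name `rescan_all_hbas`, the retry count, and the delay are hypothetical choices for illustration. The only ESXi-specific part is the `esxcli storage core adapter rescan --all` command, which rescans every storage adapter on the host.

```python
import subprocess
import time
from typing import Callable, List

# ESXi command that rescans all storage adapters (HBAs) on the host.
RESCAN_CMD: List[str] = ["esxcli", "storage", "core", "adapter", "rescan", "--all"]


def rescan_all_hbas(
    run: Callable = subprocess.run,
    attempts: int = 3,            # hypothetical default, tune for your setup
    delay_s: float = 10.0,        # hypothetical pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the rescan command, retrying on a non-zero exit code.

    Returns True as soon as one attempt exits with code 0,
    or False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = run(RESCAN_CMD, check=False)
        if result.returncode == 0:
            return True
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts:
            sleep(delay_s)
    return False
```

Calling `rescan_all_hbas()` once the StarWind VM is back up (from a startup or scheduled task, for instance) would replace the manual rescan. The `run` and `sleep` parameters exist only so the retry logic can be exercised without touching a real host.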

goldserve · Mon Mar 05, 2018 7:20 pm

Let me clarify:

I have two physical ESXi servers, connected via 40G Ethernet for sync and a 10G switch for heartbeat.

I installed Hyper-V and then StarWind on it so I don't need to pay for any license. They share storage back to the ESXi host via iSCSI. ESXi Host #1 with Hyper-V (SW) #1 is first priority and ESXi Host #2 with Hyper-V (SW) #2 is second priority.

I tested Host 2 going down, and my Host 1 still had access to the ESXi datastore shared from Host 1 Hyper-V. Recently, Host 1 went down for other reasons and I found Host 2 could not access the datastore provided by Hyper-V 2. I had to manually set Hyper-V 2 to "Mark as Synchronized", and then Host 2 ESXi could access the datastore. I have multipathing working and everything, so what gives?

Thanks!

Tue Mar 06, 2018 11:53 am

I have one additional question. Do you have L1 cache configured for StarWind disks?
If yes, a full sync started after you turned the second host back on, to avoid any data loss. StarWind was still synchronizing when the first host went down, so it marked the device as "not synchronized" on both sides.
Marking the second host as synchronized was not the best decision, because of the risk of data corruption: you can select the "wrong" side, the one without the latest data. That is why you need to check the logs on both sides, find out which side has the latest data, and only then mark it.
You can find more information there about the reasons for full synchronization.
Also, can you please PM me the logs from both VMs to make sure I understand you correctly? You can use this tool.