(Problem) Hyper-V 2-node failover cluster, StarWind VSAN Free

erikdrvn · Thu Oct 03, 2019 8:34 pm

I have configured a 2-node cluster with Windows Server 2016 Standard Edition, with the Hyper-V role installed and the failover clustering feature,
with the heartbeat configuration. I have 4 NICs on both servers: NIC1 is connected to the main switch,
NIC2 and NIC3 are connected directly between the servers.
The 2 servers have identical hardware and BIOS revision/configuration.
I have used the first and the second NIC for "synchronization and heartbeat" and the third for "heartbeat".
I have created 2 targets: one named CV01 (2 TB) for use as the clustered volume, and one named "Quorum" (1 GB) used as the witness needed by Hyper-V for the quorum policy.
When I created the cluster, it automatically selected the target "Quorum" as the witness, as expected.
I created a VM called VM2 as a role, with the virtual disk files located inside the clustered volume "CV01", and if I live migrate the VM from one node to the other, it works.
Everything seems to work: if I restart one node normally, the VM automatically migrates to the other node as expected, and when the node comes back up after the reboot, I can see in the StarWind console that it automatically synchronizes the 2 nodes.
But if I simulate a failure on one node (I disconnect all 3 LAN cables at the same time) and then shut down the node,
then reconnect all 3 LAN cables and power the node on again, the 2 nodes do not synchronize. The network interfaces are marked as connected on both nodes in the management console, but the nodes remain marked as "not synchronized". The same happens if I shut down one of the nodes abruptly (holding the power button for 4 seconds).
If I click "Mark as synchronized" on one of the nodes, the other node starts synchronizing,
and after an hour or so everything seems to be synchronized.
But I don't understand why, when one node fails and comes back on, everything does not autosynchronize.
I am trying to upload the logs taken with the log collector, but the archive is 2.5 MB and seems to be too big to upload. How can I upload it correctly?
Thank you in advance, and sorry for my English.

Fri Oct 04, 2019 7:35 am

erikdrvn,

First of all, you need to understand that pulling out ALL cables simultaneously is risky, as the chances of getting a split brain in your environment are really high. Usually, for any production environment, we recommend having heartbeat/sync links on different physical NICs. Let's say you've got a quad-port 1Gbps card and a dual-port 10Gbps one. In this case you would use the 10Gbps ports for iSCSI and Sync respectively, 1 port for each of the links. In addition to them, the best practice would be to configure an additional heartbeat link over at least one port of your quad-port card to prevent split brain if one of your 10Gbps cards dies.
In your scenario, as you have only a quad-port card, this does not protect you if the network card in one of the servers goes down at the physical level (not at the NIC port level). This is exactly what you are trying to simulate, and if you continue doing so you will most likely end up corrupting your data in the CSV.
If you initiate a graceful shutdown/reboot of one of your hosts, the VM(s) running on it will be migrated to the partner host, provided the partner host has enough resources to run them.
If you perform a hard power-off of one of your servers, the VMs it runs will fail. After the period of time indicated in the cluster settings (240 seconds by default), the partner host will attempt to bring up the failed VMs, provided the failed host is still inaccessible. This default period of 240 seconds can be adjusted via PowerShell as needed.
To confirm that storage migration works fine when one host goes down, you need to make sure your storage is connected as advised in https://www.starwindsoftware.com/resour ... erver-2016
Let me know if you need more information on this.

erikdrvn · Fri Oct 04, 2019 8:12 pm

Hi, thank you for the reply.
So you are saying that there is nothing strange in my configuration (apart from the risk that if the LAN card goes down, the nodes have no other LAN card, so they go out of sync)?
So if one of the nodes goes down ungracefully, it is normal that the target does not autosynchronize, and I have to do it manually?
Yes, it could be useful to reduce the time to less than 4 minutes for the autostart on the other node. I am searching for the PS commands; if you have a link or any tips, it will be appreciated.
Another thing I would like to configure, if possible: if the cable connecting NIC1 (which is shared with the virtual switch) is unplugged on the node where the VM is running at that moment, the clustered VM becomes inaccessible on the network, but the cluster stays up because of the 2 direct cables, and the VM is not migrated to the other node automatically.
I would also like to ask if it is possible to configure an e-mail alert, or some other kind of alert, to know if one of the nodes is out of sync or down.
Thanks in advance

Mon Oct 07, 2019 8:39 am

erikdrvn wrote: So you are saying that there is nothing strange in my configuration (apart from the risk that if the LAN card goes down, the nodes have no other LAN card, so they go out of sync)?

Correct. I do not see anything wrong except for non-redundant NICs.

erikdrvn wrote: So if one of the nodes goes down ungracefully, it is normal that the target does not autosynchronize, and I have to do it manually?

No, this is not correct. Once the partner node comes online, all disks are expected to start autosync. If yours do not start, this is unexpected behavior.

erikdrvn wrote: Yes, it could be useful to reduce the time to less than 4 minutes for the autostart on the other node. I am searching for the PS commands; if you have a link or any tips, it will be appreciated.

Code:

(Get-Cluster).ResiliencyDefaultPeriod = <value in seconds>

Replace the placeholder with whatever value (in seconds) you need.
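
As a minimal sketch of the command above, assuming a 60-second period (the value 60 is only an illustration, not a recommendation), you can read the current setting first, change it, and then read it back. This would be run in an elevated PowerShell session on one of the cluster nodes:

```powershell
# Load the failover clustering cmdlets (present once the Failover Clustering tools are installed).
Import-Module FailoverClusters

# Show the current cluster-wide resiliency period, in seconds (240 by default).
(Get-Cluster).ResiliencyDefaultPeriod

# Set a new period. 60 is an assumed example value, pick what suits your environment.
(Get-Cluster).ResiliencyDefaultPeriod = 60

# Read the value back to confirm the change was applied.
(Get-Cluster).ResiliencyDefaultPeriod
```

A shorter period means the partner host tries to restart the failed VMs sooner, at the cost of reacting to brief, transient outages as well.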

erikdrvn wrote: Another thing I would like to configure, if possible: if the cable connecting NIC1 (which is shared with the virtual switch) is unplugged on the node where the VM is running at that moment, the clustered VM becomes inaccessible on the network, but the cluster stays up because of the 2 direct cables, and the VM is not migrated to the other node automatically.

To configure multiple cluster networks (I hope I got your point correctly), all cluster networks would need to have access to the DNS server(s). If the links are direct, enabling cluster communication over such links will not have a positive effect.

erikdrvn wrote: I would also like to ask if it is possible to configure an e-mail alert, or some other kind of alert, to know if one of the nodes is out of sync or down.

This is available with commercial licenses, yet cannot be easily configured in case of a free license. You may want to have a look at the Windows Application event log for certain events and based on those trigger some scheduled tasks sending out emails to you.
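
The event-log approach could be sketched roughly as follows. This is only an illustration under several assumptions: the filter on provider names containing "StarWind" is a guess (check the Application log on your nodes for the actual source name and the event IDs you care about), and the SMTP server and mail addresses are placeholders.

```powershell
# Look back over the last 15 minutes (assumed to match the scheduled task interval).
$since = (Get-Date).AddMinutes(-15)

# Collect recent Error (2) and Warning (3) events from the Application log.
# -ErrorAction SilentlyContinue suppresses the error Get-WinEvent raises when nothing matches.
$events = Get-WinEvent -FilterHashtable @{
    LogName   = 'Application'
    Level     = 2, 3
    StartTime = $since
} -ErrorAction SilentlyContinue |
    # ASSUMPTION: the event source name contains "StarWind"; verify this on your system.
    Where-Object { $_.ProviderName -like '*StarWind*' }

if ($events) {
    # Build a plain-text summary of the matching events for the mail body.
    $body = $events |
        Format-List TimeCreated, Id, ProviderName, Message |
        Out-String

    # Placeholder SMTP server and addresses; replace them with your own.
    Send-MailMessage -SmtpServer 'smtp.example.local' `
        -From 'cluster-alerts@example.local' `
        -To 'admin@example.local' `
        -Subject "StarWind events on $env:COMPUTERNAME" `
        -Body $body
}
```

Saved as a script on each node, this could be run every 15 minutes from Task Scheduler, so that each run only reports events logged since the previous one.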

erikdrvn · Mon Oct 07, 2019 9:03 am

OK, thank you. I tried to reduce the resiliency time and it worked perfectly.
Another thing I would like to try is to "mark as synchronized" one node from PowerShell. Because the trial license will expire soon, I would like to simulate a total blackout and a recovery. I don't understand how to use the PowerShell commands properly, and in the "examples" I can't find "Mark as synchronized".
Thanks

Mon Oct 07, 2019 9:25 am

First of all, please note that the trial license will not become a free one once it expires. You will need to rebuild your StarWind environment with a free license.

As for marking a device as synchronized, you need to check SyncHADevice.ps1:
a. If you need to run synchronization for the selected device from the current node to the partner, just run the script for the device in question (default value is HAIMage1);
b. If you need to mark the device as synchronized, comment out

#$device.Synchronize([SwHaSyncType]::SW_HA_SYNC_FULL, "")

and uncomment

Code:

$device.MarkAsSynchronized()

Alternatively, check SyncHaDeviceAdvanced.ps1 for more options.

erikdrvn · Mon Oct 07, 2019 9:51 am

I already have a free license.
I changed the license on the 2 nodes
(select the server, then click on registration, then "Modify" on the right; now it seems to be OK, with no errors, and it says that the free license is registered). I did it to understand how the PowerShell commands work.
So now I have the two nodes on the free license and the trial license is completely removed. When the expiry date of the old trial license (which I am not using anymore) arrives, will the cluster crash?

Mon Oct 07, 2019 10:16 am

If you have already replaced the license, it is not expected to crash. It would crash only if the license were still a trial one.