Recovering from a Node Failure

njs102090
Posts: 2
Joined: Wed Feb 28, 2018 1:45 pm

Wed Feb 28, 2018 8:29 pm

We are testing vSAN FREE using a 2 node disaggregated model.

I have 2 storage nodes in a FOC running StarWind. On top of the StarWind image is a CSV, and on the CSV lie my HA shares, which are presented to the Hyper-V nodes via a SOFS.

I have not yet found any specifics about the process the system undergoes when recovering from a node failure. We unplugged a node for 2 hours, during which I added 2 additional VMs to the CSV. I then plugged the node back in for 1 hour, and then unplugged the opposite node. My new VM folders were removed from the CSV entirely and Hyper-V could no longer contact the VM storage.

We desperately need an accurate way to monitor the resynchronization once a failed node resumes functioning. I have used several variations of the GetHaSyncStatus PowerShell script that comes with StarWind and, with the exception of the initial sync, the script ALWAYS reports status: Synchronized on both nodes. Clearly that isn't the case.


When a node fails for 2 hours then is brought back online, WHAT happens? And HOW can I monitor the process?
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Thu Mar 01, 2018 8:28 pm

Unfortunately, your message does not make clear whether your testing has lasted longer than one month, the period during which StarWind VSAN Free offers the StarWind Management Console (GUI) for monitoring the status of StarWind devices and much more.
Basically, there are two ways to monitor the status of the StarWind devices:
1. GUI (most likely, not applicable to your case)
2. StarWindX PowerShell library, which is shipped with every StarWind VSAN installation and is intended for managing the installation from the command line. The StarWindX folder inside the StarWind VSAN installation folder contains sample scripts that can give you an idea of how certain actions can be performed using PowerShell. As StarWind VSAN Free is self- and community-supported software, you are free to adjust the sample scripts until they meet the requirements of your particular environment. Otherwise, you can search the forum for possible solutions (e.g. scripts) that community members have published.
Yet, keep in mind the rule of thumb when working with StarWind VSAN, be it the free or the commercial version: you need to ensure that StarWind devices are synchronized on both nodes before shutting down any storage node. In brief, I could describe everything as follows:
1. Node 1 is synchronized, node 2 is synchronized: safe to stop/restart any StarWind host. Provided your iSCSI connections to the shared storage are configured properly, which implies using multipathing, the compute hosts should be just fine with that.
2. Node 1 is synchronized, node 2 is not synchronized: stop/restart of node 2 is safe, but stop/restart of node 1 is dangerous, as it brings the whole system down. This is because non-synchronized devices do not accept client iSCSI connections until they are fully synchronized.
3. The proper way of shutting down the whole setup, should the need arise, is to make sure all devices are synchronized, stop node 2, and then stop node 1 a couple of minutes after node 2. When you need to power everything up again, switch on node 1 first and then do the same with node 2.
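The three rules above can be modelled in a short Python sketch. It does not call StarWind or StarWindX. The names `SyncState`, `nodes_safe_to_stop` and `power_cycle_plan` are hypothetical, and the sync states are assumed to have been read beforehand from the GUI or a script. Treat it as an illustration of the logic, not as a monitoring tool.

```python
from enum import Enum


class SyncState(Enum):
    SYNCHRONIZED = "synchronized"
    NOT_SYNCHRONIZED = "not synchronized"


def nodes_safe_to_stop(states):
    """Return the nodes that can be stopped while storage stays available.

    `states` maps a node name to its SyncState (hypothetical input).
    A node is safe to stop only if at least one *other* node is synchronized,
    because non-synchronized devices do not accept client connections.
    """
    synced = {name for name, state in states.items()
              if state is SyncState.SYNCHRONIZED}
    return {name for name in states if synced - {name}}


def power_cycle_plan(states):
    """Return (stop_order, start_order) for a full shutdown of the setup.

    Assumes the dict lists the node to keep running longest first.
    The node stopped last is the one started first.
    """
    if any(s is not SyncState.SYNCHRONIZED for s in states.values()):
        raise ValueError("all devices must be synchronized before a full shutdown")
    start_order = list(states)
    stop_order = list(reversed(start_order))
    return stop_order, start_order
```

With node 1 synchronized and node 2 not synchronized, `nodes_safe_to_stop` returns only node 2, which matches rule 2. If no node is synchronized it returns an empty set, since storage is already unavailable.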
Feel free to let me know if you have more questions on this.
njs102090
Posts: 2
Joined: Wed Feb 28, 2018 1:45 pm

Fri Mar 02, 2018 8:31 pm

I have tested it both with the GUI and with PowerShell alone. The latest configuration I am testing uses the GUI.

I have the 2 nodes sharing 2 different 10 GbE connections on 2 different subnets. I've used the GUI to create an HAImage and replicate it to the partner node.

The compute nodes in the cluster are not using iSCSI to access VM storage; they connect via SMB to a SOFS share that resides on the CSV.

Here is my scenario:
- Create StarWind image on Node1, set up inter-node iSCSI connections and create replica image on Node2
- Create CSV from StarWind iSCSI disk and set up SOFS share on the CSV
- Create a VM on a compute node with VM storage pointed to the SOFS share
- Both storage nodes show Synchronized in StarWind GUI
- Unplug both network connections from Node1
- CSV hangs for a few seconds then comes right up on Node2, no downtime.
- Plug Node1 back in: IMMEDIATELY StarWind GUI shows both nodes synchronized
- Wait several hours then unplug Node2, CSV crashes.

My question concerns that last step: after plugging Node1 back in, how can I tell when it is safe again for Node2 to go down?
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Tue Mar 06, 2018 9:35 am

From the explanation you gave, most probably both nodes were marked as "not synchronized" to avoid a split brain. Could you please explain what exactly happened in the StarWind GUI after unplugging the connections from the second node?
You can also add more HeartBeat connections via the management channels, or add a 1 GbE direct connection between the nodes solely for HeartBeat purposes.
EMSIT
Posts: 1
Joined: Tue Mar 13, 2018 3:27 pm

Tue Mar 13, 2018 3:30 pm

Hi - We had a similar issue. We had a power outage and the generator did not kick in. Both nodes in our 2-node free VSAN cluster shut down. After reboot the servers are back online, but the shared volume in Hyper-V is showing as offline. Is there a way to get them synced and back online via PowerShell?
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Wed Mar 14, 2018 9:35 am

From the logs, determine which node powered off last. Even 1 ms can be enough for the StarWind service to understand that its partner became not synchronized. So, based on the logs, determine which node stayed synchronized longer and, using the mark_as_synchronized.ps1 script from the StarWindX PowerShell examples, mark the corresponding targets as synchronized on that node.
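The selection step described above can be sketched in Python. The sketch assumes you have already extracted, from each node's logs, the last moment the node was known to be synchronized. The function name `node_to_mark_synchronized` and the input format are hypothetical. It does not parse StarWind logs or run mark_as_synchronized.ps1.

```python
from datetime import datetime


def node_to_mark_synchronized(last_synced):
    """Pick the node that stayed synchronized longest.

    `last_synced` maps a node name to the datetime at which that node was
    last known to be synchronized (taken from its logs by hand or by script).
    Raises ValueError when the input is empty or the latest timestamp is
    shared, because then the logs do not identify a single node.
    """
    if not last_synced:
        raise ValueError("no nodes given")
    latest = max(last_synced.values())
    candidates = [name for name, ts in last_synced.items() if ts == latest]
    if len(candidates) != 1:
        raise ValueError("timestamps tie; inspect the logs manually")
    return candidates[0]
```

Raising on a tie is a deliberate choice in this sketch: marking the wrong node as synchronized would discard the newer data, so an ambiguous result is better handled by a person reading the logs.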