Cluster crash - Starwind CSV says Error...


BenM
Posts: 35
Joined: Wed Oct 24, 2018 7:17 am

Thu Oct 31, 2019 5:26 pm

Guys,

Some unspeakable person randomly switched off the servers today. We will call one 05 and the other 06.

I get called in... and find StarWind offline. It says "Not Active" for the CSV on both 05 and 06.

According to the cluster event log, the Windows cluster has conveniently removed the most down node (06) from the cluster; however, according to Failover Cluster Manager (FOCM) on 05 it is up, while on 06 FOCM turns up blank.

The CSV is marked as down on node 05.

I don't want to start troubleshooting the cluster (including bringing the CSV back online) until StarWind is happy again.

As far as I can see, the network connections for both cluster nodes are up and functioning as expected from a Windows point of view: 05 shows them as active in FOCM and also in the Network Centre; 06 shows them as active in the Network Centre, but FOCM doesn't show anything.

I have the logs from both nodes ready to upload when I get home...

I hope it's just a case of getting SW to start replicating again and then bringing up the CSV...

Ben
Last edited by BenM on Thu Oct 31, 2019 8:42 pm, edited 1 time in total.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Thu Oct 31, 2019 8:26 pm

BenM

Thu Oct 31, 2019 8:40 pm

Ahhh - Thanks for that. That behaviour seems most sensible.

I will check tomorrow, but will probably mark 05 as synched and let the synch complete... then I just have to resurrect the cluster somehow!

B.
Boris (staff)

Fri Nov 01, 2019 5:00 am

Make sure to mark the correct node after a proper check of the logs. Otherwise the VMs may BSOD.
Once the disks are marked as synchronized, the cluster will get its storage and will become fully operational.
BenM

Fri Nov 01, 2019 8:39 am

Hi,

Just had a look back through the application log on each server. Neither has any StarWind events for up to 3 hours before the idiots switched everything off.

Should I be expecting events from the StarWind service saying that it cannot determine which data is current? Because I am not getting them, before or after the power-off.

Should I be looking for the last synch event from the StarWind service before the power event?

The network was pretty quiet before the power-off event at 1100ish, as we are a school and are on holiday at the moment. I have checked back to 0800 and there are no StarWind events.

Will hunt back further; thanks for your help so far

Ben
BenM

Fri Nov 01, 2019 9:07 am

Hi again,

I assume that because we have SW Free, the option to mark a device as synchronised isn't available through the UI...

I assume there would be PowerShell to do the same thing? (Hints and tips appreciated!)

Also, the article you linked could do with a little explaining (just so I get it right):

"If you know for sure which node has the most recent data – choose it as the synchronization source..."

OK, assume we do know for sure which node has the most recent data. Do we not need to stop the StarWind service before changing the device state? The next paragraph implies that we do...

"If you are not sure which node has the actual data, stop the StarWind service on all nodes, mark the device on one node as Synchronized and check data consistency"

So (and I am thinking aloud here, please bear with me) we should stop the StarWind service on all the nodes, pick one, mark the data as synchronised, start the service and run a consistency check... I assume there is PowerShell to do the consistency check?

If the data is inconsistent, stop the service, move on to the next node, mark it as synched and check consistency. If this works (it should, there are only two nodes), start the StarWind service on the other node, then drink coffee until the synch has completed.

Going back to your previous reply: once the CSV is consistent and synched, the cluster should recover by itself... The Witness volume synched OK and appears in FOCM/StarWind as expected.

05 seems to have a semi-working cluster on it (05 FOCM lists roles, both nodes and Witness/CSV storage; 06 FOCM is empty), so that would probably be a good bet for the consistent data on CSV1 (even though CSV1 is down at the moment).

B.
Boris (staff)

Fri Nov 01, 2019 9:35 am

I believe you would need to confirm the power-off time for both servers first. You should see it in the Windows System logs on both servers. Right before that event, search for any events regarding the stop of the StarWind VSAN service. In the Windows Application logs you will see some messages about the non-synchronized state of the devices. Based on that information you can decide which node to mark as synchronized.
If your case is a hard one, with someone just performing a hard power-off of the servers, you will see event ID 6008 (improper shutdown) in the Windows System logs. Note the time indicated in that event on both servers. After that, compare the time settings of the two servers and take into account any time difference between them. Based on this information, select the node to mark as synchronized. Before doing so, I would recommend stopping and disabling the StarWind service on the other node(s) in the setup so as not to trigger automatic sync. This is done to ensure data integrity even if you mark the wrong node as synchronized. After you confirm the correct node was selected and marked, go ahead and start the partner's StarWind VSAN service.
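As a rough sketch, the log check could be run on each node with the built-in Get-WinEvent cmdlet. The time window and the 'StarWind*' provider pattern below are assumptions for illustration, not values confirmed in this thread; adjust them to what the Application log actually shows.

```powershell
# Unexpected-shutdown records (event ID 6008) from the System log.
# The message text states the time of the previous, improper shutdown.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 6008 } -MaxEvents 5 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message |
    Format-List

# Application-log entries around the power loss.
# ASSUMPTION: the window and the 'StarWind*' provider name pattern are illustrative only.
$from = [datetime]'2019-10-31 07:00'
$to   = [datetime]'2019-10-31 12:00'
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; StartTime = $from; EndTime = $to } -ErrorAction SilentlyContinue |
    Where-Object { $_.ProviderName -like 'StarWind*' } |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, Id, LevelDisplayName, Message
```

If the second query returns nothing, it may be worth listing the distinct ProviderName values in the same window first, since the provider pattern is only a guess.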
For StarWind VSAN Free, you would need to use the script called SyncHaDevice.ps1 and select the proper action to perform. If you are to mark a particular device as synchronized, make sure to uncomment

Code: Select all

$device.MarkAsSynchronized()
and comment out the preceding line.
Hope this helps, but feel free to ask for any clarifications.
BenM

Fri Nov 01, 2019 10:09 am

Hi Boris,
Sorry for the thinking aloud stuff... but here is some more, which I was (mostly) typing as you replied!

Both servers were switched off by some idiot knocking out the power breakers, so event ID 6008 it is, which indicates a shutdown at 10:02 for node 05 and 10:17 for node 06.
From the event log, 05 restarted at 11:01:31.49 and 06 restarted at 11:01:38.49 (Event ID 12 for the Kernel in the System log).
06 appears to be 4 s fast compared to 05, so 06 definitely booted first (currently 09:53:00 on 05; within 0.5 s, the time taken for the KVM switch to swap over, 06 said 09:53:04).

I will choose 06 to mark as synchronized, especially as 05 was killed 15 minutes before 06.
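The skew-adjusted comparison of the two shutdown times can be sketched in a few lines of PowerShell. The date and the zero seconds are assumptions (the 6008 times above are quoted to the minute only), and the 4-second skew was observed a day later, so treat the result as approximate.

```powershell
# Shutdown times reported by event ID 6008, each on its own node's clock.
# ASSUMPTION: date and seconds are filled in for illustration.
$shutdown05 = [datetime]'2019-10-31 10:02:00'
$shutdown06 = [datetime]'2019-10-31 10:17:00'

# Node 06's clock was observed to run about 4 seconds ahead of node 05's.
$skew06 = New-TimeSpan -Seconds 4

# Express 06's shutdown on 05's clock, then compare the two.
$shutdown06On05Clock = $shutdown06 - $skew06
$gap = $shutdown06On05Clock - $shutdown05

if ($gap -gt [timespan]::Zero) { $lastUp = '06' } else { $lastUp = '05' }
"Node $lastUp stayed up longer by $([math]::Abs($gap.TotalSeconds)) seconds"
```

With these inputs the gap is 14 minutes 56 seconds in favour of 06, so a skew of a few seconds does not change which node ran longest.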

Steps:
1. Stop and disable the StarWind service on 05.
2. On 06, run the following PowerShell:

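Import-Module StarWindX
Enable-SWXLog
# Connect to the local StarWind service; replace the <...> placeholders.
$server = New-SWServer -host 127.0.0.1 -user "<user>" -password "<password>"
$server.Connect()
try
{
    # Look up the HA device by name and force its state to synchronized.
    $device = Get-Device $server -name "<device name>"
    $device.MarkAsSynchronized()
}
catch
{
    Write-Host $_ -foreground red
}
finally
{
    $server.Disconnect()
}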

3. Check the HA device status on 06 until it says synchronized (could be instant, who knows).
3a. Reboot 06.
4. Enable and start the StarWind service on 05 (enable and reboot 05?).
5. Hope the cluster sorts itself out... if SW is back, hopefully everything will resurrect itself.
Boris (staff)

Fri Nov 01, 2019 11:21 am

Do not do 3a, otherwise you will break things again.
Step 3 takes seconds (make sure to do that for all cluster disks). Once it is done, check the cluster, as all drives should come online. After that you can enable the StarWind service on node 05.
Do not reboot 06 until 05 becomes fully synchronized, otherwise you will kill your environment once again and will have to play with it even more.
By the way, why would you want to reboot 06 at all?
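For reference, the service handling on 05 in that order might look like the sketch below, using only built-in service cmdlets. The service name 'StarWindService' is an assumption and should be confirmed on the host before running anything.

```powershell
# ASSUMPTION: 'StarWindService' is the Windows service name of StarWind VSAN on
# these hosts; confirm with: Get-Service -DisplayName 'StarWind*'
$svc = 'StarWindService'

# On 05, before marking anything on 06: keep the partner from starting a sync.
Stop-Service -Name $svc
Set-Service  -Name $svc -StartupType Disabled

# ... on 06, mark every HA device as synchronized and confirm the cluster disks are online ...

# On 05, afterwards: bring the partner back so it resynchronizes from 06.
Set-Service   -Name $svc -StartupType Automatic
Start-Service -Name $svc

# Do not reboot 06 until 05 reports fully synchronized.
```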
BenM

Fri Nov 01, 2019 11:49 am

Hi again,

3a was an afterthought and obviously a bad one.

I was hoping that it would all just work(tm) but sadly no....

On 06 the UI says the partner is not synchronised and that all the heartbeat/sync interfaces are down.
The "not synchronized" part is right: we have just forced the local disk to be synchronized, so I wouldn't expect the partner to be synched yet!
In the UI on 05, the CSV is shown as Not Active and there are no interfaces listed.

The interfaces being down I can't understand, as I can ping both the local and remote addresses of the synch and HB interfaces from 06 and 05...

In the StarWind user interface on both nodes, the Witness disk is shown as online and synchronized. The CSV is shown as "Not Active".

Note that the Failover Cluster Manager on 06 is still empty of config, yet the one on 05 says the CSV is back up.

Ben
Boris (staff)

Fri Nov 01, 2019 1:23 pm

What is the state of StarWind VSAN service on 05? If it's disabled, then all those messages totally make sense. You need to enable the service for the sync to start.
As for the failover cluster manager, it simply did not pull the configuration from the host and thus was not able to pick up the cluster. In the right pane press "Connect to cluster" and hit OK. You should be fine from that point on.
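Both checks can also be made from a PowerShell prompt on each node. This is a minimal sketch: it assumes the StarWind services have display names beginning with "StarWind", that PowerShell 5 or later is in use (for the StartType property), and that the FailoverClusters module is installed with the clustering management tools.

```powershell
# Service state and start mode for anything whose display name starts with "StarWind".
# ASSUMPTION: the display-name prefix is illustrative; adjust if nothing is returned.
Get-Service -DisplayName 'StarWind*' |
    Select-Object Name, DisplayName, Status, StartType

# Cluster view from this node.
Import-Module FailoverClusters
Get-ClusterNode | Select-Object Name, State
Get-ClusterSharedVolume | Select-Object Name, State, OwnerNode
```

If the cluster cmdlets return the expected nodes while Failover Cluster Manager is blank, the console has probably just not connected to the cluster yet.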
BenM

Fri Nov 01, 2019 1:28 pm

On 05 (and 06):

StarWind VSAN is running, Automatic
StarWind SMI-S is running, Automatic (Delayed Start)
StarWind VSS is running, Automatic
StarWind Cluster is running, Automatic

Thanks for the FOCM tip; it's the first time I have needed to do that (in many years, you don't know what you don't know!)

P.S. Once sync has started, is it best to wait for sync to complete before starting the cluster roles again? I would guess so... Academic question, though: somehow the cluster has started roles, yet the StarWind interface still shows the CSV as offline. The latest cluster event trace says that the roles are halfway between online and offline, though those entries were a while ago; I guess that's a symptom of the "undefined" state of StarWind.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Nov 01, 2019 2:33 pm

BenM,

Can you post some screenshots of what you describe? You can do so via PM if you don't want to share them publicly.
Usually, there is no limitation on starting roles while StarWind HA disks are not yet fully synced. You can start the roles if necessary; just make sure they are not impacted by the sync process. If they are, locate the Change Synchronization Priority menu item in StarWind Management Console for the HA device in question and adjust it so that the sync process gets a lower share of traffic compared to the clients.
BenM

Tue Nov 05, 2019 1:10 pm

Just to wrap up the thread: everything is now working as expected.

The delay was caused by end-user inability to follow simple instructions and a slow(ish) synch network (neither of which is a StarWind problem).

Thanks to Boris for tolerating an incompetent user and getting everything working again.

StarWind Free remains, IMHO, a fantastic piece of software which performs even in the most adverse of circumstances.
Boris (staff)

Wed Nov 06, 2019 9:36 am

BenM,

Thanks for the confirmation. It's great to know we were able to bring everything back to its previous state.