Hyper-V Host Update - VSAN Failover Issues


bg_IT
Posts: 14
Joined: Mon Aug 13, 2018 11:22 pm

Sun Feb 06, 2022 10:24 pm

I work for a 24/7 call center and have had great success with StarWind's VSAN product. Being able to move VMs between Hyper-V hosts with minimal downtime and then perform updates has been a lifesaver. A few months ago I had to rebuild a RAID array after a hardware failure. When that was complete, I added the new partner to the VSAN (done with help from this forum - thank you!) and everything has been working well since.

I'm not sure if it is related, but this past weekend I applied Windows updates to the original node (Node2) and failed everything over to the rebuilt node (Node3). Everything was going along nicely: Node2 had been down for about 5 minutes, restarting after the update, when I watched all the VMs go into a non-running state. On the server that was still running, the iSCSI initiator showed the connections to both Node2 and Node3 as reconnecting. I restarted the StarWind service on Node3 with no result. Node2 came back up after 15 or so minutes, but the HAImage that holds most of our VMs (HAImage2) remained down.

./getHASyncState on Node2 and Node3 returned "200 Failed: can't find partner node.."

I couldn't find any network errors; the hosts could ping each other over all the interfaces.
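A successful ping only confirms ICMP reachability; it does not show whether the iSCSI port on the partner accepts TCP connections. The following is a minimal sketch of such a check, assuming the standard iSCSI port 3260 is in use. The function names are illustrative, and any addresses passed in are placeholders for the real sync and iSCSI interfaces.

```python
import socket

# Assumption: targets listen on the standard iSCSI port; adjust if yours differ.
ISCSI_PORT = 3260


def tcp_reachable(host, port=ISCSI_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_links(addresses, port=ISCSI_PORT, timeout=3.0):
    """Map each address to True/False depending on TCP reachability."""
    return {addr: tcp_reachable(addr, port, timeout) for addr in addresses}
```

Calling `check_links` with the partner's address on each interface, from each host in turn, would show whether any single link refuses connections even though it answers ping.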

After about 30 minutes, HAImage2 on Node2 went into a Synchronized state and ./getHASyncState showed Node3 synchronizing from Node2 (just the opposite of what I would have expected).

I have been able to bring all VMs back up and have recovered from the failure temporarily but need to find the cause. Can someone help point me in the right direction?

Thank you,
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Mon Feb 07, 2022 3:16 am

After about 30 minutes the HAImage2 on Node2 went into a Synchronized state and ./getHASyncState showed Node3 to be synchronizing off node2 (just the opposite of what I would have expected)
This can point to storage delays, but I need the logs to see the cause. Please share the logs from all StarWind VSAN hosts and send them here https://knowledgebase.starwindsoftware. ... collector/.
bg_IT
Posts: 14
Joined: Mon Aug 13, 2018 11:22 pm

Mon Feb 07, 2022 2:16 pm

Thank you for your help, Yaroslav. I've emailed both zipped files to support@starwind.com.
bg_IT
Posts: 14
Joined: Mon Aug 13, 2018 11:22 pm

Tue Feb 08, 2022 5:08 pm

Thank you, Yaroslav, for the email. It was very thorough, and I will follow the recommended restart procedure and the additional POA you provided.

I noticed that the time zone was incorrectly set on Node3. Node3 ran for 1 hour before I rebooted it in an effort to resolve the problem. Node2 restart: 20:54 and Node3 restart: 21:52 (corrected for time zone).

Can you tell what caused the Node3 VSAN to become unresponsive in the hour just before the reboot? This would have occurred at roughly 2/5/2022 17:54 in the Node3 logs. I worry about this because in a failure situation I won't be able to follow the proper shutdown procedure.
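Since Node3's time zone was wrong, its log timestamps need to be shifted before they can be lined up with Node2's. Below is a minimal sketch of that conversion. It assumes a fixed 3-hour offset, which is only inferred from the times quoted above (20:54 corrected versus 17:54 in the Node3 logs) and is not confirmed. The function name and the timestamp format are also illustrative.

```python
from datetime import datetime, timedelta

# Assumption: Node3 log time + 3 hours = corrected time (inferred, not confirmed).
NODE3_OFFSET = timedelta(hours=3)


def to_corrected(log_time, offset=NODE3_OFFSET):
    """Parse an 'M/D/YYYY HH:MM' timestamp and shift it by the given offset."""
    return datetime.strptime(log_time, "%m/%d/%Y %H:%M") + offset


# The moment of interest in the Node3 logs, expressed in corrected time.
print(to_corrected("2/5/2022 17:54"))
```

Under that assumed offset, 17:54 in the Node3 logs corresponds to 20:54 corrected, which matches the Node2 restart time.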

Thanks again for your help - I really appreciate it.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Wed Feb 09, 2022 8:49 am

Sadly, VSAN needs to be updated before the logs can tell us more. Logging has improved over time, and if the issue happens again I will be able to learn more from the logs.