Hyper-V Hyper-Converged Cluster Issue

MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Fri May 07, 2021 1:52 pm

Hello! I have a Hyper-V hyper-converged cluster that I am having trouble with. The cluster is two Windows Server 2016 Datacenter nodes with dual 10G uplinks to a switch for management and sync, plus a 1G direct connection between the nodes for heartbeat. The Hyper-V quorum/witness disk is on a third server, but all VM/cluster storage is on the nodes and replicated with StarWind vSAN. The issue is that when one node is rebooted gracefully, the other node crashes. Looking at the Hyper-V logs, it appears that shortly after the first node shuts down, the second node loses access to its cluster disks and panics. Looking at the StarWind vSAN logs, I'm not sure why. The log just notes that it lost sync/heartbeat access and then exits. Thanks in advance for any help.

Starwind log from gracefully rebooted server https://pastebin.com/gEuHXrqp paste password: sYHjWqSbnS

Starwind log from crashed server https://pastebin.com/nvdJtb4g paste password: 4ButwFTaPd

Unlimited License
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Mon May 10, 2021 8:28 am

Hi,

The logs you shared are not enough. Please collect the logs as described here: https://knowledgebase.starwindsoftware. ... collector/.
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Tue May 11, 2021 3:56 am

My bad. See the links below. The logs were too large to attach.

Trooper is the gracefully rebooted machine: https://oc.cshaheen.tech/index.php/s/V2m6NrKURaKVc96

Duper is the crashed machine: https://oc.cshaheen.tech/index.php/s/tNxMSyh83c0BZMD
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Fri May 14, 2021 2:59 pm

The system asks for a username and password. Can you PM me the logs once again, please?
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Fri May 14, 2021 8:10 pm

Fresh logs sent in PM
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Tue May 18, 2021 3:53 am

Thank you for the update. In rare cases, the StarWind VSS provider conflicts with Microsoft VSS and causes issues like yours.
Please remove the StarWind Hardware VSS Provider and repeat the tests.
To remove it, run as administrator: "C:\Program Files\StarWind Software\StarWind\VSS\stop_.bat"
I also found some misconfigurations.
First, I noticed that you use L2 Cache. Please be aware that L2 cache boosts only reads. To make the best use of Flash, please consider re-using the L2 cache SSD for HA devices. See here how to remove the cache: https://knowledgebase.starwindsoftware. ... dance/661/.
Second, I noticed that HA devices are on C:\. Please be aware that we do not recommend using C:\, to avoid possible issues with HA devices on OS reinstall.
Third, please update StarWind VSAN. While updating, select the Server option. That should remove the VSS service.
Finally, 192.168.2.31 and 172.16.0.2 are on the same physical card, which makes your system vulnerable to split-brain. Also, you have a vSwitch on the Sync adapter, which we do not recommend. I would suggest using an additional NIC for the heartbeat (it can be used for anything else as well), and replacing the 192.168.2.x IP used for sync with a dedicated network for sync, NOT THE MANAGEMENT ONE. The script here might be useful: https://forums.starwindsoftware.com/vie ... f=5&t=5080. See more on networking redundancy at https://www.starwindsoftware.com/system-requirements
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Wed May 19, 2021 5:20 pm

> Thank you for the update. Rarely, StarWind VSS provider conflicts with Microsoft VSS and causes issues like yours.
> Please remove StarWind Hardware VSS Provider and repeat the tests.
> To remove it as administrator: "C:\Program Files\StarWind Software\StarWind\VSS\stop_.bat"

I'll try that and get back to you. By tests you mean a graceful shutdown of one node?

> I noticed that you use L2 Cache. Please be aware that L2 cache boosts only reads. To make the best use of Flash, please consider re-using the L2 cache SSD for HA devices.

I have three devices and only one has Flash cache configured, since that device is read heavy. The device is too large for my SSD space, so it sounds like my approach is correct.

> I noticed that HA devices are on C:\. Please be aware that we do not recommend using C:\ to avoid possible issues with HA devices on OS reinstall.

Can you elaborate? I don't see any devices on C. All of my virtual disks are on drive N or M. One of them is mounted via iSCSI to C:\ClusterStorage\Volume1, but that's because Hyper-V's failover cluster requires it.

> please update StarWind VSAN. While updating, select the Server option. That should remove the VSS service.

Will do.

> 192.168.2.31 and 172.16.0.2 are on the same physical card, which makes your system vulnerable to split-brain.

No, they are separate. The 172 is a 1Gb adapter on the mobo and is directly connected to the other node. The 192 adapter is a dual-NIC PCI card. Each of its NICs goes to a different switch (same L2). This is true for both nodes.

> you have vSwitch on the Sync adapter which is not recommended by us. I would suggest using an additional NIC for the heartbeat (can be used for anything else), and replacing the 192.168.2.x IP used for sync with a dedicated network for sync NOT THE MANAGEMENT ONE.

Can you elaborate as to why mgmt traffic and sync traffic can't share an interface? The interface they share is a two-NIC bundle, so splitting the mgmt and sync traffic onto separate NICs would introduce multiple SPOFs across both networks, though it could be done. It just seems backwards.
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Thu May 20, 2021 6:54 am

> By tests you mean a graceful shutdown of one node?

Yes.

L2 cache boosts only reads. Hence, I believe it might be better if you put HA devices on that datastore instead. That is just my thought. I am not pushing you towards this change.

> Can you elaborate? I don't see any devices on C. All of my virtual disks are on drive N or M. One of them is mounted via iSCSI to C:\ClusterStorage\Volume1 but that's because Hyper-V's failover cluster requires it.

While analyzing the headers, I noticed these files.
<storage id="2" name="My Computer\C\Flash-Games\Flash-Games.swdsk" type="device" lun="0x0">
<storage id="1" name="My Computer\C\Flash-Games\Flash-Games.img" type="file">
They are on C:\

> No, they are separate. The 172 is a 1Gb adapter on the mobo and is directly connected to the other node. The 192 adapter is a dual NIC PCI card. Each of it's NICs go to different switches (same L2). This is true for both nodes.

192.168.2.30 MAC address 78-2B-CB-46-95-1E
172.16.0.1 MAC address 78-2B-CB-46-95-1F
This means that they are on the same adapter. I see no IP addresses in the headers, which means that everything is going through one card.

> Can you elaborate as to why mgmt traffic and sync traffic can't share an interface? The interface they share is a two NIC bundle so splitting the mgmts and sync traffic and splitting the NIC to the would introduce multiple SPOFs across both networks but could be done. It just seems backwards.

The Sync adapter cannot be used for anything but sync. See more at https://www.starwindsoftware.com/system-requirements. The DATA links in the headers point to 192.168.2.30 and 192.168.2.31, which are vSwitch IP addresses.
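
For anyone who wants to check their own headers for devices on the system drive, here is a minimal Python sketch. It assumes the header entries look like the two `<storage ...>` lines quoted in this post, with a `name="My Computer\<drive letter>\..."` attribute. The function names are made up for illustration and are not part of any StarWind tool.

```python
import re

# Matches the name="..." attribute as it appears in the quoted header lines.
NAME_ATTR = re.compile(r'name="([^"]*)"')


def drive_letter(header_line):
    """Return the drive letter from a 'My Computer\\<letter>\\...' path, or None."""
    match = NAME_ATTR.search(header_line)
    if match is None:
        return None
    parts = match.group(1).split("\\")
    # Assumption: the path starts with "My Computer" followed by a one-letter drive.
    if len(parts) >= 2 and parts[0] == "My Computer" and len(parts[1]) == 1:
        return parts[1].upper()
    return None


def on_system_drive(header_line, system_drive="C"):
    """True when the entry's path sits on the given drive (C by default)."""
    return drive_letter(header_line) == system_drive.upper()


if __name__ == "__main__":
    lines = [
        r'<storage id="2" name="My Computer\C\Flash-Games\Flash-Games.swdsk" type="device" lun="0x0">',
        r'<storage id="1" name="My Computer\C\Flash-Games\Flash-Games.img" type="file">',
    ]
    for line in lines:
        print(drive_letter(line), on_system_drive(line))
```

A regular expression is used instead of an XML parser because the quoted lines are opening tags only, so they would not parse as well-formed XML on their own.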
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Fri May 21, 2021 9:28 pm

VSS has been turned off. I was able to successfully reboot a node without the other crashing or any other issues. Looks like this is resolved. I will still work on moving the Flash Cache off of C:\ but will keep it. I promise you that the 172 and 192 networks are on different cards, and I can't find those MAC addresses on the machines. I'll look into adding more NICs to the machines to move sync off of the mgmt NICs, but I have to say that if there are no known downsides, I probably won't bend over backwards for this. Thanks for all the help!
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Sat May 22, 2021 8:52 am

You are always welcome! The MAC addresses could be different if the machines were restarted. Try running Get-NetAdapter and checking for similar MAC addresses (ones where only the last octet is different).
Let us know here if more assistance is required.
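
The MAC comparison described above can be sketched in a few lines of Python. This is only a heuristic, under the assumption that ports on one multi-port card often get MAC addresses that match in the first five octets. It does not prove that two ports share a card. The function names are hypothetical, and the sample addresses are the two quoted earlier in this thread.

```python
def parse_mac(mac):
    """Turn '78-2B-CB-46-95-1E' or '78:2b:cb:46:95:1e' into 6 raw bytes."""
    digits = mac.replace("-", "").replace(":", "").strip()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return bytes.fromhex(digits)


def likely_same_card(mac_a, mac_b):
    """Heuristic: the first five octets match and only the last octet differs."""
    a, b = parse_mac(mac_a), parse_mac(mac_b)
    return a[:5] == b[:5] and a[5] != b[5]


if __name__ == "__main__":
    # MAC addresses quoted earlier in the thread for 192.168.2.30 and 172.16.0.1.
    print(likely_same_card("78-2B-CB-46-95-1E", "78-2B-CB-46-95-1F"))
```

The MAC values to feed in can be copied from the MacAddress column of the Get-NetAdapter output on each node.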