Some HA Disks Not Able to Resync After Reboot

MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Fri Nov 26, 2021 4:27 am

Hello! I have a 2-node hyper-converged Hyper-V cluster running with a vSAN backend. After a recent reboot, two of my four HA disks aren't able to resync. They start syncing and then fail within 10 minutes. The other two disks were able to resync without an issue. The two functioning disks are Hyper-V's witness disk ("Witness") and a cluster shared volume for a cluster file server ("Data"); the two failing disks are a file share used on a different machine ("Games") and a disk for the cluster VMs ("VMs"). What can I look into to get these disks resynced? Each of my two servers has 2x10G links directly to the other for sync and 2x10G links to the network for heartbeat, Internet, etc.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Fri Nov 26, 2021 5:03 am

Hi, that is called "fast" synchronization, and it is expected to happen when a StarWind VSAN host restarts. Fast sync is the process that synchronizes data from the "active" server to the "offline" one. It is necessary to ensure the data is the same on both nodes.
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Sun Nov 28, 2021 4:26 am

Hey yaroslav, thanks for the response. Both of the devices that are failing are trying to do a full sync. They estimate that it will take ~90 minutes but fail after 5-10 minutes. They sit idle for a while then try again. My other disks were able to do a fast sync and recover as you described.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Mon Nov 29, 2021 8:41 am

Hi,

Please share the logs with me. Collect the logs from both servers as described here https://knowledgebase.starwindsoftware. ... collector/.
Share them via Google Drive, OneDrive, SharePoint, etc.
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Mon Nov 29, 2021 3:06 pm

Logs uploaded to Google Drive. Link DM'd to you.
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Fri Dec 03, 2021 3:22 am

Bump
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Fri Dec 03, 2021 7:41 am

Hi,

Thanks for your patience. I still need a little more time to check the logs.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Fri Dec 03, 2021 10:57 am

Both hosts have incorrect Discovery settings for local targets. See more here https://www.starwindsoftware.com/resour ... rver-2016/ under Provisioning StarWind HA Storage to Windows Server Hosts.
You are running an old build; please update.
I also see huge delays for iqn.2008-08.com.starwindsoftware:hvlabb.ad.cshaheen.tech-games4 on B, the synchronized node. You may also need to restart the service on B. Please take a backup of the VMs on the affected CSV -> stop the VMs using the affected CSV -> restart the service on B -> observe synchronization.
If B ends up out of synchronization too (a mutual not-synchronized state on A and B for the affected device), go to C:\Program Files\StarWind Software\StarWind\StarWindX\Samples\powershell on B and run SyncHaDevice.ps1 for HAImage5.

Please also note that your system has an unsupported network configuration: there is no dedicated physical Synchronization link, and iSCSI and Management traffic are mixed. That said, I also doubt that this deployment meets our best practices https://www.starwindsoftware.com/best-p ... practices/.
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Sat Dec 04, 2021 1:43 am

I've updated the Discovery settings.

Updated the build. Now I'm getting an error about console and service being different versions.

I tried rebooting B and that didn't help the sync.

I ran the script on B. The output was `Device HAImage5 is synchronized` though it was already synchronized.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Sun Dec 05, 2021 9:14 am

"Now I'm getting an error about console and service being different versions."
This is not an error; it is a warning that can be ignored in most cases. You can also update the console to make it go away.
Full synchronization must start on service restart on B. Did it finish?
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Mon Dec 06, 2021 4:20 am

No, synchronization continues to fail.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Mon Dec 06, 2021 4:46 am

I see huge delays for that volume. Did you restart the service on the active side the way I suggested?
Also, did StarWind VSAN change the synchronization type to Full synchronization?
I see huge delays coming from the underlying storage. Could you kindly tell me what the underlying storage configuration is, please?
Finally, can I have the updated logs, please?
MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Mon Dec 06, 2021 5:28 pm

I followed your steps except rather than restarting the service, I rebooted the whole node. I can try again if you note which services exactly to restart.

The synchronization has always been Full (as far as I know).

The underlying storage is 5 x Samsung SSD 860 EVO 4TB drives in a RAID 0 using a Dell H710 mini.

Fresh logs added to the same Google Drive link as before.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Mon Dec 06, 2021 7:56 pm

I will check the logs in due order.
Also, we do not recommend RAID0. Please see the recommended settings at https://knowledgebase.starwindsoftware. ... ssd-disks/
Full synchronization is expected because the active node was restarted as part of troubleshooting.

I'd like to draw your attention to the fact that the network configuration of your setup does not meet StarWind VSAN system requirements https://www.starwindsoftware.com/system-requirements.

Will keep you posted on log investigation.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Fri Dec 10, 2021 9:16 am

I can see this event on B every time the synchronization drops:
12/6 12:09:12.567314 1e3c IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: D:\VMs4\VMs4.img. Error code: (1).
Please try recreating the replica by using RemoveHAPartner and AddHAPartner. Run these scripts from the healthy node. Make sure the device priorities are the same for all devices on a given node. See more on priorities here https://forums.starwindsoftware.com/vie ... f=5&t=5731.
If you have no data on that HA device, try recreating it from scratch.
Alternatively, you can create a new IMG, migrate the data there, and replicate it. Please note that the iSCSI target becomes unavailable for a brief moment while the .img is converted to an HA device.
Also, note that you are running an unsupported network configuration for this system.
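
If it helps to confirm how often this event occurs and which image files it hits, here is a minimal Python sketch that scans log lines for it. The pattern is inferred only from the single log line quoted above, so the log format is an assumption and other builds may log it differently. The function names are hypothetical and are not part of any StarWind tooling.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this thread (an assumption):
# "<date> <time> <thread id> IMG: *** ImageFile_IoCompleted: Disk operation failed.
#  Disk path: <path>. Error code: (<n>)."
DISK_FAIL = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-fA-F]+) IMG: \*\*\* ImageFile_IoCompleted: "
    r"Disk operation failed\. Disk path: (?P<path>.+?)\. "
    r"Error code: \((?P<code>\d+)\)\."
)


def find_disk_failures(lines):
    """Return one dict (date, time, thread, path, code) per matching line."""
    failures = []
    for line in lines:
        match = DISK_FAIL.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


def failures_per_image(lines):
    """Count matching failure events per image file path."""
    return Counter(f["path"] for f in find_disk_failures(lines))


# The line quoted above, plus an unrelated (made-up) line that should be skipped.
sample = [
    r"12/6 12:09:12.567314 1e3c IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: D:\VMs4\VMs4.img. Error code: (1).",
    "some other log line that does not match",
]

for failure in find_disk_failures(sample):
    print(failure["date"], failure["time"], failure["path"], failure["code"])
```

To run it against a real log, pass an open file object instead of `sample`; both functions only need an iterable of text lines.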