Best practice for cluster recovery

StarWind VTL, VTL Free, VTL Appliance

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

adminr
Posts: 16
Joined: Tue Jan 14, 2025 12:11 pm

Thu Feb 13, 2025 2:45 pm

Hello,

I have been running crash tests on StarWind and ran into a problem. I need a solution in case a similar situation happens again.
How I got into the problem:
1. I moved vCenter to StarWind storage to extend the SW volume capacity. Done, works fine.
2. I shut a CVM down to extend its HDD. vCenter froze after that, and one of the ESXi hosts started glitching.
3. I rebooted that ESXi host. All VMs are working on the second ESXi node. The Web GUI on the first ESXi host was not available. The reason is a problem with the img file.
It took a long time to resolve the GUI problem. Finally I disabled the Sync and Replication links, and then I was able to reach the img file.
Before that, running ls on /vmfs/volumes made the ESXi console hang immediately.

Current condition
1. Physical disks - look the same for both nodes in the SW Web GUI
2. Storage pools - the same
3. Volumes - the same
4. The LUN is present but shows "Limited availability"
5. Network - all links have status "Up", even though I took the Sync and Replication NICs down on the second node.
6. All servers, including vCenter, are running on the second node. The SW LUN is available to the second node. On the first node only the CVM is running.
7. On the ESXi Devices tab for the first node, there is no StarWind disk for now. I think this is because the broken img is located on the CVM.

Is it a good idea to manually remove unnecessary or broken img files from /mnt/volume0 and from /opt/starwind/../../headres/LUN_Name if the PowerShell script doesn't remove these files?

What is the strategy for restoring the SW cluster in this case?
Support bundle for both nodes https://dropmefiles.com/AKwNv
yaroslav (staff)
Staff
Posts: 3334
Joined: Mon Nov 18, 2019 11:11 am

Thu Feb 13, 2025 3:53 pm

It looks like you did not check that the paths were up.
The reason is a problem with the img file.
I'm sorry. I don't follow. Could you please rephrase and tell me more about why you think the StarWind *img inside the VM is a problem for the ESXi host?
Sync and Replication
Do you mean Sync = Replication? Or do you mean Data and Sync?
4. The LUN is present but shows "Limited availability"
That's expected. Let synchronization finish, please.
7. On the ESXi Devices tab for the first node, there is no StarWind disk for now. I think this is because the broken img is located on the CVM.
If the IMG is broken, it will not show up on any machine.
Please make sure that:
1. The Data link is up.
2. MTU is set to 1500 in ESXi and in the VM.
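A quick way to verify the MTU end to end is a ping with the "don't fragment" bit set. The following is a minimal sketch for the ESXi shell, not an official procedure; the interface name and peer address in the example are placeholders you would replace with your own Data link values.

```shell
#!/bin/sh
# Sketch only: the vmkernel interface and peer IP below are placeholders.

# Largest ICMP payload that fits in one frame of the given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Ping the peer with "don't fragment" set. If this fails while a plain
# ping works, something on the path has a smaller MTU than expected.
check_mtu() {
    vmk="$1"; peer="$2"; mtu="${3:-1500}"
    vmkping -I "$vmk" -d -s "$(icmp_payload "$mtu")" -c 3 "$peer"
}

# Example, run on the ESXi host (vmk1 and 172.16.10.2 are made up):
# check_mtu vmk1 172.16.10.2 1500
```

For MTU 1500 the payload comes out as 1472 bytes, which is the value usually passed to vmkping -s for this kind of test.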
Is it a good idea to manually remove unnecessary or broken img files from /mnt/volume0 and from /opt/starwind/../../headres/LUN_Name if the PowerShell script doesn't remove these files?
I doubt the image needs recreating if you are heading toward full synchronization anyway. The "healthy" image will overwrite the FS on the affected one.
If you still want to remove the files manually:
1. Remove the replica to the affected node using RemoveHAPartner.ps1. If you already did that, great.
2. Connect to the affected VM over ssh.
3. Remove the files from /mnt/path/to/file.
4. Remove the files from /headers.
5. Recreate the replica.
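For steps 3 and 4 it may be safer to print the delete commands first and review them before running anything. A minimal dry-run sketch, assuming the replica was already removed in step 1: plan_cleanup is a hypothetical helper, and the directories are passed as arguments because the exact image and header locations are not spelled out here.

```shell
#!/bin/sh
# Dry-run sketch for steps 3 and 4: print the rm commands, do not run them.
# The directories are arguments on purpose; the real image and header
# locations on the CVM are not given in this thread.

plan_cleanup() {
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "# skipped, not a directory: $dir"
            continue
        fi
        # Regular files directly inside $dir, one rm line per file.
        find "$dir" -maxdepth 1 -type f | sort | while IFS= read -r f; do
            printf 'rm -- "%s"\n' "$f"
        done
    done
}

# Example (placeholder paths):
# plan_cleanup /path/to/image/dir /path/to/headers/dir
```

Nothing is deleted by this function; the printed lines are only a list to check against the device you actually mean to drop.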

P.S. dropmefiles looks to be a Russian resource. Be careful with your data.
adminr
Posts: 16
Joined: Tue Jan 14, 2025 12:11 pm

Fri Feb 14, 2025 10:46 am

I'm sorry. I don't follow. Could you please rephrase and tell me more about why you think the StarWind *img inside the VM is a problem for the ESXi host?
As soon as I hit the problem and restarted ESXi, I lost access to the ESXi web GUI, but SSH access was available. Even the RAID Manager plug-in for ESXi was working well.
While investigating, I found the source of the problem: the ESXi folder /vmfs/volumes. If I try to list the contents of that folder with the "ls" command, the ESXi console gets stuck.
I tried commands like these. They helped only twice and always finished with an error such as "device busy":
localcli iscsi session remove --adapter=vmhba64
localcli iscsi software set --enabled=false

After that I got a chance to log in to the ESXi Web GUI, but the Storage - Devices tab also hung, and I couldn't get a list of devices. Any interaction with iSCSI or the datastores produced critical ESXi errors. Finally, I disabled two links on the second node (Data/Sync and Replication), shut the CVM down, and unmounted the glitching devices and datastore. After that I was able to log in to the ESXi Web GUI without any issues.
yaroslav (staff)
Staff
Posts: 3334
Joined: Mon Nov 18, 2019 11:11 am

Fri Feb 14, 2025 11:27 am

I lost access to ESXi web GUI
This is likely related to ESXi, not the VM running in it.
Please see the suggestions from my previous post. This might be a result of something going sideways on the Initiator layer.
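One way to look at the initiator side from the ESXi shell is to count storage paths that are not active before shutting a CVM down. This is a rough sketch, assuming each path block in the esxcli output carries a "State:" line; count_bad_paths is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count storage paths that are not "active" before touching a CVM.

# Reads `esxcli storage core path list` output on stdin and prints how
# many paths report a State other than "active". Note that a "standby"
# path is counted too, so judge the number against your multipath setup.
count_bad_paths() {
    awk '$1 == "State:" && $2 != "active" { n++ } END { print n + 0 }'
}

# On the ESXi host:
# esxcli storage core path list | count_bad_paths
# esxcli iscsi session list    # sessions held by the software initiator
```

A result other than 0 means at least one path needs a closer look before a node goes down.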
adminr
Posts: 16
Joined: Tue Jan 14, 2025 12:11 pm

Fri Feb 14, 2025 2:29 pm

This is likely related to ESXi, not the VM running in it.
Yes, technically you're right, but the device was part of the CVM, and both CVM nodes were communicating via the Data link and the Replication link. When I fix it, I'll give you feedback.
adminr
Posts: 16
Joined: Tue Jan 14, 2025 12:11 pm

Sun Feb 16, 2025 10:20 am

I set the NICs up and the cluster returned to a healthy condition. The imgs on both nodes have synchronized.
But I still have a question on another topic. How can I shrink a device if an incorrect capacity was set?
StarWind (Extend_Device.ps1) accepts any size regardless of the available storage size.
Attachments
device_size.png (4.13 KiB) Viewed 4386 times
yaroslav (staff)
Staff
Posts: 3334
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 16, 2025 11:28 am

ESXi should handle that. The CVM does nothing extraordinary from the hypervisor's perspective other than publishing the storage.
Technically, you can shrink the device, but I'd rather not do that, because it is uncertain where the data sits on the file system and whether cutting off a slice of it would affect the data written there. Defrag should do the trick, yet the stakes are high.