How to replace non-failed VSAN storage

wallewek · Mon Oct 19, 2020 10:40 pm

This is a two-server Hyper-converged Microsoft Failover Server cluster.

I have an HA CSV drive that is working now, but based on hardware monitoring I expect it to fail soon, so I have a replacement coming. Is there a StarWind-recommended best practice for replacing working storage devices?

I've seen recommendations on replacing devices that have already failed, but that doesn't apply here.

So, assuming it doesn't make sense to try to leave one server running while I shut down the other one, if I was going to guess, I would go like this:

1. Shut down all cluster VMs and nodes, then stop and disable the cluster services on both hosts.
2. Put all StarWind devices into maintenance mode, stop the VSAN service and disable it on both hosts.
3. Shut down the server with the failing drive and attach the new drive.
4. Boot that server and copy the complete contents from the old drive to the new one by Windows copy, robocopy or something similar.
5. Swap drive letters so the new drive takes over from the old drive.
6. Remove the old drive.
7. Boot the server with the new drive in place.
8. Re-enable StarWind services.
9. Check that things look OK in the console.
10. Disable Maintenance Mode.
11. If it decides to resynchronize, wait until that stops.
12. Re-enable Microsoft Failover Cluster services and start them up.
13. Deal with whatever shows up.

I'm thinking maybe before I do this, I should pull at least one of my DCs out of the cluster. I haven't tried that before, going to have to see how it's done. Must be more to it than just taking the storage out of the CSV.

If you already have a document that covers this, I don't need you to comment on all of those steps unless you think it's worthwhile.

--- kenw

yaroslav (staff) · Tue Oct 20, 2020 3:25 am

Hey,

The procedure you shared can be performed without downtime (i.e., without Maintenance mode). Production can keep running on one server while the other server undergoes the maintenance, so you do not need to turn off the other server.
1. Make sure to have a backup. Copy StarWind.cfg too from C:\Program Files\StarWind Software\StarWind just in case.
2. In the StarWind Management Console, check that all StarWind HA devices have the “Synchronized” status on all servers;
3. Move all cluster resources and non-clustered VMs off the server that is going to undergo maintenance (pause the node and drain its roles in the cluster; non-clustered VMs can simply be shut down instead);
4. Disable StarWind Service on the host that undergoes the maintenance;
5. Swap the disk;
6. Wait for the rebuild to finish.
7. Set StarWind Service to Autostart and start it again;
8. Let sync go.
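
For anyone who wants to script the steps above, here is a minimal sketch rather than an official StarWind procedure. It backs up StarWind.cfg (step 1) and lists the PowerShell commands for draining the node and toggling the service (steps 3, 4 and 7). The service name "StarWindService", the helper names and the final Resume-ClusterNode step are assumptions that do not come from the post; check them against your own installation before use.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Path taken from step 1 above; adjust if StarWind is installed elsewhere.
STARWIND_CFG = Path(r"C:\Program Files\StarWind Software\StarWind\StarWind.cfg")
# Assumption: the Windows service is registered as "StarWindService".
STARWIND_SERVICE = "StarWindService"


def backup_config(cfg_path: Path, backup_dir: Path) -> Path:
    """Copy the config file into backup_dir under a timestamped name."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{cfg_path.stem}-{stamp}{cfg_path.suffix}"
    shutil.copy2(cfg_path, target)
    return target


def maintenance_commands(node: str, service: str = STARWIND_SERVICE) -> dict:
    """Return the PowerShell commands for each phase, without running them."""
    return {
        # Steps 3 and 4: drain the node, then stop and disable the service.
        "before_swap": [
            f"Suspend-ClusterNode -Name {node} -Drain",
            f"Stop-Service -Name {service}",
            f"Set-Service -Name {service} -StartupType Disabled",
        ],
        # Step 7: re-enable and start the service once the disk is in place.
        "after_swap": [
            f"Set-Service -Name {service} -StartupType Automatic",
            f"Start-Service -Name {service}",
        ],
        # Assumption: resume the node only after step 8 (sync) has finished.
        "after_sync": [
            f"Resume-ClusterNode -Name {node}",
        ],
    }
```

The commands are returned as plain strings so they can be reviewed before being pasted into an elevated PowerShell session on the host under maintenance. Step 2 (checking the "Synchronized" status) is still done in the StarWind Management Console.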

Speaking of DC, you can migrate the VMs one by one off the shared storage.

wallewek · Wed Oct 21, 2020 5:06 pm

Thank you Yaroslav, that's really helpful!

--- kenw

yaroslav (staff) · Wed Oct 21, 2020 5:59 pm

You are always welcome.

wallewek · Sun Oct 25, 2020 8:07 pm

Just a comment, it may seem trivial and obvious, but...

I think it's really cool how the Microsoft Failover Cluster software continues to see all Cluster Shared Volume (CSV) volumes as up and running, and continues to host the cluster virtual machines up and running, even when one of the StarWind VSAN hosts is down for maintenance or whatever.

It goes to show that the StarWind VSAN is a true virtual SAN, and that its Storage Area Network (SAN) service sits fully below and hidden from the Failover Cluster in a Hyper-Converged cluster. Sweet.

--- kenw

yaroslav (staff) · Mon Oct 26, 2020 4:22 am

Even if one server is down, storage is available from the partner as targets are connected over iSCSI from both servers.

Really glad that you enjoy the solution. Do not hesitate to contact us if any assistance is required.

wallewek · Fri Jan 05, 2024 11:35 pm

yaroslav (staff) wrote:The procedure you shared can be performed without downtime (i.e., without Maintenance mode). Production can keep running on one server while the other server undergoes the maintenance, so you do not need to turn off the other server.
1. Make sure to have a backup. Copy StarWind.cfg too from C:\Program Files\StarWind Software\StarWind just in case.
2. In the StarWind Management Console, check that all StarWind HA devices have the “Synchronized” status on all servers;
3. Move all cluster resources and non-clustered VMs off the server that is going to undergo maintenance (pause the node and drain its roles in the cluster; non-clustered VMs can simply be shut down instead);
4. Disable StarWind Service on the host that undergoes the maintenance;
5. Swap the disk;
6. Wait for the rebuild to finish.
7. Set StarWind Service to Autostart and start it again;
8. Let sync go.

I just tried this shortcut method -- yes, over 3 years later -- because a working drive was showing poor health status, and... well, it didn't work.

It didn't even start to synchronize, and I couldn't find a way to make it. I got messages on the failed volumes about not being licensed. I'm not sure it even "saw" the new bare drive. I tried using SyncHaDeviceAdvanced.ps1, but it failed because System.__COMObject does not contain a method named "MarkAsSynchronized".

I suspect it has something to do with iSCSI, not sure. The replacement drive was partitioned the same way, had the same drive letter, but I named it differently. Would that matter?

I have a feeling there's a lot implied but unstated in "5. Swap the disk". Like, cloning/copying all the data from the old disk to the new one (my previous step 4) before connecting as the replacement?

Anyway, I'm trying that now.

yaroslav (staff) · Sat Jan 06, 2024 12:00 am

The disk name and number can be different. The drive letter, though, should preferably stay the same so that StarWind VSAN can recognize it.
If that's a non-RAID configuration, swapping the disk implies that there is no data on the new disk.
Yes, please copy the files from the old disk, just in case. For RAID deployments, replacing the disk typically does not involve copying the data, but for a single disk, copying the data is crucial.
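
As an illustration of the copy step for a single non-RAID disk, here is a rough sketch that wraps robocopy. The drive letters, the helper names and the choice of flags are assumptions for the example, not settings recommended in this thread. It assumes the StarWind service is stopped on that host, so the device files are not in use while they are copied.

```python
import subprocess


def robocopy_command(source: str, dest: str) -> list:
    """Build a robocopy command line that copies a whole volume."""
    return [
        "robocopy", source, dest,
        "/E",          # include subdirectories, even empty ones
        "/COPYALL",    # copy data, attributes, timestamps, ACLs, owner, auditing
        "/DCOPY:DAT",  # keep directory data, attributes and timestamps
        "/J",          # unbuffered I/O, intended for very large files
        "/R:1",        # retry a failed file once...
        "/W:1",        # ...after waiting one second
        "/XD", "System Volume Information", "$RECYCLE.BIN",  # skip system dirs
    ]


def robocopy_succeeded(exit_code: int) -> bool:
    """Robocopy exit codes below 8 mean no file failed to copy."""
    return 0 <= exit_code < 8


def copy_drive(source: str, dest: str, dry_run: bool = True) -> bool:
    """Run the copy, or only print the command when dry_run is True."""
    cmd = robocopy_command(source, dest)
    if dry_run:
        print(" ".join(cmd))
        return True
    result = subprocess.run(cmd)
    return robocopy_succeeded(result.returncode)


if __name__ == "__main__":
    # Hypothetical letters: D: is the old disk, E: is the new one.
    copy_drive("D:\\", "E:\\", dry_run=True)
```

Note that robocopy does not use 0 as its only success code, which is why the exit code is compared against 8 rather than 0. /COPYALL needs an elevated prompt because it copies auditing information.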

Let me know how it goes.

wallewek · Sat Jan 06, 2024 5:32 am

You're right, yaroslav, this is a non-RAID setup. I'm deliberately pushing it, running as lean as possible, and using the VSAN as my RAID.

Over the years I've been so conditioned and accustomed to all the hardware redundancy of "real server iron", I could hardly conceive of it. It took me a while to wrap my head around this approach to server redundancy, but it seems legit. I mean, at some point, you have to stop piling redundancy on redundancy. Adding complexity increases both cost and the number of things that can fail. And even though I have a VSAN member down, my cluster keeps on ticking.

Update to follow.

yaroslav (staff) · Sat Jan 06, 2024 5:42 am

Got it. Thanks for your update. That makes sense, but those are slightly different approaches: RAID grants redundancy within the box, while VSAN helps to achieve cross-server redundancy. Also, StarWind VSAN should not be viewed as a backup solution unless tailored to that need with us.
Keep me posted and have a nice weekend!

wallewek · Sun Jan 07, 2024 1:42 am

Oh I agree about RAID vs backups. I run full backups to a separate BDR system with its own internal RAID plus swapped storage for offsite.

The VSAN is back online, after using Windows to copy content from the old drive to the new one. I had to stop/kill the VSAN service on both hosts to get it to restart synchronization. I suspect that might not have been necessary if I hadn't tried to restart synch with the empty drive first, I don't know. Not quite sure what the bare minimum drive configuration is before synch can start.

So, it worked. Probably not a recommended strategy for a production environment, but a good proof of concept.

If the drive had flat failed, I wonder if I could have gotten away with cloning or even just copying the drive from the other host...

yaroslav (staff) · Sun Jan 07, 2024 2:16 am

Thanks for your update.

wallewek wrote:I had to stop/kill the VSAN service on both hosts to get it to restart synchronization.

Restarting StarWind VSAN only on the affected server should be sufficient.

wallewek wrote:If the drive had flat failed, I wonder if I could have gotten away with cloning or even just copying the drive from the other host...

You could have just recreated the mirror to that drive and waited for full synchronization to complete.
The procedure is pretty much Remove Replica -> Recreate the replication partner.

wallewek · Sun Jan 07, 2024 5:36 am

yaroslav (staff) wrote:Thanks for your update.
I had to stop/kill the VSAN service on both hosts to get it to restart synchronization.
Restarting StarWind VSAN only on the affected server should be sufficient.

I tried that first.

If the drive had flat failed, I wonder if I could have gotten away with cloning or even just copying the drive from the other host...
You could have just recreated the mirror to that drive and waited for full synchronization to complete
The procedure is pretty much Remove Replica -> Recreate the replication partner.

Ah. Not sure how to do that with PowerShell, and don't need to now. But thanks so much for your help!

Oh, hey, I've done some PS tweaks I don't mind sharing, if you're interested. One of them is a script for Windows Cluster-aware updating, that checks to make sure the VSAN is synchronized before letting host servers reboot. Is that of any interest?
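
The script mentioned here was not posted in the thread. As a rough illustration of the idea only, and not wallewek's actual code, the gating logic could look like the sketch below. How the device statuses are fetched is deliberately left to a caller-supplied function, since the real StarWind query is not shown in this thread; the device names and function names are hypothetical, and only the "Synchronized" status string comes from the posts above.

```python
import time
from typing import Callable, Dict


def all_synchronized(statuses: Dict[str, str]) -> bool:
    """True only if at least one device is reported and all are in sync."""
    return bool(statuses) and all(
        status == "Synchronized" for status in statuses.values()
    )


def wait_until_synchronized(
    get_statuses: Callable[[], Dict[str, str]],
    timeout_s: float = 3600,
    poll_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_statuses() until every device is synchronized or time runs out.

    Returns True when a reboot may proceed and False when the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if all_synchronized(get_statuses()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Hypothetical stand-in for a real status query against both hosts.
    def fake_statuses() -> Dict[str, str]:
        return {"CSV1": "Synchronized", "Witness": "Synchronized"}

    if wait_until_synchronized(fake_statuses, timeout_s=10, poll_s=1):
        print("All HA devices synchronized; reboot may proceed.")
    else:
        print("Timed out waiting for synchronization; do not reboot.")
```

An empty status list is treated as "not synchronized", so a failed query blocks the reboot instead of silently allowing it.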

yaroslav (staff) · Sun Jan 07, 2024 1:06 pm

Hi,

Yes, indeed! I guess that one will be helpful for the community.
Thanks a lot.

wallewek · Mon Jan 08, 2024 1:02 am

yaroslav (staff) wrote:Hi,

Yes, indeed! I guess that one will be helpful for the community.
Thanks a lot.

Sorry, I don't remember: What's the best way to share stuff like that? Just paste it in-line as code? That would make it easy to include notes...