PowerShell to recreate replica?

wallewek · Fri Oct 27, 2017 4:21 pm

PoSaP wrote:
Guys, sorry for interrupting your discussion, but I think you forgot about one thing, wallewek.
While copying, you have three files in the StarWind folder: one *.img, plus the *.swdsk and *.swdsk_HA header files, and the headers are different on each host.

I certainly welcome and appreciate your contribution, PoSaP.

I hadn't really "forgotten" about the potential differences between the host files. I was wondering how they affected the process. StarWind support seems to say storage replacement rebuild is a simple copy, but copying what, to where, is not clear to me.

Copying the actual contents of a VSAN virtual image seems both obvious and unnecessary, as synchronization handles that -- once it can be persuaded to start. But it seems we first need to prepare the host storage properly before sync will recognize it as a replica to be synced.

Copying the contents of the physical host storage being replaced, so that sync will accept it (reminds me of organ transplant rejection), raises a couple of key questions:

1. Is it really necessary to copy complete image files (which appears to require the cluster to be down during that long copy), given that the replication/sync will immediately overwrite them? My attempts to avoid this have so far failed (and these were using the "failed" host's own files -- I haven't tried a cross-host copy yet), so I'm going to try a complete copy. I haven't decided yet from which host.

2. In a true hardware failure recovery situation, the original StarWind files would be lost on one host. Can we really manually copy the whole works from the other host? To me, that's what Boris's comments seem to imply. There's apparently no script, commands or documentation to deal with this, in the absence of the licensed StarWind management GUI. So your point is right on: are there, say, edits to the metadata that need to be done, if we copy files from one host to another? Boris's comments seem to say not. I've been meaning to look in them to see what might require it.

It's interesting that this doesn't appear to have been discussed here in any detail previously.

I've been taking the approach that one should be able to somehow replace a failed host using the remaining host, without using the StarWind GUI. Maybe I should give up on that whole approach, and make occasional host backups that I can use to restore each entire host, even though I'm already backing up all the guests individually. Talk about redundant redundancy.

PoSaP, if you have specific suggestions as to what post-copy metadata edits might be needed, would you mind posting them here?

-- Ken

wallewek · Sun Oct 29, 2017 12:47 am

Okay, it looks like I have a working process for replacing failed storage on one host, without using the SW GUI. And it requires zero cluster down time or iSCSI related commands.

Short summary: I had to copy/replace files at host level only, using (a) the metadata files from the failed server and (b) image files from the good server. I did not find any way to avoid copying image data twice, but at least it did not require cluster/guest downtime.

Note: in my case, the storage I was replacing had not yet actually failed, so I took a couple of extra steps to swap the live "f:" and replacement "r:" drives. I found that stopping the StarWind VSAN service while changing drive letters usually worked; otherwise I rebooted.

Steps:
0. Before storage failure or drive replacement, make backup copies of the image folders and metadata (*.swdsk*) files on the target server. Personally, I plan to set up a scheduled task to do that daily on both servers. I used (e.g.) "robocopy f:\ c:\swbackup\ /mir /max:1000000" (/min or /max filter by file size, which here separates the small metadata files from the large image files).
1. Attach the new storage to the failed server and partition it appropriately. I temporarily labelled mine "r:"; that may not have been the best tactic.
2. Copy the folders and metadata files from backup to the new partition using the reverse of the above robocopy command.
3. Copy the image file(s) from the running server to the new storage. I used (e.g.) "robocopy \\kmhv4\f$ r:\ /min:1000000". Fortunately there's no file access conflict.
4. Check that the locations and sizes of the new StarWind folders and files match the other server. (They should, if I've typed this all correctly.)
5. Swap drive letters if required, restarting the service.

Sync should restart (it did for me), completely updating the replaced drive's image.
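
Step 4 above can be scripted rather than checked by eye. The following is a minimal sketch in Python (not part of the original procedure), assuming both storage roots are reachable as ordinary paths. The function names and the `min_size` parameter are hypothetical. It compares relative paths and sizes only, not file contents.

```python
from pathlib import Path


def inventory(root, min_size=0):
    """Map each file's path (relative to root) to its size in bytes.

    Files smaller than min_size are skipped, similar in spirit to
    robocopy's /min filter.
    """
    root = Path(root)
    sizes = {}
    for p in root.rglob("*"):
        if p.is_file():
            size = p.stat().st_size
            if size >= min_size:
                sizes[p.relative_to(root).as_posix()] = size
    return sizes


def compare_layout(reference_root, replacement_root, min_size=0):
    """Return a list of differences between two storage roots."""
    ref = inventory(reference_root, min_size)
    new = inventory(replacement_root, min_size)
    problems = []
    for name in sorted(ref.keys() | new.keys()):
        if name not in new:
            problems.append(f"missing on replacement: {name}")
        elif name not in ref:
            problems.append(f"unexpected on replacement: {name}")
        elif ref[name] != new[name]:
            problems.append(
                f"size differs: {name} ({ref[name]} vs {new[name]})"
            )
    return problems
```

For the drives in this example it might be called as `compare_layout(r"\\kmhv4\f$", "r:\\", min_size=1_000_000)`, where the size floor restricts the check to the image files. An empty list means the layouts match.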

Thanks to PoSaP for reminding me about the metadata files.

--Ken

wallewek · Sun Oct 29, 2017 7:48 am

Michael (staff) wrote:
wallewek wrote:
It looks to me that, in a (Server 2016) two-host cluster, if I stop the StarWind VSAN service on one host, it automatically stops that service on the other host as well! Am I imagining things? Kind of makes me wonder how we are supposed to proceed, if we want to keep the cluster online while taking one host offline for maintenance.

Hello wallewek,
Could you please collect the logs from both nodes using StarWind Log Collector https://knowledgebase.starwindsoftware. ... collector/ and log a support case here: https://www.starwindsoftware.com/support-form ?

As for the file copying, PoSaP is correct: you have to edit all StarWind configuration files (.swdsk) and the StarWind config to restore HA manually. StarWind VSAN will do a full synchronization in any case.

Sorry I missed this posting. I was mistaken about the other server's service stopping.

From what I've seen it's definitely not enough to copy and/or edit the swdsk* files. Sync will NOT start until there is a credible image file already in place. This is not at all well documented.

-- Ken

Tue Oct 31, 2017 10:03 am

Hello Ken,
I am happy that your test has been completed. As I wrote before, it is possible to restore HA if you have all files prepared and edited on a failed partner node. But your steps involve copying the data twice: first you copied the .img file to the failed server, and then StarWind did a full synchronization there.
It is much faster to follow the approach proposed by Boris: create one more HA device, connect it via iSCSI, add it to the cluster, and then copy/move the data (VMs) to the newly created storage. For example, storage migration can be used: http://blogs.msdn.com/b/clustering/arch ... 98203.aspx
I do realize that it is not the ideal option, but it can be used while the script for replica recreation is under development.

wallewek · Tue Oct 31, 2017 2:46 pm

Thank you, Michael. But, although Boris has been very helpful elsewhere, in this case unfortunately it was completely unclear to me what he was suggesting. He gave no examples or references, and did not clarify when asked. I had no clue what he meant, until you expanded on it.

1. So far as I can see, my posting is the ONLY documentation anywhere on this site which describes, step by step, how to replace failed storage without using the GUI, and without completely replacing a non-redundant HA volume on the whole cluster, which would temporarily double storage requirements -- which _could_ involve down time to add, and for which I don't know if scripts exist either. You could make a KB article from it.

2. While my method is slow in elapsed time, it involves zero cluster down time, so that's of small concern.

3. As PoSaP points out, it should be possible to do this by cross-copying and editing the *.swdsk* files, rather than restoring backups. (Personally I think they _should_ be backed up as standard practice, probably to the partner hosts, and you should recommend it.) But I cannot find any documentation on exactly what should be edited, and having the option (I was doing proactive replacement), I preferred not to guess. You should provide that info, too, IMHO.

It occurs to me that, while the VSAN itself is fully redundant, its metadata is not: the *.swdsk* files are unique to each host, and exist on only that host, in an area one would probably not back up, assuming it is being replicated. (I don't recall any KBs on host backup recommendations, either.) Something to think about?

-- Ken

Wed Nov 01, 2017 11:29 pm

Ken, thank you for your feedback and recommendations.
Unfortunately, the StarWindX module is still in development, but I believe we will publish several KB articles on managing devices without using the GUI.
As for editing the *.swdsk files, you should be very careful, since any mistake could leave the device in a non-active state or even cause a wrong synchronization. These files can be changed by the StarWind service while it is running, and each modification is unique (especially for the HA file) and depends on the configuration. I hope this explains why it makes little sense to back up .swdsk files: you may back up a wrong state of the HA device. So I would suggest leaving these files to the StarWind service.

wallewek · Thu Nov 02, 2017 3:05 pm

Thank you very much, Michael. That explains a lot.

FWIW, I have not been able to find any KB articles or technical papers describing non-GUI or script-based management. All I have found are in this forum and in the download itself.

-- Ken

Mon Nov 06, 2017 10:14 am

Hello Ken,
I believe it will be done once the StarWindX module is completed. I am sorry for any inconvenience this has caused.
Anyway, we are always ready to help you here!

zendzipr · Fri Jun 15, 2018 5:17 pm

Sorry to re-activate an old thread, but this is the only one I have been able to find that comes close to dealing with a node failure and recovery.

I am attempting to test a total node failure, which requires a full rebuild of the operating system and installation of all software (StarWind, etc.).

I have yet to find any details on how to do this. Some of what is described here appears helpful, however so far, nothing works.

GetHASyncState.ps1 looks useful; however, no matter how I test, similar failures come back.

Host information
node1 10.252.37.2
node2 10.252.37.3

Running GetHASyncState.ps1 from the system with data

$server = New-SWServer -host 127.0.01 -port 3261 -user root -password starwind
$partnerTargetName = "iqn.2008-08.com.starwindsoftware:10.252.37.103-vmm"

Results

PS C:\Users\Administrator\Desktop\SAN Manager> .\syncStatus.ps1
HAImage1

Device not synchronized. Synchronize current node from partner 'iqn.2008-08.com.starwindsoftware:10.252.37.103-vmm'
Exception Error: 
200 Failed: connection with partner node is invalid..

Alternately, when doing the same thing from node 2, I get error HAImage1 not found.
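
Before digging deeper into an error like the one above, it may be worth double-checking the address literals in the script settings. The following is a rough Python sketch of such a check, not a StarWind tool: `check_addresses` is a hypothetical helper, and its notes are only prompts for a second look. In particular, an IP embedded in an iSCSI target name may be just a label, and some resolvers accept shorthand such as `127.0.01`, so neither note proves a misconfiguration.

```python
import ipaddress
import re

# Matches a dotted-quad pattern anywhere in a string (no range checking).
DOTTED_QUAD = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def check_addresses(host, partner_target, node_ips):
    """Return human-readable notes about address values worth a second look."""
    notes = []
    try:
        # Strict parse: exactly four octets are required.
        ipaddress.IPv4Address(host)
    except ValueError:
        notes.append(f"host {host!r} is not a strict dotted-quad IPv4 address")
    for ip in DOTTED_QUAD.findall(partner_target):
        if ip not in node_ips:
            notes.append(
                f"{ip} in the target name is not among the listed node IPs"
            )
    return notes


# Values copied from the post above.
for note in check_addresses(
    "127.0.01",
    "iqn.2008-08.com.starwindsoftware:10.252.37.103-vmm",
    ["10.252.37.2", "10.252.37.3"],
):
    print(note)
```

With the values from the post, both checks produce a note: the host literal has three parts rather than four, and 10.252.37.103 is not one of the two node addresses listed under "Host information".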

Is there a step by step procedure for recovering from a total node failure?

Mon Jun 18, 2018 11:26 pm

Currently, no PowerShell functionality is available for what you are trying to achieve. If a node fails completely, be sure to back up the StarWind configuration file (StarWind.cfg) from the StarWind installation folder, unless your disk subsystem has failed completely. When you redeploy the OS and install StarWind VSAN there again, stop the StarWind service and then put the backed-up config in place. The service will then recognize all existing settings and replicas for the devices. If you encounter a disk failure and need to recreate replicas for the devices from the other node(s), contact StarWind Support. Be sure to check for availability of this option before you do so, as it is planned to be introduced in one of the upcoming builds.
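
Keeping a copy of StarWind.cfg somewhere off the node could be automated with a small script run as a scheduled task. The following is a minimal Python sketch under stated assumptions: `backup_config` is a hypothetical helper, and the paths in the usage note are placeholders, not confirmed StarWind defaults. It only copies the file under a timestamped name and prunes old copies. It does not stop or start any service.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path, backup_dir, keep=7):
    """Copy a config file into backup_dir under a timestamped name.

    Only the newest `keep` copies (keep >= 1) are retained. Returns the
    path of the copy that was just written.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    src = Path(config_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Timestamped names sort chronologically when sorted as text.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps

    backups = sorted(dest_dir.glob(f"{src.stem}-*{src.suffix}"))
    for old in backups[:-keep]:
        old.unlink()
    return dest
```

A call might look like `backup_config(r"C:\path\to\StarWind\StarWind.cfg", r"\\partner\swbackup")`, with both paths replaced by the real installation folder and a share that survives the loss of the node.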