How to get maintenance status


wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Thu Oct 15, 2020 8:26 pm

I appreciate the new MaintenanceMode PowerShell script, but it leaves me with a question or two. Unfortunately, the StarWindX PDF documentation appears to be very out of date.

Is there any "get" version of the PowerShell "SwitchMaintenanceMode" function that it uses? It would really be nice to be able to list the maintenance status of devices.

Is there any way to list all the PowerShell functions inside the StarWindX module? I don't mind writing my own PowerShell scripts if I had some sort of list of what I could use. I tried using Get-Command, but SwitchMaintenanceMode doesn't appear to even be listed.

Thanks if you have any suggestions!

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Fri Oct 16, 2020 7:52 am

May I help you with your queries about the MaintenanceMode script?
There is already a sample script that you can adapt to your individual needs. Yes, it is only a sample, and you are welcome to modify it as you wish (unless it breaks things, of course).
Are you looking for the list of functions or scripts inside the StarWindX module? It would be cool to have a script from you, of course :D , but if you are requesting a documentation update, that's something we should do, I believe.
So my questions are:
1. How can I help you with the MaintenanceMode script?
2. Are you requesting a documentation update?
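In the meantime, a note on why Get-Command may come up empty: in the MaintenanceMode sample script, SwitchMaintenanceMode is called as a method on a device object ($device.SwitchMaintenanceMode(...)), and Get-Command lists only commands, not object methods. Get-Member is the usual way to look at an object's methods and properties. Below is a minimal sketch, assuming the StarWindX objects expose their type information to PowerShell (not verified here). Get-ObjectMemberName is a made-up helper name, and the address and credentials are simply the defaults from the sample scripts.

```powershell
# Hypothetical helper (not part of StarWindX): list the method and property
# names an object exposes. Get-Command cannot see these, because they belong
# to objects rather than to the module's command list.
function Get-ObjectMemberName {
    param([Parameter(Mandatory)] $InputObject)
    Get-Member -InputObject $InputObject -MemberType Method, Property |
        Select-Object -ExpandProperty Name -Unique
}

# Everything below needs the StarWindX module and a reachable StarWind service,
# so it is skipped on machines where the module is not installed.
if (Get-Module -ListAvailable -Name StarWindX) {
    Import-Module StarWindX

    # Commands exported by the module (this is all that Get-Command can list)
    Get-Command -Module StarWindX | Sort-Object Name | Format-Table Name, CommandType

    # Assumption: default local address and credentials, as in the sample scripts
    $server = New-SWServer -host 127.0.0.1 -port 3261 -user root -password starwind
    $server.Connect()
    try {
        "--- server members ---"
        Get-ObjectMemberName $server

        # One device is enough to see what a device object offers
        $device = $server.Devices | Select-Object -First 1
        if ($device) {
            "--- device members ---"
            Get-ObjectMemberName $device
        }
    }
    finally {
        $server.Disconnect()
    }
}
```

If SwitchMaintenanceMode shows up in the device member list, any matching "get" method or property for maintenance mode should appear in the same list. If nothing like it is there, that would support the documentation request.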
wallewek
Fri Oct 16, 2020 3:42 pm

Thank you very much for your response, Yaroslav. I'm sorry, I guess I wasn't very clear.

The biggest problem with the MaintenanceMode script, as I see it, is its lack of reporting. It provides absolutely no way to report on the maintenance mode of any device without changing it.

That's what I want for Maintenance Mode status: a way to GET it, not set it.

As for the documentation, yes, I do request a documentation update. It doesn't need to be verbose, necessarily, but it would be very helpful if it at least fully covered the complete list of set and get functions.

I have been through StarWindX from beginning to end, and appreciate it very much, but it doesn't even provide the bare essential information necessary to get maintenance mode. I would happily add it to the GetHASyncState script I've provided here before, but I can't, without that info.

--- kenw
yaroslav (staff)
Fri Oct 16, 2020 4:15 pm

Hi,

Thanks for the hint. I will request better documentation for StarWindX and, I guess, a new script for checking whether Maintenance Mode is enabled.
By the way, you can check it in StarWind Management Console as the free version provides the monitoring capabilities.
wallewek
Fri Oct 16, 2020 11:03 pm

Thank you Yaroslav.

Yes, I did not expect to be able to use the StarWind Management Console to check maintenance mode. It does work nicely. That kind of makes using a script to check statuses less important. I have made a bunch of changes to the MaintenanceMode script to make it more clear what is happening, etc. I will provide a copy soon.

However, I now have a problem after replacing a failing drive (it had not yet failed, but diagnostics said it soon would). The replacement seemed to me to be working OK, but now, after final mounting in the server (I had it mounted externally for testing), I have a cluster problem. The Microsoft Failover Cluster Manager refuses to bring the CSV drive online, citing error 1460. And I was very careful to use Maintenance Mode before shutting down to do that, too.

The network statuses and iSCSI stuff all look OK too.

I can't see anything wrong with the drive, chkdsk thinks it's fine. However, I'm suspicious of the synchronization status. Using PowerShell, I'm getting HAstatus results that say that the devices are all synchronized, but that the synch percentage is zero -- on some devices, not all.

What's weird is that the StarWind Management Console says the drives are just fine. Quite confusing. I'm quite concerned about PowerShell and the Management console not reporting the same thing.

I'm trying a full one-server-at-a-time reboot without maintenance mode. So far, there's a full resynch taking place on the CSV, but it still refuses to come online in the Microsoft FCM. Not sure how to troubleshoot that.

-- kenw
wallewek
Sun Oct 18, 2020 9:50 pm

OK, the Failover Cluster is back online. I'm really not sure what the real problem was; diagnostics were quite unclear. I suspect something mucked up with the iSCSI targets. It probably doesn't help that the host servers are both domain members, but the domain controllers are virtual guests of the cluster, so can't be reached while bringing up cluster services if the whole cluster is offline.

One thing is consistent, though: PowerShell reporting of device synchronization percentage continues to be... basically broken.
Devices that report as Synchronized in both PowerShell and the StarWind Management Console are showing 0% synchronized. Interestingly, it seems to wander around a bit as to which devices report this way. It's gotten to the point where I basically ignore the PowerShell output; I simply can't trust it.

While I'm here, I'm going to paste in a couple of PowerShell scripts I've been working on for improved friendliness. Do with them what you will.

First, here's the latest GetHASyncState script. It handles everything but device naming, and shows both sides of a two-node VSAN.

Code: Select all

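#
# Report the synchronization status of every HA device on this node, plus what
# the node knows about its partner. The resync commands further down are
# commented out, so this script only reads status.
#

"GetHASyncState -- KW version"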

Import-Module StarWindX

while ($true) {

"------------- Running ---------------"

try
{
    #
    # connect to the server
    #
    $server = New-SWServer -host 127.0.0.1 -port 3261 -user root -password starwind

    $server.Connect()

    #
    # Try to find specified device
    #
    $deviceName = "*"
    $partnerTargetName = "*"
    $deviceFound = $false
    
    foreach($device in $server.devices)
    {
            $device.Refresh()
            
            $syncStatus = $device.GetPropertyValue("ha_synch_status")

            Write-Host ""
            Write-Host -NoNewline "DeviceName:" $device.GetPropertyValue("DeviceName") -foreground yellow

            If ( $syncStatus -gt "" )
            {
                $state = $device.GetPropertyValue("state")
                
                Write-Host " -- TargetName:" $device.GetPropertyValue("TargetName") 
                Write-Host -foreground yellow -NoNewline "Device state:" $state 
                
                $stateText = switch ($state)
                {
                    0 { " (Active)" }
                    1 { " (NonActive)" }
                    2 { " (NotLicensed)" }
                    3 { " (Disabled)" }
                    default { "(Undefined)" }
                }

               Write-Host $stateText  -foreground yellow

               Write-Host -NoNewline "Sync Status:" $syncStatus "" -foreground yellow

               $waitForAutoSync = $device.GetPropertyValue("ha_wait_on_autosynch")
                
                if ( $waitForAutoSync -eq "1" )
                {
                    Write-Host "(Waiting for autosynchronization)" -foreground yellow
                }
                else
                {
                    if ( $syncStatus -eq "1" )
                    {
                        #
                        # Device is synchronized. Get synchronization percent and show it
                        #
                        $syncPercent = $device.GetPropertyValue("ha_synch_percent")                        
                        Write-Host "(Synchronized) -- $($syncPercent)%" -foreground yellow

                    }
                    if ( $syncStatus -eq "2" )
                    {
                        #
                        # Device is synchronizing. Get synchronization percent and show it
                        #
                        $syncPercent = $device.GetPropertyValue("ha_synch_percent")
                        Write-Host "(Synchronizing) -- $($syncPercent)%" -foreground yellow
                    }
                    
                    if ( $syncStatus -eq "3" )
                    {
                        #
                        # Device not synchronized. Synchronize current node from partner
                        #
                        Write-Host "Device not synchronized."
                        
                        # Synchronize current node from partner '$($partnerTargetName)'" -foreground yellow

                       # $params = new-object -ComObject StarWindX.Parameters        
                       # $params.AppendParam("deviceID",$device.DeviceId)
                       # $params.AppendParam("partnetTargetName",$partnerTargetName)
                        
                       # $server.ExecuteCommand( 0, "restoreHAPartnerNode", $params)
                        
                        #
                        # If you want to synchronize partners from current node you can comment out code above and uncomment section below
                        #
                        
                        # Device not synchronized. Mark current node as 'Synchronized'. 
                        # WARNING, Command changes Device Status to "Synchronized" without Data Synchronization with HA (High Availability) Partner, 
                        # Device will start processing Client Requests immediately and will be used as Data Synchronization Source for Partner Device.
                        #
                        #Write-Host "Device not synchronized. Mark current node as 'Synchronized'. " -foreground yellow

                        #$params = new-object -ComObject StarWindX.Parameters        
                        #$params.AppendParam("deviceID",$device.DeviceId)
                        
                            #$server.ExecuteCommand( 0, "restoreCurrentHANode", $params)
                        
                       # Start-Sleep -m 5000
                    }
                }
#=======
                Write-Host "ha_partner_nodes_count:" $device.GetPropertyValue("ha_partner_nodes_count")

                $status = $device.GetPropertyValue("ha_partner_node1_sync_status")
                $statusText = switch ($status)
                {
                    1 { " (Synchronized)" }
                    2 { " (Synchronizing)" }
                    3 { " (NOT Synchronized)" }
                    default { "(Undefined)" }
                }
                Write-Host "ha_partner_node1_sync_status: $($status)$($StatusText) --" `
                            $device.GetPropertyValue("ha_partner_node1_sync_percent"), "%"
                Write-Host "ha_partner_node1_is_exist_sync_valid_connection:" `
                            $device.GetPropertyValue("ha_partner_node1_is_exist_sync_valid_connection")
                Write-Host "ha_partner_node1_is_exist_heartbeat_valid_connection:" `
                            $device.GetPropertyValue("ha_partner_node1_is_exist_heartbeat_valid_connection")
                
               # Write-Host "ha_partner_node2_sync_status:" $device.GetPropertyValue("ha_partner_node2_sync_status")
               # Write-Host "ha_partner_node2_sync_percent:" $device.GetPropertyValue("ha_partner_node2_sync_percent")




            } 
            else
            {
                write-host -NoNewLine " -- no info"
            }
    }
    
#    if ( $deviceFound -ne $true )
#    {
#        Write-Host "$($deviceName) not found" -foreground red
#    }

}
catch
{
    Write-Host "Exception $($_.Exception.Message)" -foreground red 
}

Write-Host ""

# Disconnect only if the server object was actually created
if ($server) { $server.Disconnect() }
pause

}

Second is MaintenanceMode-Choose.ps1, which allows setting or unsetting maintenance mode, with or without the Force option. It uses a GUI dialog box to choose the options, so no code editing is needed for that, but you do have to edit it initially for your device names. It doesn't tell you what mode the VSAN is already in, as I know of no way to do that, but it is at least clear about what it does and tells you where to check.

Code: Select all

#modified KW
#NOTE, running on ANY node sets Maintenance mode for a given device on ALL nodes.
#Cluster services should be fully stopped BEFORE using maintenance mode, and all devices should be put into maintenance mode together.

param(
    $addr="127.0.0.1", $port=3261, $user="root", $password="starwind", `
    $deviceName1="HAImage1", `
    $deviceName2="HAImage2",`
    [string]$enable=$true,`
    [string]$force=$true
    )


Import-Module StarWindX

switch( $Host.UI.PromptForChoice("VSAN Mode Selection",`
        "Set StarWind Maintenance Mode", `
        @("Enable"; "Enable with FORCE"; "Disable"; "Cancel" ),3) ) 
    {
    0 { $enable=$true
        $force=$false
        Write-Host "Enable selected WITHOUT Force" }
    1 { $enable=$true
        $force=$true
        Write-Host "Enable selected WITH Force" }
    2 { $enable=$false
        Write-Host "Disable selected" }
    3 { Write-Host "Cancelled -- no action taken"
        exit }
    }


Pause

try
{
	Enable-SWXLog

	$server = New-SWServer $addr $port $user $password

	$server.Connect()

#device 1
	$device = Get-Device $server -name $deviceName1
	if( !$device )
	{
		Write-Warning "Device {$deviceName1} not found"
		return
	}

	#params: enable, force
    $device.SwitchMaintenanceMode([bool]::Parse($enable), [bool]::Parse($force))

    Write-Host $deviceName1 "maintenance mode enablement now:" $enable


#device 2
    $device = Get-Device $server -name $deviceName2
	if( !$device )
	{
		Write-Warning "Device {$deviceName2} not found"
		return
	}

	#params: enable, force
    $device.SwitchMaintenanceMode([bool]::Parse($enable), [bool]::Parse($force))

    Write-Host $deviceName2 "maintenance mode enablement now:" $enable

}
catch
{
	Write-Error $_
	exit 1
}
finally
{
#	$server.Disconnect()

    Write-Host "Maintenance mode status may be verified in the StarWind Management Console"


}

Neither is guaranteed bug-free, of course. It would actually be kinda nice if there were a better way to share scripts here.
yaroslav (staff)
Mon Oct 19, 2020 3:21 am

Greetings,
"but the domain controllers are virtual guests of the cluster, so can't be reached while bringing up cluster services if the whole cluster is offline."
For DCs, we recommend using 2 local VMs rather than clustered ones. Learn more at https://knowledgebase.starwindsoftware. ... san-usage/.
"Devices that report as Synchronized in both PowerShell and the StarWind Management Console are showing 0% synchronized"
Can I have the logs, please, to confirm that the devices are synchronized? Please collect them with StarWind Log Collector (https://knowledgebase.starwindsoftware. ... collector/) and share them via Google Drive.

Thank you for the script. I have already requested adding a Maintenance Mode check with PowerShell to StarWindX.
wallewek
Mon Oct 19, 2020 7:17 pm

Interesting, I do have dual DCs, I'll have to think on that.

I've uploaded the logs as requested and PMed you the link.
wallewek
Wed Oct 21, 2020 5:24 pm

Regarding the PowerShell reporting of synch percentage, I'm beginning to get the feeling that StarWind doesn't bother to report sync percentage accurately unless (a) the devices are not synchronized and (b) synchronization is in progress.

In other words, unless device state is 0 (Active) and Sync Status is 2 (Synchronizing), the value of the ha_synch_percent property as read from PowerShell should be ignored. A fully synchronized device might report as zero percent synced, but that is irrelevant.

That's pretty much how the StarWind Management Console works: it only reports percentage synchronized while synch is in progress.
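
If that holds, the percentage should simply be left out of the display unless a sync is running. Here is a minimal sketch of that rule as a helper for GetHASyncState. The status codes 1/2/3 are the ones the script already uses, Format-HASyncState is my own name for it, and the rule itself is my guess rather than anything StarWind has confirmed.

```powershell
# Working assumption from this thread (not confirmed by StarWind): the percent
# value is only meaningful while a synchronization is in progress (status 2).
function Format-HASyncState {
    param(
        [Parameter(Mandatory)] [string] $SyncStatus,   # value of ha_synch_status
        [string] $SyncPercent = ''                      # value of ha_synch_percent
    )
    switch ($SyncStatus) {
        '1'     { 'Synchronized' }                          # percent ignored on purpose
        '2'     { "Synchronizing -- $($SyncPercent)%" }     # percent shown only here
        '3'     { 'NOT synchronized' }
        default { "Undefined ($SyncStatus)" }
    }
}
```

Inside the device loop it would be called as Format-HASyncState -SyncStatus $syncStatus -SyncPercent $device.GetPropertyValue("ha_synch_percent"), after the existing check that $syncStatus is not empty.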

You might want to pass that on internally, or confirm.

Oh, and just a quick update -- I have now pulled one DC out of the Microsoft Failover cluster, as you recommended. It makes sense. I might do the other as well, onto the other server.

--- kenw
yaroslav (staff)
Fri Oct 23, 2020 6:47 am

Greetings.

The script periodically queries the synchronization status and displays it to the user.
I believe I have solved the riddle of the synchronization percentage. The output has something to do with the HA device priority. If you look carefully at your outputs, one says 100% synchronized for a local device while another says 0% synchronized for another local device. The HA device priorities in your setup are mixed.
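
To see which priority each node currently has before editing anything, the headers can be read without modifying them. This is a minimal sketch, assuming the *_HA.swdsk header is well-formed XML with node elements laid out as in the excerpts below. Get-HANodePriority is a made-up helper name (not part of StarWindX), and the folder in the usage comment is a placeholder.

```powershell
# Hypothetical read-only helper (not part of StarWindX): list the priority
# recorded for each <node> in an already-loaded HA header document.
function Get-HANodePriority {
    param([Parameter(Mandatory)] [System.Xml.XmlDocument] $Header)
    foreach ($node in @($Header.SelectNodes('//node[parameters/priority]'))) {
        [pscustomobject]@{
            NodeId   = $node.GetAttribute('id')
            Name     = $node.GetAttribute('name').Trim()
            Priority = $node.SelectSingleNode('parameters/priority').InnerText.Trim()
        }
    }
}

# Example use. The folder is an assumption; point it at your own device folder.
# Get-ChildItem 'D:\StarWind' -Filter '*_HA.swdsk' | ForEach-Object {
#     $_.Name
#     Get-HANodePriority -Header ([xml](Get-Content -Raw -Path $_.FullName)) | Format-Table
# }
```

Since this only reads the files, it should be safe to run at any time, and it makes it easy to compare the headers on both hosts side by side.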

Here is how you can change a StarWind HA device priority.
1) Make sure that you shut down the StarWind VSAN services. Just go to Task Manager and disable everything related to StarWind.
2) Go to the device folder and open the corresponding *_HA.swdsk as a text file (say, with Notepad++ or Notepad). Your device name is vsan-02.
Find the section that looks like this:
<node id="2" name="iqn.2008-08.com.starwindsoftware:10.255.255.252-vsan-02" shut="false" active="true">
<storages>
<storage_ref id="2"/>
</storages>
<parameters>
<type>1</type>
<priority>1</priority> - OWNER NODE PARAMETER
<sync_status>1</sync_status>
</parameters>
</node>
<node id="3" name=" iqn.2008-08.com.starwindsoftware:<node name>-vsan-01" shut="false" active="true">
<storages>
<storage_ref id="3"/>
</storages>
<parameters>
<type>1</type>
<priority>0</priority> - PARTNER NODE PARAMETER
<sync_status>1</sync_status>
</parameters>
</node>
Change <priority>1</priority> to <priority>0</priority> (the owner node parameter) and <priority>0</priority> to <priority>1</priority> (the partner node parameter). Save the file. Enable all the StarWind services that you disabled during the preparation phase.
Now, go to the vsan-01 host (I think that is what the partner node is called).
1) Make sure that you shut down the StarWind VSAN services. Just go to Task Manager and disable all services related to StarWind.
2) Go to the device folder and open the corresponding *_HA.swdsk as a text file (say, with Notepad++ or Notepad). Your device name is vsan-01.
Find the section that looks like this:
<node id="2" name=" iqn.2008-08.com.starwindsoftware:<node name>-vsan-01" shut="false" active="true">
<storages>
<storage_ref id="2"/>
</storages>
<parameters>
<type>1</type>
<priority>0</priority> - OWNER NODE PARAMETER
<sync_status>1</sync_status>
</parameters>
</node>
<node id="3" name=" iqn.2008-08.com.starwindsoftware:10.255.255.252-vsan-02" shut="false" active="true">
<storages>
<storage_ref id="3"/>
</storages>
<parameters>
<type>1</type>
<priority>1</priority> - PARTNER NODE PARAMETER
<sync_status>1</sync_status>
</parameters>
</node>
Change <priority>0</priority> to <priority>1</priority> (the owner node parameter) and <priority>1</priority> to <priority>0</priority> (the partner node parameter). Save the file. Enable all the StarWind services and go to the StarWind Management Console to verify that the changes were applied successfully.

PLEASE DO THAT ONLY FOR THE WITNESS DEVICE!!! Once you make the change, the active primary (KMHV3) will report Synchronized 100% for all devices, while the secondary (KMHV4) will report 0% synchronized, as it is not the primary.

I also did a config review of your headers.
There is a misconfiguration: 10.4.34.x is used for both iSCSI and synchronization. Please remove it from the synchronization interfaces.
Could you also tell me what all the connections are for?
10.4.35.x, 10.4.34.x, and 10.4.36.x are used for synchronization, while 10.4.34.x is also used for iSCSI. We recommend using dedicated channels for each type of traffic.

Regarding the cluster failures, the logs show chkdsk running:
26212 kmHV4.kmsi.net 110020 Information "Chkdsk was executed in read-only mode on a volume snapshot.

Checking file system on E:
The type of the file system is NTFS.
Volume label is Local Data HV4.

WARNING! /F parameter not specified.
Running CHKDSK in read-only mode.

Stage 1: Examining basic file system structure ...
1536 file records processed.
File verification completed.
121 large file records processed.
0 bad file records processed.

Stage 2: Examining file name linkage ...
1582 index entries processed.
Index verification completed.
0 unindexed files scanned.
0 unindexed files recovered to lost and found.

Stage 3: Examining security descriptors ...
Security descriptor verification completed.
23 data files processed.

Windows has scanned the file system and found no problems.
No further action is required.

792439807 KB total disk space.
657198312 KB in 531 files.
268 KB in 25 indexes.
0 KB in bad sectors.
91675 KB in use by the system.
65536 KB occupied by the log file.
135149552 KB available on disk.

4096 bytes in each allocation unit.
198109951 total allocation units on disk.
33787388 allocation units available on disk.
" Chkdsk

and

1792 kmHV3.kmsi.net 1157178 Error "Cluster physical disk resource failed periodic health check.

Physical Disk resource name: Cluster Disk 2
Device Number: 4
Device Guid: {d5d6f8df-7081-8b74-e91e-118eb289f818}
Error Code: 1167
Additional reason: ClusDiskReportedFailure

If the reason is ReattachTimeout, it means attaching a new RHS process to the disk resource took too long.
If the reason is ClusDiskReportedFailure, it means the underlying disk device was removed from the system.
If the reason is QuorumResourceFailure, it means this is a Spaces quorum resource.
If the reason is VolumeNotHealthy, it means one of the volumes is not healthy and may need repair." Microsoft-Windows-FailoverClustering 10/19/2020 13:32

chkdsk brings the volume offline, so be careful with that one. Here is how to run the checks: https://www.starwindsoftware.com/blog/h ... rwind-vsan.
What I would recommend is to plan downtime and run chkdsk on the underlying storage and the HA devices.
wallewek
Fri Oct 23, 2020 9:55 pm

Thank you VERY much for that excellent detailed analysis, Yaroslav!

First, just so you know, the CSV storage on KMHV3 had already been identified as failing, and I have replacement storage sitting here waiting to be swapped in as soon as I have the cluster fully synchronized.

I'm not sure why the CHKDSK was running when I ran the snapshot. Perhaps timing? I do run Hard Disk Sentinel to keep track of drive health, which is why I knew I needed to replace the CSV drive on HV3. I have just rerun a quick CHKDSK on all drives on both servers while all cluster and StarWind services were stopped, and got zero errors. I will review your recommended procedure later.

Re Priorities:
I have now made the Priority changes as per your excellent directions, and resynch is proceeding. So far, however, the synchronized-0% reporting from PowerShell hasn't changed. For whatever that's worth.

Re NICs:
These servers each have 5 Intel Gigabit Ethernet interfaces, with four of them on a quad port NIC and one on the motherboard. All the ports on the quad NICs are dedicated to StarWind; the onboard NICs are used for host management, Internet, etc.

The quad NICs are arranged in four pairs across the two hosts, with each NIC on one host in a separate subnet corresponding to the same subnet on the other host. So:

Host 1 (KMHV3) <--> Host 2 (KMHV4)
10.4.27.81 <-switch-> 10.4.27.82 -- management
10.4.33.81 <-switch-> 10.4.33.82 -- StarWind
10.4.34.81 <-switch-> 10.4.34.82 -- StarWind
10.4.35.81 <-switch-> 10.4.35.82 -- StarWind
10.4.36.81 <-patch-> 10.4.36.82 -- StarWind

All of the Ethernet connections go through an HP Procurve managed switch except for the last, which is patched directly across between the servers. I do that because the switch gives me more diagnostic information than modern Intel NIC drivers will provide, like packet errors and such, and also it avoids some bogus error statuses.

Thank you for the catch on the NIC synch/iSCSI misconfiguration. I kind of suspected something like that, but couldn't figure out how to confirm it. However, I'm not sure how to do what you are recommending.

Are you suggesting that I change the NIC assignments directly in the *-HA.swdsk files? There's no other way I know of, as I don't have access to that configuration interface in the StarWind Management Console, and I'm not aware of any way to make such changes via PowerShell.

Looking at how the NICs are configured in that swdsk file, all 5 NICs are listed under the iSCSI transport section for storage id=3, and subnet 34 doesn't stand out as any different. This seems consistent for all four *-HA.swdsk files, btw.

If that's the case, how would I proceed? Is this documented anywhere?

--- kenw
yaroslav (staff)
Fri Oct 23, 2020 11:30 pm

You are always welcome.

You can just change that link to "control" in the HA device header.
1) Make sure that you shut down the StarWind VSAN services. Just go to Task Manager and disable all services related to StarWind.
2) Go to the device folder and open the corresponding *_HA.swdsk as a text file (say, with Notepad++ or Notepad). MAKE SURE TO MAKE A COPY OF THE EXISTING FILE.
3) Find this part:
<link id="1" type="data" priority="1" connections="1">
<peer ip="10.4.35.82" port="3260"/>
</link>
<link id="2" type="data" priority="1" connections="1">
<peer ip="10.4.34.82" port="3260"/>
</link>
<link id="3" type="control" priority="1" connections="1">
<peer ip="10.4.27.82" port="3260"/>
</link>
<link id="4" type="control" priority="1" connections="1">
<peer ip="10.4.33.82" port="3260"/>
</link>
<link id="5" type="data" priority="1" connections="1">
<peer ip="10.4.36.82" port="3260"/>
</link>
4) Change type="data" to type="control" for the 10.4.34.x link (id="2" in the excerpt above).
5) Do that for both HA devices.
6) Start the service.
7) Wait for the fast sync to finish.
8) Repeat the procedure on the other server.
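
For anyone who would rather script step 4 than hand-edit, here is a minimal sketch. It assumes the header is well-formed XML and that each link element holds a peer child with an ip attribute, exactly as in the excerpt above. Set-HALinkType and the file path are made-up names for illustration, not part of StarWindX. Run it only with the StarWind service stopped and after making a backup copy.

```powershell
# Hypothetical helper (not part of StarWindX): set the type attribute of every
# <link> whose <peer> has the given IP. Returns the number of links changed.
function Set-HALinkType {
    param(
        [Parameter(Mandatory)] [System.Xml.XmlDocument] $Header,
        [Parameter(Mandatory)] [string] $PeerIp,
        [ValidateSet('data', 'control')] [string] $Type = 'control'
    )
    # Collect the matches first, then modify them
    $links = @($Header.SelectNodes("//link[peer/@ip='$PeerIp']"))
    foreach ($link in $links) {
        $link.SetAttribute('type', $Type)
    }
    $links.Count
}

# Example use. The path is an assumption; point it at your own *_HA.swdsk.
# $path = 'D:\StarWind\vsan-01_HA.swdsk'
# Copy-Item $path "$path.bak"                 # keep a copy, as in step 2
# $doc = New-Object System.Xml.XmlDocument
# $doc.PreserveWhitespace = $true             # avoid re-indenting the file on save
# $doc.Load($path)
# Set-HALinkType -Header $doc -PeerIp '10.4.34.82'
# $doc.Save($path)
```

A return value of 0 means no link matched the IP, in which case the file should be left alone and checked by hand.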

Just go ahead and swap the disks once you are done with the configuration I've mentioned.
"synchronized-0% reporting from PowerShell hasn't changed."
The reporting has something to do with the priorities: Synchronized 0% is for the secondary, Synchronized 100% is for the primary.
There is a way to confirm the sync status: the application log. Collect the logs with StarWind Log Collector and confirm that the last event for each HA device is "Synchronized" (event ID is 773 -- see more about event IDs here https://www.starwindsoftware.com/help/E ... dVSAN.html).
wallewek
Sun Oct 25, 2020 7:54 pm

I made the link changes following your instructions. That worked very nicely, thank you! However, the zero percent synch readings didn't change afterward. I'm vaguely wondering whether that had to do with my shutting down both servers at the same time, triggering a full synch.

OK, the failing hard drive has now been replaced. I just shut down that server only, attached the drive, booted it up, stopped the VSAN service, swapped drive letters, copied the contents of the old drive to its replacement (Windows File Explorer works fine), and started the VSAN service. It synched up immediately. Nice.

Interestingly, the GetHASyncState.ps1 script now shows them both at 100%. I have no idea why that changed. Maybe Fast Synch does that?

I tried using the blog article you suggested https://www.starwindsoftware.com/blog/h ... rwind-vsan to check the CSV, but could not access it when the cluster was in maintenance mode (the cluster volume disappears), so I did it while the cluster was up, and forced dismount. Dirty, I know, but that worked, and reported zero errors.

Thanks VERY much for all your guidance and analysis. Things are looking up all over!

--- kenw
yaroslav (staff)
Mon Oct 26, 2020 4:18 am

Always here to help you. Yes, with img files and headers in place, there should be Fast Sync.
I am glad to know that all is working fine.
Do not hesitate to contact us if any assistance is required.
wallewek
Mon Oct 26, 2020 9:50 pm

Yaroslav, are there known conflicts or recommendations re: StarWind VSAN and Windows Backup on hosts? My VSAN went down abruptly (again), apparently right at the time a backup was scheduled to start. To be clear, I am NOT trying to back up VSAN images.

The symptom was that all iSCSI channels -- both synchronization and heartbeat -- were reported as down in the StarWind management console. Synch was totally lost, and the cluster was down. Restarting one host restored the iSCSI connections, but I had to force synched status on one host before it would start up again, going into full synch.

I've been digging through the event logs. This has happened several times lately. Not sure why, as I haven't changed much, backups included, for a long time. Maybe it's related to Microsoft updates. For the moment, I've disabled Windows Backup, and I'm digging into that. I did take diagnostic dumps and screenshots.

--- kenw