Starwind 6. Sync over 10 Gbit very slow

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

AAZ
Posts: 4
Joined: Thu Aug 11, 2016 12:43 pm

Wed Apr 10, 2019 10:41 am

Good afternoon.
The situation is as follows: iSCSI storage is built from two StarWind 6 hosts running Windows Server 2016, both deployed as virtual machines on ESXi 6.5. The server hardware is connected by 10 Gbit links. iperf on ESXi shows 10 Gbit, and iperf on the Windows 2016 hosts running StarWind also shows 10 Gbit. We rebooted one of the StarWind hosts and the synchronization was fast. Then we rebooted the second host, a full replication started, and it is now the second day at only 42%.
The link utilization peaks at about 2.2 Gbit.

There are two questions:

1. Why a full replication, when a differential one should have sufficed?
2. Why is it so slow? Why doesn't it use the whole bandwidth? What could be limiting it?
anton (staff)
Site Admin
Posts: 4008
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Apr 10, 2019 2:36 pm

Upgrade to the most recent version of StarWind. V6 went out of support years ago.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

AAZ
Posts: 4
Joined: Thu Aug 11, 2016 12:43 pm

Wed Apr 10, 2019 3:14 pm

Is there any way to correct the situation in the existing configuration, without changing the StarWind version?
Can I install the latest version of StarWind over v6?
Serhi
Posts: 21
Joined: Mon Mar 25, 2019 4:01 pm

Thu Apr 11, 2019 1:38 pm

AAZ wrote: Can I install the latest version of StarWind over v6?
Hello,
Yes, you can. Stop the StarWind service, run the installer, and start the StarWind service again afterwards.
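A rough sketch of those steps in PowerShell, assuming the service name StarWindService (as used later in this thread) and a placeholder path for the downloaded installer:

Code: Select all

# Stop the StarWind service before running the new installer
Stop-Service -Name StarWindService

# Run the installer and wait for it to finish (the path is a placeholder)
Start-Process -FilePath "C:\Temp\StarWind-setup.exe" -Wait

# Start the service again once the upgrade has completed
Start-Service -Name StarWindService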

BR, Serhi
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Fri Apr 12, 2019 3:45 pm

Hello AAZ,
Please find the link with instructions below:
https://knowledgebase.starwindsoftware. ... d-version/
ss4344
Posts: 7
Joined: Fri Apr 19, 2019 12:06 pm

Fri Apr 19, 2019 12:55 pm

What was the outcome of upgrading from v6 to v8? I'm curious, as I'm on the latest v8 and am seeing exactly the same full sync performance. It drives me nuts on every controlled power outage (or UPS test) where both nodes are shut down, hence the full sync time.

I've got twin 10 Gbps links, and each is only doing around 1.1 Gbps (~140 MBytes/sec per link) during the sync. Jumbo frames are on and working, verified able to sustain close to 10 Gbps. When running iSCSI over the same links with large-block 8 MB transfers, throughput maxes out at around 1.2 GBytes/sec per link, which is ~9.5 Gbps per channel from a StarWind target on the opposite node. Yet full sync is still relatively slow - almost a tenth of what is possible.

Messing with <MaxSyncQueueSize size="16"/> in the source and destination target configs to "32" or "64" had no effect, and likewise setting it to "8" and "2" had no effect. All of these resulted in ~1Gbps utilisation of the 10Gbps links.

I have a hyperconverged SSD config, with the raw drives in a RAID0 stripe maxing out at 1,477,037 KB/sec write and 1,371,568 KB/sec read according to ATTO Disk Benchmark. I see the same phenomenon on the HDDs that are also in the system, which are mirrored/striped and also perform well (just not as well as the SSDs for writes).
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Apr 19, 2019 2:07 pm

Which v8 build do you use? From build 11818 on, StarWind VSAN has had Maintenance Mode for HA devices, which helps avoid a full sync on controlled power-offs.
Also, you can have a look at https://www.starwindsoftware.com/resour ... powerchute and adjust the script as needed in case no APC is available.
ss4344
Posts: 7
Joined: Fri Apr 19, 2019 12:06 pm

Sat Apr 20, 2019 1:20 am

Boris (staff) wrote:From 11818 on, StarWind VSAN has got Maintenance Mode for the HA devices...
You're a good man to point that out. It will be implemented today! :D

Could the slow sync be due to the small number of processor cores in my setup, and hence too few worker threads? Would it be ill-advised to mess with those settings?
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Tue Apr 23, 2019 4:05 pm

Yes, that can be attributed to a low number of cores.
ss4344
Posts: 7
Joined: Fri Apr 19, 2019 12:06 pm

Tue Apr 23, 2019 4:22 pm

Boris (staff) wrote:...have a look at https://www.starwindsoftware.com/resour ... powerchute ...
Great script, and Starwind works perfectly with maintenance mode.

As for Hyper-V failover clusters and their behaviour when you're not using SCVMM and its maintenance mode, not so much :shock:. Several more hours will have to be wasted to get a graceful shutdown/restart, methinks, as StarWind's maintenance mode causes the cluster shared volumes to go offline, which triggers some interesting/bad/unwanted Hyper-V behaviour during shutdown. Glad you guys have your bit done right, and I hope I don't need to wait for Windows 2025 for MS to make it straightforward enough for this IT veteran to get right.

Many thanks.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Tue Apr 23, 2019 4:53 pm

Did the script work as expected for you?
You will most likely need to implement the VM shutdown part on your own, so that when you enable Maintenance Mode on the StarWind devices (with the script), all VMs are already powered off. If they are still running, the VMs will run into data integrity issues, as their storage will no longer be available.
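For example, something along these lines would do for the VM shutdown part (a bare sketch using the standard Hyper-V and Failover Clustering cmdlets; adapt it to your environment):

Code: Select all

# Shut down (or save) all VMs on the local node...
Get-VM | Stop-VM -Force

# ...then wait until every clustered VM resource is offline before touching StarWind
do { Start-Sleep 1 } while ((Get-ClusterResource | Where-Object {
        ($_.ResourceType -eq "Virtual Machine") -and ($_.State -ne "Offline")
    } | Measure-Object).Count -gt 0)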
ss4344
Posts: 7
Joined: Fri Apr 19, 2019 12:06 pm

Wed Apr 24, 2019 6:22 am

Boris (staff) wrote:Did the script work as expected for you?
Actually, something is not quite right with the StarWind bit.

I have gotten my scripts to the point where everything seems to work, but on restart the second node (I have never seen the first do this) shows that it is re-synchronising, and that takes quite a while.

Code: Select all

PS C:\Windows\system32> C:\Scripts\SW-SyncState.ps1
Device sync state, node 1:
iqn.2008-08.com.starwindsoftware:map59-n1-target3	Status: 1, completed 0%
Device sync state, node 2:
iqn.2008-08.com.starwindsoftware:map59-n2-target3	Status: 2, completed 5%

PS C:\Windows\system32>
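For reference, SW-SyncState.ps1 is just a small helper along these lines - a sketch reusing the StarWindX connection pattern from the scripts further down. The TargetName, SynchronizationStatus and SynchronizationPercent property names are assumptions, so check them against your StarWindX build, and the management IPs are placeholders:

Code: Select all

Import-Module StarWindX

# Management addresses of the two nodes - placeholders, adjust to your environment
$swNodes = @("192.168.10.21", "192.168.10.22")

$i = 1
foreach ($swNode in $swNodes) {
    Write-Host "Device sync state, node $($i):"
    $server = New-SWServer -host $swNode -port 3261 -user root -password starwind
    $server.Connect()
    foreach ($device in $server.Devices) {
        # Only HA devices have a synchronization state
        if ($device.Name -like "HAimage*") {
            # Property names below are assumptions - verify against your StarWindX version
            Write-Host "$($device.TargetName)`tStatus: $($device.SynchronizationStatus), completed $($device.SynchronizationPercent)%"
        }
    }
    $server.Disconnect()
    $i++
}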
Both nodes try to put the disks into maintenance mode, and maybe that's messing something up. I was fairly certain that one would win, and the other would simply return "already in maintenance mode".

UPS shutdown log on node 1 shows the lines:

Code: Select all

Putting Starwind in maintenance mode
HAImage3: Entered maintenance mode
imagefile3: Not an HA device
UPS shutdown log on node 2 shows the lines:

Code: Select all

Putting Starwind in maintenance mode
Operation cannot be completed. Maintenance mode is already turned on.
Startup log on node 1 shows:

Code: Select all

Taking Starwind out of maintenance mode
HAImage3: Operation cannot be completed. Maintenance mode is already turned off.
imagefile3: Not an HA device
Startup log on node 2 ALSO shows:

Code: Select all

Taking Starwind out of maintenance mode
HAImage3: Operation cannot be completed. Maintenance mode is already turned off.
imagefile3: Not an HA device
In the application event log on node 1 (which won the maintenance mode fight), I see the maintenance mode operation is rolled back:

Code: Select all

Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: maintenance mode is turning ON...
...Two seconds later all of these events together at same timestamp:
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: synchronization connection IP 192.168.20.2 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: heartbeat connection IP 192.168.40.2 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: synchronization connection IP 192.168.21.2 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: all synchronization connection with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 state has changed to "Not synchronized".
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: heartbeat connection IP 192.168.10.22 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: all heartbeat connection with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: maintenance mode operation is rolled back for current node.
Three seconds later...
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: current node state has changed to "Not synchronized".
Information: Service is stopped - StarWind Virtual SAN v8.0.0 (Build 12767, [SwSAN], Win64).
I thought it was working OK while developing the script. I recall that maintenance mode worked well at the time when my shutdown had issues: the Windows Cluster service would hang on shutdown because the Cluster Shared Volumes were not offline, and I was killing power to that node. Now that I explicitly offline the CSVs, the node shuts down by itself correctly, but this roll-back issue has now cropped up.

Script details...

For shutdown:
  • I needed to set a single owner node for each VM (to stop the cluster trying to migrate any VMs)
  • I had to coordinate script progression to ensure both nodes were at roughly the same stage
  • I needed to offline the CSVs (or the cluster would hang on last node to shut down)
  • I needed to set the cluster and virtual machine management services to manual start
  • Then finally set maintenance mode, schedule the startup task and power down
For startup, basically reverse the process.
  • Add an extra step to wait until Starwind services are running on both nodes
  • Set the cluster and vmm services to automatic and start them
The reason for messing with cluster & vmm services is to support saving the VM state instead of shutting down guests. Without these steps my cluster had a mind of its own in regard to startup sequencing, and ended up failing saved VMs faster than the storage could become available. By starting the servers with these disabled, the script can get the storage online and then start the cluster. The commands to explicitly start the VMs (which would happen anyway for saved VMs) are there in case guest "shutdown" is used, and not "save".

Lines 12/13 of the UPS-Shutdown-Node script determine which shutdown method is used - shutdown or save - with the undesired option commented out.

This is on Windows Server 2019.

Thanks again for pointing out this approach, Boris.
Steve

(UPDATE: The scripts below have been updated, and replaced by those in post 18 of this thread...)

C:\Scripts\UPS-Shutdown-Node.ps1
Executed with Eaton IPP:
cmd.exe /c "%SystemRoot%\sysnative\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy Bypass -NoProfile -File c:\scripts\UPS-Shutdown-Node.ps1 > c:\scripts\UPS-Shutdown-Node.log 2>&1"

Code: Select all

$NodeFile = "C:\Scripts\nodes.tmp"

$NodeName = (Get-WmiObject win32_computersystem).DNSHostName+"."+(Get-WmiObject win32_computersystem).Domain

# Save the cluster node names for the ExitMaintenance script to use later
$nodes = (Get-ClusterNode).Name
$nodes | Set-Content -Path $NodeFile

# Shut down VMs on this node, and prevent live migration by setting the owner node to just this node
Write-Host "Stopping $((Get-VM).Count) virtual machines"
$VMs = (Get-VM).Name
Get-VM | Stop-VM -Save -AsJob > $null 2>&1
#Get-VM | Stop-VM -AsJob > $null 2>&1
Get-ClusterResource | ? {($VMs -contains $_.OwnerGroup) -and ($_.ResourceType -eq "Virtual Machine")}  | Set-ClusterOwnerNode -Owners (Get-WmiObject win32_computersystem).DNSHostName

# Wait until all virtual machine resources on all cluster nodes are offline
Write-Host "Waiting for all virtual machines to stop"
do { sleep 1 } while (((Get-ClusterResource | ? {($_.ResourceType -eq "Virtual Machine") -and ($_.State -ne "Offline")}) | measure).Count -gt 0)

# take all cluster shared volumes offline
Write-Host "Taking cluster shared volumes offline"
do {
    Sleep 1
    $OnlineCSVs = Get-ClusterSharedVolume | ? {$_.State -eq 'Online'}
    foreach ($OnlineCSV in $OnlineCSVs) { Stop-ClusterResource -Name $OnlineCSV.Name -ErrorAction SilentlyContinue }
} until (($OnlineCSVs | measure).Count -eq 0)

#Exit-PSSession

Import-Module StarWindX

# Set Starwind devices in maintenance mode
Write-Host "Putting Starwind in maintenance mode"
try {
    $server = New-SWServer -host $NodeName -port 3261 -user root -password starwind
    $server.Connect()
    foreach($device in $server.Devices) {
        if( !$device ) {
            Write-Host "No device found" -foreground red
            return
        } else {
            $disk = $device.Name
            if ($device.Name -like "HAimage*") {
                $device.SwitchMaintenanceMode($true, $true)
                Write-Host "$($disk): Entered maintenance mode"
            } else {
                Write-Host "$($disk): Not an HA device"
            }
        }
    }
}
 
catch {
    Write-Host $_ -foreground red
}
 
finally {
    $server.Disconnect()
}
 
# Create a scheduled task to disable maintenance mode on startup
Write-Host "Creating scheduled task to exit maintenance mode"
try {
    $action = New-ScheduledTaskAction -Execute "Powershell.exe" -Argument '-command "Powershell -ExecutionPolicy Bypass -NoProfile -File C:\Scripts\SW-ExitMaintenance.ps1 > C:\Scripts\SW-ExitMaintenance.log 2>&1"'
    $trigger = New-ScheduledTaskTrigger -AtStartup -RandomDelay 00:00:30
    $settings = New-ScheduledTaskSettingsSet -Compatibility Win8
    $principal = New-ScheduledTaskPrincipal -UserId SYSTEM -LogonType ServiceAccount -RunLevel Highest
    $definition = New-ScheduledTask -Action $action -Principal $principal -Trigger $trigger -Settings $settings -Description "Exit maintenance mode for Starwind HA devices"
    Register-ScheduledTask -TaskName "Maintenance Mode Off" -InputObject $definition > $null 2>&1
}

catch {
    Write-Host $_ -foreground red
}

# Set the Cluster and Virtual Machine Management services to manual
Write-Host "Setting cluster and vmms services to manual startup"
Get-Service -Name vmms | Set-Service -StartupType Manual
Get-Service -Name ClusSvc | Set-Service -StartupType Manual

# Shut down the node
Write-Host "Stopping cluster node"
Stop-Computer -Force
C:\Scripts\SW-ExitMaintenance.ps1

Code: Select all

$NodeFile = "C:\Scripts\nodes.tmp"
$ServiceTimeout = 120

$NodeName = (Get-WmiObject win32_computersystem).DNSHostName+"."+(Get-WmiObject win32_computersystem).Domain

# Get the cluster node names that were saved by the UPS-Shutdown-Node script (can't use Get-ClusterNode as cluster not started yet)
$nodes = Get-Content -Path $NodeFile

Import-Module StarWindX

$ServiceName = "StarWindService"
do {
    sleep 1
    $s1 = Get-Service -ComputerName $nodes[0] -Name $ServiceName -ErrorAction SilentlyContinue
    $s2 = Get-Service -ComputerName $nodes[1] -Name $ServiceName -ErrorAction SilentlyContinue
} until (($s1.Status -eq "Running") -and ($s2.Status -eq "Running"))

Write-Host "$ServiceName is running on both nodes"
Start-Sleep -Milliseconds (Get-Random -Maximum 5000)

# Take Starwind devices out of maintenance mode
Write-Host "Taking Starwind out of maintenance mode"
try {
    $server = New-SWServer -host $NodeName -port 3261 -user root -password starwind
    $server.Connect()

    foreach ($device in $server.Devices) {
        if (!$device) {
            Write-Host "No device found" -foreground red
            return
        } else {
            $disk = $device.Name
            if ($device.Name -like "HAimage*") {
                try {
                    $device.SwitchMaintenanceMode($false, $true)
                    Write-Host "$($disk): Exited maintenance mode"
                }
                catch {
                    Write-Host "$($disk): $($_)"
                }
            } else {
                Write-Host "$($disk): Not an HA device"
            }
        }
    }
}
 
catch {
    Write-Host $_ -foreground red
}
 
finally {
    $server.Disconnect()
}

# Set the Cluster and Virtual Machine Management services to automatic, and start the services
Write-Host "Setting cluster and vmms services to manual startup"
Get-Service -Name vmms | Set-Service -StartupType Automatic
Get-Service -Name vmms | Start-Service
Get-Service -Name ClusSvc | Set-Service -StartupType Automatic
Get-Service -Name ClusSvc | Start-Service
$c1 = 0
$c2 = 0
$c3 = 0
$c4 = 0
do {
    $s1 = (Get-Service -ComputerName $nodes[0] -Name vmms).Status
    $s2 = (Get-Service -ComputerName $nodes[0] -Name ClusSvc).Status
    $s3 = (Get-Service -ComputerName $nodes[1] -Name vmms).Status
    $s4 = (Get-Service -ComputerName $nodes[1] -Name ClusSvc).Status
    if ($s1 -ne "Running") { $c1 += 1 } else { Write-Host "$($nodes[0]): Virtual Machine Management service running after $($c1) seconds" }
    if ($s2 -ne "Running") { $c2 += 1 } else { Write-Host "$($nodes[0]): Cluster service running after $($c2) seconds" }
    if ($s3 -ne "Running") { $c3 += 1 } else { Write-Host "$($nodes[1]): Virtual Machine Management service running after $($c3) seconds" }
    if ($s4 -ne "Running") { $c4 += 1 } else { Write-Host "$($nodes[1]): Cluster service running after $($c4) seconds" }
    if (($s1 -ne "Running") -or ($s2 -ne "Running") -or ($s3 -ne "Running") -or ($s4 -ne "Running")) { Sleep 1 }
} until ((($s1 -eq "Running") -and ($s2 -eq "Running") -and ($s3 -eq "Running") -and ($s4 -eq "Running")) -or ($c1 -gt $ServiceTimeout) -or ($c2 -gt $ServiceTimeout) -or ($c3 -gt $ServiceTimeout) -or ($c4 -gt $ServiceTimeout))
if (($c1 -gt $ServiceTimeout) -or ($c2 -gt $ServiceTimeout) -or ($c3 -gt $ServiceTimeout) -or ($c4 -gt $ServiceTimeout)) {
    Write-Host "A service failed to start"
    exit
}

# Make sure all CSVs are online
Write-Host "Waiting for all cluster shared volumes online"
do {
    Sleep 1
    $OfflineCSVs = Get-ClusterSharedVolume | ? {$_.State -ne 'Online'}
    foreach ($OfflineCSV in $OfflineCSVs) { Start-ClusterResource -Name $OfflineCSV.Name -ErrorAction SilentlyContinue > $null 2>&1 }
} until (($OfflineCSVs | measure).Count -eq 0)

# Small pause to ensure cluster recognises that cluster shared volumes are up
Sleep 15

# Start the VMs on all nodes, and set any node to be the owner
Write-Host "Starting virtual machines"
Get-ClusterResource | ? {($_.ResourceType -eq "Virtual Machine") -and ($_.State -eq "Offline")} | Start-ClusterResource

Write-Host "Setting virtual machine possible owners to all nodes"
Get-ClusterResource | ? {$_.ResourceType -eq "Virtual Machine"} | Set-ClusterOwnerNode -Owners (Get-ClusterNode).Name

Write-Host "Unregistering scheduled task"
Unregister-ScheduledTask -TaskName "Maintenance Mode Off" -Confirm:$false -ErrorAction SilentlyContinue
Last edited by ss4344 on Sat Apr 27, 2019 4:46 am, edited 2 times in total.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Thu Apr 25, 2019 3:18 pm

1. You need to put a device into maintenance mode on one node only. As it is replicated, it will enter that state on both nodes.
2. On power-on, a fast synchronization is expected, but it should not last long - a matter of seconds, not minutes. Could you report your observations?
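In the script, one way to honour that is to let only a designated node make the call, e.g. (a sketch only, reusing the same StarWindX calls as the UPS-Shutdown-Node script above; the node name is a placeholder):

Code: Select all

Import-Module StarWindX

# Only the designated node enables maintenance mode; the partner node skips the call.
# "MAP59-N1" is a placeholder for whichever node you choose to do it.
$MaintenanceNode = "MAP59-N1"
$NodeName = (Get-WmiObject win32_computersystem).DNSHostName+"."+(Get-WmiObject win32_computersystem).Domain

if ((Get-WmiObject win32_computersystem).DNSHostName -eq $MaintenanceNode) {
    $server = New-SWServer -host $NodeName -port 3261 -user root -password starwind
    $server.Connect()
    foreach ($device in $server.Devices) {
        if ($device.Name -like "HAimage*") {
            # Same call and arguments as in the UPS-Shutdown-Node script earlier in the thread
            $device.SwitchMaintenanceMode($true, $true)
        }
    }
    $server.Disconnect()
} else {
    Write-Host "Skipping maintenance mode; $MaintenanceNode will enable it for both nodes"
}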
ss4344
Posts: 7
Joined: Fri Apr 19, 2019 12:06 pm

Thu Apr 25, 2019 10:11 pm

My observation is that a full sync occurs after both nodes boot because maintenance mode (which had been set successfully according to the logs) is being rolled back just prior to power-off for some reason. I cannot see an obvious reason for that roll-back.

Could it be related to sending the cluster shared volumes offline? I would think not.

In the logs, all heartbeat network links came down just prior to the roll-back, indicating that the partner node must have reached a power-off state before the node that successfully set maintenance mode. Should the node that sets maintenance mode be the one to reach power-off before its partner? I would think it should not matter, but it might.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Thu Apr 25, 2019 10:26 pm

Maintenance mode cannot be rolled back automatically unless it is triggered by the script. Taking the CSVs offline cannot influence this in any way. When you check the StarWind logs / Windows Application log, do you see any entries about maintenance mode being turned off for the devices prior to server shutdown?
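If it helps, a quick way to pull those entries out of the Windows Application log is something like this (a sketch; the text filter is just a broad match on the event message, adjust as needed):

Code: Select all

# List recent Application log entries that mention maintenance mode, oldest first
Get-WinEvent -LogName Application -MaxEvents 2000 |
    Where-Object { $_.Message -match 'maintenance mode' } |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, LevelDisplayName, ProviderName, Message |
    Format-List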