Boris (staff) wrote: Did the script work as expected for you?
Actually, something is not quite right with the Starwind bit.
I have gotten my scripts to the point of everything seeming to work, but on re-start the second node (never seen the first do this) shows that it is re-synchronising, and takes quite a while.
Code: Select all
PS C:\Windows\system32> C:\Scripts\SW-SyncState.ps1
Device sync state, node 1:
iqn.2008-08.com.starwindsoftware:map59-n1-target3 Status: 1, completed 0%
Device sync state, node 2:
iqn.2008-08.com.starwindsoftware:map59-n2-target3 Status: 2, completed 5%
PS C:\Windows\system32>
Both nodes try to put the disks into maintenance mode, and maybe that's messing something up. I was fairly certain that one would win, and the other would simply return "already in maintenance mode".
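One way to avoid both nodes issuing the maintenance request at the same time would be to let only one node make the call. The following is a minimal sketch, assuming both nodes see the same list of node names (for example the one saved in nodes.tmp). The function name Test-IsMaintenanceCoordinator is hypothetical and not part of StarWindX, and whether this avoids the roll-back is untested.

```powershell
# Hypothetical helper: returns $true on exactly one node, the one whose
# short host name sorts first in the node list.
function Test-IsMaintenanceCoordinator {
    param(
        [Parameter(Mandatory)] [string[]] $Nodes,
        [Parameter(Mandatory)] [string]   $LocalNode
    )
    # Compare short host names only, so FQDNs and mixed case still match
    $first = $Nodes |
        ForEach-Object { ($_ -split '\.')[0].ToLowerInvariant() } |
        Sort-Object |
        Select-Object -First 1
    $local = ($LocalNode -split '\.')[0].ToLowerInvariant()
    return ($local -eq $first)
}

# Possible use in the shutdown script (not run here):
#   $nodes = Get-Content -Path 'C:\Scripts\nodes.tmp'
#   if (Test-IsMaintenanceCoordinator -Nodes $nodes -LocalNode $env:COMPUTERNAME) {
#       # ...call SwitchMaintenanceMode on this node only...
#   }
```

The other node would then skip the maintenance call entirely rather than relying on the "already turned on" response.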
UPS shutdown log on node 1 shows the lines:
Code: Select all
Putting Starwind in maintenance mode
HAImage3: Entered maintenance mode
imagefile3: Not an HA device
UPS shutdown log on node 2 shows the lines:
Code: Select all
Putting Starwind in maintenance mode
Operation cannot be completed. Maintenance mode is already turned on.
Startup log on node 1 shows:
Code: Select all
Taking Starwind out of maintenance mode
HAImage3: Operation cannot be completed. Maintenance mode is already turned off.
imagefile3: Not an HA device
Startup log on node 2 ALSO shows:
Code: Select all
Taking Starwind out of maintenance mode
HAImage3: Operation cannot be completed. Maintenance mode is already turned off.
imagefile3: Not an HA device
In the application event log on node 1 (which won the maintenance mode fight), I see the maintenance mode operation is rolled back:
Code: Select all
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: maintenance mode is turning ON...
...Two seconds later all of these events together at same timestamp:
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: synchronization connection IP 192.168.20.2 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: heartbeat connection IP 192.168.40.2 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: synchronization connection IP 192.168.21.2 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: all synchronization connection with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 state has changed to "Not synchronized".
Warning: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: heartbeat connection IP 192.168.10.22 with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: all heartbeat connection with partner node iqn.2008-08.com.starwindsoftware:map59-n2-target3 lost.
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: maintenance mode operation is rolled back for current node.
Three seconds later...
Error: HA Device iqn.2008-08.com.starwindsoftware:map59-n1-target3: current node state has changed to "Not synchronized".
Information: Service is stopped - StarWind Virtual SAN v8.0.0 (Build 12767, [SwSAN], Win64).
I thought it was working OK while developing the script. I recall that maintenance mode worked well at the time when my shutdown had issues: the Windows Cluster service would hang while shutting down because the Cluster Shared Volumes were not offline, so I was killing power to that node. Now that I explicitly offline the CSVs, it shuts down by itself correctly, but this roll-back issue has now cropped up.
Script details...
For shutdown:
- I needed to set a single owner node for each VM (to stop the cluster trying to migrate any VMs)
- I had to coordinate script progression to ensure both nodes were at roughly the same stage
- I needed to offline the CSVs (or the cluster would hang on last node to shut down)
- I needed to set the cluster and virtual machine management services to manual start
- Then finally set maintenance mode, schedule the startup task and power down
For startup, basically reverse the process.
- Add an extra step to wait until Starwind services are running on both nodes
- Set the cluster and vmms services to automatic and start them
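The waiting steps in both lists (waiting for the other node, for services, for CSVs) all follow the same poll-until-true pattern. A minimal sketch of that pattern with a timeout is shown here. The function name Wait-ForCondition and its parameters are hypothetical, and this is an illustration rather than what the scripts below actually use.

```powershell
# Hypothetical helper: polls a condition until it is true or a timeout expires.
# Returns $true if the condition was met, $false if the timeout was reached.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSeconds = 120,
        [int] $PollSeconds = 1
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while (-not (& $Condition)) {
        if ((Get-Date) -ge $deadline) { return $false }
        Start-Sleep -Seconds $PollSeconds
    }
    return $true
}

# Possible use (not run here), waiting for the local StarWind service:
#   $ok = Wait-ForCondition -TimeoutSeconds 120 -Condition {
#       (Get-Service -Name StarWindService -ErrorAction SilentlyContinue).Status -eq 'Running'
#   }
#   if (-not $ok) { Write-Host "StarWindService did not start in time" }
```

A timeout keeps a node from looping forever if its partner never comes up, at the cost of having to decide what to do when the wait fails.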
The reason for messing with the cluster and vmms services is to support saving the VM state instead of shutting down guests. Without these steps my cluster had a mind of its own in regard to startup sequencing, and ended up failing saved VMs faster than the storage could become available. By starting the servers with these services set to manual, the script can get the storage online and then start the cluster. The commands to explicitly start the VMs (which would happen anyway for saved VMs) are there in case guest "shutdown" is used, and not "save".
Lines 12/13 of the UPS-Shutdown-Node script determine which shutdown method is used - shutdown or save - with the undesired option commented out.
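Instead of commenting one line in and the other out, the choice could be driven by a single setting. A minimal sketch, assuming the Hyper-V Stop-VM cmdlet is called with splatted arguments; the function name Get-StopVMArguments is hypothetical.

```powershell
# Hypothetical helper: builds the argument table for Stop-VM.
# 'Save' adds -Save; 'Shutdown' leaves it off so the guest is shut down.
function Get-StopVMArguments {
    param(
        [ValidateSet('Save', 'Shutdown')]
        [string] $Method = 'Save'
    )
    $arguments = @{ AsJob = $true }
    if ($Method -eq 'Save') { $arguments['Save'] = $true }
    return $arguments
}

# Possible use on a Hyper-V host (not run here):
#   $stopArgs = Get-StopVMArguments -Method Save
#   Get-VM | Stop-VM @stopArgs > $null 2>&1
```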
This is on Windows Server 2019.
Thanks again for pointing out this approach, Boris.
Steve
(UPDATE: The scripts below have been updated, and replaced by those in post 18 of this thread...)
C:\Scripts\UPS-Shutdown-Node.ps1
Executed with Eaton IPP:
cmd.exe /c "%SystemRoot%\sysnative\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy Bypass -NoProfile -File c:\scripts\UPS-Shutdown-Node.ps1 > c:\scripts\UPS-Shutdown-Node.log 2>&1"
Code: Select all
$NodeFile = "C:\Scripts\nodes.tmp"
$NodeName = (Get-WmiObject win32_computersystem).DNSHostName+"."+(Get-WmiObject win32_computersystem).Domain
# Save the cluster node names for the ExitMaintenance script to use later
$nodes = (Get-ClusterNode).Name
$nodes | Set-Content -Path $NodeFile
# Shut down VMs on this node, and prevent live migration by setting the owner node to just this node
Write-Host "Stopping $((Get-VM).Count) virtual machines"
$VMs = (Get-VM).Name
Get-VM | Stop-VM -Save -AsJob > $null 2>&1
#Get-VM | Stop-VM -AsJob > $null 2>&1
Get-ClusterResource | ? {($VMs -contains $_.OwnerGroup) -and ($_.ResourceType -eq "Virtual Machine")} | Set-ClusterOwnerNode -Owners (Get-WmiObject win32_computersystem).DNSHostName
# Wait until all virtual machine resources on all cluster nodes are offline
Write-Host "Waiting for all virtual machines to stop"
do { sleep 1 } while (((Get-ClusterResource | ? {($_.ResourceType -eq "Virtual Machine") -and ($_.State -ne "Offline")}) | measure).Count -gt 0)
# take all cluster shared volumes offline
Write-Host "Taking cluster shared volumes offline"
do {
    Sleep 1
    $OnlineCSVs = Get-ClusterSharedVolume | ? {$_.State -eq 'Online'}
    foreach ($OnlineCSV in $OnlineCSVs) { Stop-ClusterResource -Name $OnlineCSV.Name -ErrorAction SilentlyContinue }
} until (($OnlineCSVs | measure).Count -eq 0)
#Exit-PSSession
Import-Module StarWindX
# Set Starwind devices in maintenance mode
Write-Host "Putting Starwind in maintenance mode"
try {
    $server = New-SWServer -host $NodeName -port 3261 -user root -password starwind
    $server.Connect()
    foreach($device in $server.Devices) {
        if( !$device ) {
            Write-Host "No device found" -foreground red
            return
        } else {
            $disk = $device.Name
            if ($device.Name -like "HAimage*") {
                $device.SwitchMaintenanceMode($true, $true)
                Write-Host "$($disk): Entered maintenance mode"
            } else {
                Write-Host "$($disk): Not an HA device"
            }
        }
    }
}
catch {
    Write-Host $_ -foreground red
}
finally {
    # Only disconnect if the server object was actually created
    if ($server) { $server.Disconnect() }
}
# Create a scheduled task to disable maintenance mode on startup
Write-Host "Creating scheduled task to exit maintenance mode"
try {
    $action = New-ScheduledTaskAction -Execute "Powershell.exe" -Argument '-command "Powershell -ExecutionPolicy Bypass -NoProfile -File C:\Scripts\SW-ExitMaintenance.ps1 > C:\Scripts\SW-ExitMaintenance.log 2>&1"'
    $trigger = New-ScheduledTaskTrigger -AtStartup -RandomDelay 00:00:30
    $settings = New-ScheduledTaskSettingsSet -Compatibility Win8
    $principal = New-ScheduledTaskPrincipal -UserId SYSTEM -LogonType ServiceAccount -RunLevel Highest
    $definition = New-ScheduledTask -Action $action -Principal $principal -Trigger $trigger -Settings $settings -Description "Exit maintenance mode for Starwind HA devices"
    Register-ScheduledTask -TaskName "Maintenance Mode Off" -InputObject $definition > $null 2>&1
}
catch {
    Write-Host $_ -foreground red
}
# Set the Cluster and Virtual Machine Management services to manual
Write-Host "Setting cluster and vmms services to manual startup"
Get-Service -Name vmms | Set-Service -StartupType Manual
Get-Service -Name ClusSvc | Set-Service -StartupType Manual
# Shut down the node
Write-Host "Stopping cluster node"
Stop-Computer -Force
C:\Scripts\SW-ExitMaintenance.ps1
Code: Select all
$NodeFile = "C:\Scripts\nodes.tmp"
$ServiceTimeout = 120
$NodeName = (Get-WmiObject win32_computersystem).DNSHostName+"."+(Get-WmiObject win32_computersystem).Domain
# Get the cluster node names that were saved by the UPS-Shutdown-Node script (can't use Get-ClusterNode as cluster not started yet)
$nodes = Get-Content -Path $NodeFile
Import-Module StarWindX
$ServiceName = "StarWindService"
do {
    sleep 1
    $s1 = Get-Service -ComputerName $nodes[0] -Name $ServiceName -ErrorAction SilentlyContinue
    $s2 = Get-Service -ComputerName $nodes[1] -Name $ServiceName -ErrorAction SilentlyContinue
} until (($s1.Status -eq "Running") -and ($s2.Status -eq "Running"))
Write-Host "$ServiceName is running on both nodes"
Start-Sleep -Milliseconds (Get-Random -Maximum 5000)
# Take Starwind devices out of maintenance mode
Write-Host "Taking Starwind out of maintenance mode"
try {
    $server = New-SWServer -host $NodeName -port 3261 -user root -password starwind
    $server.Connect()
    foreach ($device in $server.Devices) {
        if (!$device) {
            Write-Host "No device found" -foreground red
            return
        } else {
            $disk = $device.Name
            if ($device.Name -like "HAimage*") {
                try {
                    $device.SwitchMaintenanceMode($false, $true)
                    Write-Host "$($disk): Exited maintenance mode"
                }
                catch {
                    Write-Host "$($disk): $($_)"
                }
            } else {
                Write-Host "$($disk): Not an HA device"
            }
        }
    }
}
catch {
    Write-Host $_ -foreground red
}
finally {
    # Only disconnect if the server object was actually created
    if ($server) { $server.Disconnect() }
}
# Set the Cluster and Virtual Machine Management services to automatic, and start the services
Write-Host "Setting cluster and vmms services to automatic startup"
Get-Service -Name vmms | Set-Service -StartupType Automatic
Get-Service -Name vmms | Start-Service
Get-Service -Name ClusSvc | Set-Service -StartupType Automatic
Get-Service -Name ClusSvc | Start-Service
$c1 = 0
$c2 = 0
$c3 = 0
$c4 = 0
do {
    $s1 = (Get-Service -ComputerName $nodes[0] -Name vmms).Status
    $s2 = (Get-Service -ComputerName $nodes[0] -Name ClusSvc).Status
    $s3 = (Get-Service -ComputerName $nodes[1] -Name vmms).Status
    $s4 = (Get-Service -ComputerName $nodes[1] -Name ClusSvc).Status
    if ($s1 -ne "Running") { $c1 += 1 } else { Write-Host "$($nodes[0]): Virtual Machine Management service running after $($c1) seconds" }
    if ($s2 -ne "Running") { $c2 += 1 } else { Write-Host "$($nodes[0]): Cluster service running after $($c2) seconds" }
    if ($s3 -ne "Running") { $c3 += 1 } else { Write-Host "$($nodes[1]): Virtual Machine Management service running after $($c3) seconds" }
    if ($s4 -ne "Running") { $c4 += 1 } else { Write-Host "$($nodes[1]): Cluster service running after $($c4) seconds" }
    if (($s1 -ne "Running") -or ($s2 -ne "Running") -or ($s3 -ne "Running") -or ($s4 -ne "Running")) { Sleep 1 }
} until ((($s1 -eq "Running") -and ($s2 -eq "Running") -and ($s3 -eq "Running") -and ($s4 -eq "Running")) -or ($c1 -gt $ServiceTimeout) -or ($c2 -gt $ServiceTimeout) -or ($c3 -gt $ServiceTimeout) -or ($c4 -gt $ServiceTimeout))
if (($c1 -gt $ServiceTimeout) -or ($c2 -gt $ServiceTimeout) -or ($c3 -gt $ServiceTimeout) -or ($c4 -gt $ServiceTimeout)) {
    Write-Host "A service failed to start"
    exit
}
# Make sure all CSVs are online
Write-Host "Waiting for all cluster shared volumes online"
do {
    Sleep 1
    $OfflineCSVs = Get-ClusterSharedVolume | ? {$_.State -ne 'Online'}
    foreach ($OfflineCSV in $OfflineCSVs) { Start-ClusterResource -Name $OfflineCSV.Name -ErrorAction SilentlyContinue > $null 2>&1 }
} until (($OfflineCSVs | measure).Count -eq 0)
# Small pause to ensure cluster recognises that cluster shared volumes are up
Sleep 15
# Start the VMs on all nodes, and set any node to be the owner
Write-Host "Starting virtual machines"
Get-ClusterResource | ? {($_.ResourceType -eq "Virtual Machine") -and ($_.State -eq "Offline")} | Start-ClusterResource
Write-Host "Setting virtual machine possible owners to all nodes"
# Apply to every VM resource: those started above are no longer Offline, so filtering on state would skip them
Get-ClusterResource | ? {$_.ResourceType -eq "Virtual Machine"} | Set-ClusterOwnerNode -Owners (Get-ClusterNode).Name
Write-Host "Unregistering scheduled task"
Unregister-ScheduledTask -TaskName "Maintenance Mode Off" -Confirm:$false -ErrorAction SilentlyContinue