VMotion is not working / Sync Problems

umg
Posts: 4
Joined: Tue Apr 14, 2020 1:30 pm

Tue Apr 14, 2020 2:47 pm

Hey everybody,

I started to use StarWind VSAN in a 2-node HA setup. At first everything seemed fine: I deployed the first VMs and it was looking good, but then I ran into problems.
When the cluster is in a synchronized state, I cannot Storage vMotion a VM to the StarWind datastore.

In the log file I got the following messages:

Code: Select all

4/14 15:13:57.227060 182c HA: CDistributedCriticalSection::PushCommandFromSharer: (CSid = 0xCCC5) sharerID = 0xD1, cmdID = 1, result = BUSY (section is busy now).
4/14 15:13:57.227330 11ac HA: CDistributedCriticalSection::EnterOnTheDistributeSharersSide: (CSid = 0xCCC5) ENTER command on dist sharer(Id = 0xD1) can't be complete. Sharer return BUSY.
4/14 15:13:57.227381 11ac HA: CDistributedCriticalSection::Enter: (CSid = 0xCCC5) retry #1 after sleep 189 ms...

On the other hand, I have no problem deploying a VM from a template to the StarWind datastore.

The next problem appeared when I ran some stability tests and shut down one node. After the host came back, I had to start the synchronization manually with the HASyncDevice script.
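
For reference, the manual synchronization boils down to a few StarWindX calls. Below is a minimal sketch along the lines of StarWind's SyncHaDevice.ps1 sample; the device name, the default credentials, and the Synchronize() call are assumptions here, so check the sample shipped with your build for the exact parameters.

Code: Select all

param($addr="10.x.31.10", $port=3261, $user="root", $password="starwind", $deviceName="HAImage1")

Import-Module StarWindX

try
{
	# connect to the local StarWind service
	$server = New-SWServer -host $addr -port $port -user $user -password $password
	$server.Connect()

	# look up the HA device by name (placeholder name above)
	$device = Get-Device $server -name $deviceName
	if (!$device)
	{
		Write-Host "Device $deviceName not found" -foreground red
		return
	}

	# request synchronization (method name assumed from the SyncHaDevice.ps1 sample)
	$device.Synchronize()
}
catch
{
	Write-Host $_ -foreground red
}
finally
{
	$server.Disconnect()
}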

Any suggestions on what is going wrong?
For installation I used the CreateHA_2.ps1 script:

Code: Select all

param($addr="10.x.31.10", $port=3261, $user="root", $password="starwind",
	$addr2="10.x.31.11", $port2=$port, $user2=$user, $password2=$password,
#common
	$initMethod="Clear",
	$size=10485760,
	$sectorSize=512,
	$failover=0,
#primary node
	$imagePath="My computer\E\StoragePool",
	$imageName="masterImg21",
	$createImage=$true,
	$storageName="",
	$targetAlias="targetha21",
	$autoSynch=$true,
	$poolName="pool1",
	$syncSessionCount=1,
	$aluaOptimized=$true,
	$cacheMode="none",
	$cacheSize=0,
	$syncInterface="#p2=10.x.32.11:3260" -f $addr2,
	$hbInterface="#p2=10.x.31.11:3260" -f $addr2,
	$createTarget=$true,
#secondary node
	$imagePath2="My computer\E\StoragePool",
	$imageName2="partnerImg22",
	$createImage2=$true,
	$storageName2="",
	$targetAlias2="partnerha22",
	$autoSynch2=$true,
	$poolName2="pool1",
	$syncSessionCount2=1,
	$aluaOptimized2=$true,
	$cacheMode2=$cacheMode,
	$cacheSize2=$cacheSize,
	$syncInterface2="#p1=10.x.32.10:3260" -f $addr,
	$hbInterface2="#p1=10.x.31.10:3260" -f $addr,
	$createTarget2=$true
	)
	
Import-Module StarWindX

try
{
	Enable-SWXLog

	$server = New-SWServer -host $addr -port $port -user $user -password $password

	$server.Connect()

	$firstNode = new-Object Node

	$firstNode.HostName = $addr
	$firstNode.HostPort = $port
	$firstNode.Login = $user
	$firstNode.Password = $password
	$firstNode.ImagePath = $imagePath
	$firstNode.ImageName = $imageName
	$firstNode.Size = $size
	$firstNode.CreateImage = $createImage
	$firstNode.StorageName = $storageName
	$firstNode.TargetAlias = $targetAlias
	$firstNode.AutoSynch = $autoSynch
	$firstNode.SyncInterface = $syncInterface
	$firstNode.HBInterface = $hbInterface
	$firstNode.PoolName = $poolName
	$firstNode.SyncSessionCount = $syncSessionCount
	$firstNode.ALUAOptimized = $aluaOptimized
	$firstNode.CacheMode = $cacheMode
	$firstNode.CacheSize = $cacheSize
	$firstNode.FailoverStrategy = $failover
	$firstNode.CreateTarget = $createTarget
    
	#
	# device sector size. Possible values: 512 or 4096(May be incompatible with some clients!) bytes. 
	#
	$firstNode.SectorSize = $sectorSize
    
	$secondNode = new-Object Node

	$secondNode.HostName = $addr2
	$secondNode.HostPort = $port2
	$secondNode.Login = $user2
	$secondNode.Password = $password2
	$secondNode.ImagePath = $imagePath2
	$secondNode.ImageName = $imageName2
	$secondNode.CreateImage = $createImage2
	$secondNode.StorageName = $storageName2
	$secondNode.TargetAlias = $targetAlias2
	$secondNode.AutoSynch = $autoSynch2
	$secondNode.SyncInterface = $syncInterface2
	$secondNode.HBInterface = $hbInterface2
	$secondNode.SyncSessionCount = $syncSessionCount2
	$secondNode.ALUAOptimized = $aluaOptimized2
	$secondNode.CacheMode = $cacheMode2
	$secondNode.CacheSize = $cacheSize2
	$secondNode.FailoverStrategy = $failover
	$secondNode.CreateTarget = $createTarget2
        
	$device = Add-HADevice -server $server -firstNode $firstNode -secondNode $secondNode -initMethod $initMethod
    
	while ($device.SyncStatus -ne [SwHaSyncStatus]::SW_HA_SYNC_STATUS_SYNC)
	{
		$syncPercent = $device.GetPropertyValue("ha_synch_percent")
	        Write-Host "Synchronizing: $($syncPercent)%" -foreground yellow

		Start-Sleep -m 2000

		$device.Refresh()
	}
}
catch
{
	Write-Host $_ -foreground red 
}
finally
{
	$server.Disconnect()
}
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Tue Apr 14, 2020 4:04 pm

Greetings,

Did other VMs migrate well? Any errors from ESXi would be helpful.
For installation I used the CreateHA_2.ps1 script:

Code: Select all

param($addr="10.x.31.10", $port=3261, $user="root", $password="starwind",
   $addr2="10.x.31.11", $port2=$port, $user2=$user, $password2=$password,
This means the servers are discovered over the synchronization interfaces, which is not what we generally recommend. On top of that, consider adding one more heartbeat channel (e.g., the management link).
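
For illustration, an extra heartbeat channel can be added in the script by listing a second address in the heartbeat interface parameters, roughly like this (a sketch only; the 172.16.0.x management addresses are placeholders, and the comma-separated interface syntax should be verified against the CreateHA_2.ps1 sample shipped with your build):

Code: Select all

#primary node: iSCSI/heartbeat link plus the management link as a second heartbeat channel
	$hbInterface="#p2=10.x.31.11:3260,172.16.0.11:3260",
#secondary node
	$hbInterface2="#p1=10.x.31.10:3260,172.16.0.10:3260",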
umg
Posts: 4
Joined: Tue Apr 14, 2020 1:30 pm

Tue Apr 14, 2020 10:03 pm

Hi yaroslav,

The IPs at the top of the config file refer to $hbInterface and $hbInterface2, which are the management/heartbeat links, or am I wrong?
The next interesting thing is that vMotion works when only one StarWind instance is presented to the ESXi hosts (while the other host is syncing).

This is what I could find in the ESXi logs when vMotion is not working:

Code: Select all

2020-04-14T13:23:32.124Z| Worker#1| I125: SVMotion: Enter Phase 8
2020-04-14T13:23:32.125Z| Worker#1| I125: Disk/File copy started for /vmfs/volumes/vmname/vmname.vmdk.
2020-04-14T13:23:42.129Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:23:52.133Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:24:02.137Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:24:12.142Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:24:22.142Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:24:32.144Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:24:42.148Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:24:52.151Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:25:02.155Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:25:12.160Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.01 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:25:22.162Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.00 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:25:32.170Z| vmx| W115: SVMotion: scsi0:2: Disk transfer rate slow: 0 kB/s over the last 10.01 seconds, copied total 0 MB at 0 kB/s.
2020-04-14T13:25:42.152Z| Worker#0| I125: Migrate: Remote Log: Destination waited for 139.75 seconds.
2020-04-14T13:25:42.168Z| vcpu-0| I125: MigrateSetStateFinished: type=1 new state=6
2020-04-14T13:25:42.168Z| vcpu-0| I125: MigrateSetState: Transitioning from state 3 to 6.
2020-04-14T13:25:42.168Z| vcpu-0| A100: ConfigDB: Setting config.readOnly = "FALSE"
2020-04-14T13:25:42.168Z| vcpu-0| I125: Migrate_SetFailureMsgList: switching to new log file.
2020-04-14T13:25:42.169Z| vcpu-0| I125: Migrate_SetFailureMsgList: Now in new log file.
2020-04-14T13:25:42.170Z| vcpu-0| I125: Migrate: Caching migration error message list:
2020-04-14T13:25:42.170Z| vcpu-0| I125: [msg.checkpoint.precopyfailure] Migration to host <10.100.x.x> failed with error Connection closed by remote host, possibly due to timeout (0xbad003f).
2020-04-14T13:25:42.170Z| vcpu-0| I125: [vob.vmotion.stream.keepalive.read.fail] vMotion migration [a64020b:8817324592913087877] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
2020-04-14T13:25:42.170Z| vcpu-0| I125: Migrate: cleaning up migration state.
2020-04-14T13:25:42.170Z| vcpu-0| I125: SVMotion: Enter Phase 12
2020-04-14T13:25:42.170Z| vcpu-0| I125: SVMotion_Cleanup: Scheduling cleanup thread.
2020-04-14T13:25:42.170Z| vcpu-0| I125: Msg_Post: Error
2020-04-14T13:25:42.170Z| Worker#0| I125: SVMotionCleanupThread: Waiting for SVMotion Bitmap thread to complete.
2020-04-14T13:25:42.170Z| vcpu-0| I125: [vob.vmotion.stream.keepalive.read.fail] vMotion migration [a64020b:8817324592913087877] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
2020-04-14T13:25:42.170Z| Worker#1| W115: SVMotionMirroredModeThreadDiskCopy: Found internal error when woken up on diskCopySemaphore. Aborting storage vmotion.
2020-04-14T13:25:42.170Z| vcpu-0| I125: [msg.checkpoint.precopyfailure] Migration to host <10.100.x.x> failed with error Connection closed by remote host, possibly due to timeout (0xbad003f).
2020-04-14T13:25:42.170Z| vcpu-0| I125: ----------------------------------------
2020-04-14T13:25:42.170Z| Worker#1| W115: SVMotionCopyThread: disk copy failed. Canceling Storage vMotion.
2020-04-14T13:25:42.174Z| Worker#0| I125: SVMotionCleanupThread: Waiting for SVMotion thread to complete.
2020-04-14T13:25:42.174Z| Worker#1| I125: SVMotionCopyThread: Waiting for SVMotion Bitmap thread to complete before issuing a stun during migration failure cleanup.
2020-04-14T13:25:42.174Z| Worker#1| I125: SVMotion: FailureCleanup thread completes.
2020-04-14T13:25:42.174Z| Worker#0| I125: SVMotionCleanupThread: Waiting for final stun/unstun to finish
2020-04-14T13:25:42.174Z| vmx| I125: SVMotion: Worker thread performing SVMotionCopyThreadDone exited.
2020-04-14T13:25:42.177Z| vmx| I125: MigrateSetStateFinished: type=1 new state=6
2020-04-14T13:25:42.177Z| vmx| I125: MigrateSetState: Transitioning from state 6 to 6.
2020-04-14T13:25:42.177Z| vmx| A100: ConfigDB: Setting config.readOnly = "FALSE"
Greetings!
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Thu Apr 16, 2020 3:28 pm

Hi umg,
The IPs at the top of the config file refer to $hbInterface and $hbInterface2, which are the management/heartbeat links, or am I wrong?
Absolutely right, but you need to refer to them as heartbeat interfaces too.
The next interesting thing is that vMotion works when only one StarWind instance is presented to the ESXi hosts (while the other host is syncing).
Sorry, did you mean that vMotion works only if the two HA devices are synchronized?
If so, please check whether there are partner connections on both hosts, and verify in the Software iSCSI adapter that the iSCSI paths are active for all hosts.

Code: Select all

2020-04-14T13:23:32.124Z| Worker#1| I125: SVMotion: Enter Phase 8
vMotion fails when vCenter completes the task (I got that from here). Please check the vMotion settings and see whether the storage is available over all paths. There was a similar problem here: https://communities.vmware.com/thread/558818.
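
For checking the paths, something along these lines can be used from PowerCLI (a sketch; the vCenter name is a placeholder, and the STARWIND vendor filter is an assumption about how the LUNs are reported):

Code: Select all

# connect to vCenter (placeholder name)
Connect-VIServer -Server vcenter.local

# list every path of the StarWind-backed LUNs on each host, with its state (Active/Standby/Dead)
foreach ($vmhost in Get-VMHost) {
    Get-ScsiLun -VmHost $vmhost -LunType disk |
        Where-Object { $_.Vendor -match "STARWIND" } |
        Get-ScsiLunPath |
        Select-Object @{ N = "Host"; E = { $vmhost.Name } }, Name, State
}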
umg
Posts: 4
Joined: Tue Apr 14, 2020 1:30 pm

Fri Apr 24, 2020 12:35 pm

Hello Yaroslav,

Your colleague was kind enough to send me a trial license to figure out the problem. I completely recreated the HA device with:

1x Sync interface
1x Heartbeat on the iSCSI interface for VMware
1x Heartbeat on Management interface

But I still have problems with the cluster. vMotion behaves like before; it is still not possible to migrate VMs to the StarWind datastore.
What sometimes works: when I shut down one StarWind host, vMotion becomes possible.

I checked on both ESXi hosts that the multipathing policy is set to Round Robin and that NIC teaming is disabled on all StarWind port groups.
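
For reference, the policy can also be checked and corrected from PowerCLI, roughly like this (a sketch; the host names are placeholders and the STARWIND vendor filter is an assumption):

Code: Select all

# switch any StarWind LUN that is not already on Round Robin to Round Robin
foreach ($vmhost in Get-VMHost -Name "esxi01.local", "esxi02.local") {
    Get-ScsiLun -VmHost $vmhost -LunType disk |
        Where-Object { $_.Vendor -match "STARWIND" -and $_.MultipathPolicy -ne "RoundRobin" } |
        Set-ScsiLun -MultipathPolicy RoundRobin
}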

In the StarWind log I see the following messages a lot:

Code: Select all

4/24 11:52:40.178015 608 HA: HANode::completeSscCommandRequest: (0x4D) CHECK_CONDITION , sense: 0x5 0x20/0x0 returned.
4/24 11:53:32.265695 17b4 HA: *** HANode::sscRequestTaskExecute: SCSI opcode 0x4D is not supported!

I am running out of ideas about what else I could do to get this working.

Thanks in advance
umg
Posts: 4
Joined: Tue Apr 14, 2020 1:30 pm

Mon Apr 27, 2020 2:17 pm

Update:

I was looking through this forum for similar problems, and my issue seems to be like the one in this post:

https://forums.starwindsoftware.com/vie ... 143#p29408

Have you found a solution for the problem?

Best regards
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Mon Apr 27, 2020 6:52 pm

Hello,

Sorry for the delay.
We need full logs from both servers and the VMs. Please use the StarWind Log Collector: https://knowledgebase.starwindsoftware. ... collector/. Transfer the logs via Google Drive or Dropbox. ESXi support logs will be helpful too. Please trigger the issue and perform the migration (just run it the way you normally would) shortly before collecting the logs.
viewtopic.php?f=5&t=5264&p=29408&hilit=vmotion&sid=2d81355d4edbb2a8f73dfaffcf160143#p29408

Have you found a solution for the problem?
Yes. There was an outdated build (not the case for you); there was an issue with that particular build of StarWind VSAN for Hyper-V.
There was also a trick with the ATS heartbeat.
To disable the ATS heartbeat, run either the ESXi CLI command or the PowerCLI command below.
Command line:

Code: Select all

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

PowerCLI:

Code: Select all

Get-AdvancedSetting -Entity VMHost-Name -Name VMFS3.UseATSForHBOnVMFS5 | Set-AdvancedSetting -Value 0 -Confirm:$false

Note: even though the setting name says VMFS5, it also applies to VMFS6.
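
To double-check the value afterwards, something like this works from PowerCLI (a sketch, assuming an existing vCenter connection):

Code: Select all

# show the current ATS heartbeat setting on every host
Get-AdvancedSetting -Entity (Get-VMHost) -Name "VMFS3.UseATSForHBOnVMFS5" |
    Select-Object Entity, Name, Value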
yaroslav (staff)
Staff
Posts: 2277
Joined: Mon Nov 18, 2019 11:11 am

Wed Apr 29, 2020 3:03 pm

Hi umg,

There might be a connectivity issue. Maybe the iSCSI network is so heavily utilized by iSCSI traffic that vMotion simply cannot get through.
Please consider reconfiguring the network or offloading vMotion to another network.
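
For illustration, moving vMotion off the iSCSI VMkernel port onto a dedicated one could look roughly like this in PowerCLI (a sketch; the host name and vmk numbers are placeholders for your environment):

Code: Select all

# disable vMotion on the iSCSI VMkernel adapter...
Get-VMHostNetworkAdapter -VMHost "esxi01.local" -VMKernel -Name "vmk1" |
    Set-VMHostNetworkAdapter -VMotionEnabled:$false -Confirm:$false

# ...and enable it on a dedicated VMkernel adapter instead
Get-VMHostNetworkAdapter -VMHost "esxi01.local" -VMKernel -Name "vmk2" |
    Set-VMHostNetworkAdapter -VMotionEnabled:$true -Confirm:$false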