ESXi failover not reliable?

danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Wed Jul 10, 2019 2:11 am

I'm assuming I have something configured incorrectly. I have 2 ESXi 6.7 hosts in an HA cluster, with 2 Windows Server 2016 appliances running in synchronous replication mode. I installed the rescan PowerShell scripts and confirmed they are running, since they are setting the iSCSI HBA on both hypervisors to RR mode, IOPS=1.

Here's the issue I am having: sometimes, when I need to reboot one of the hypervisors, I shut down the 2016 VSA, put the host in maintenance mode, then reboot. That should work (and sometimes does). Unfortunately, several times now, what happens is that the other host loses its iSCSI connection, and guests start crashing with unrecoverable disk errors. Once the rebooted host is back up, all is okay. This is obviously not how this should be working. In the correct cases, in vCenter, I can see one dead path and one active path, so good. Unfortunately, in the bad case, the vCenter Server Appliance is borked, so I am unable to examine the iSCSI pathing to see what is happening.

It's almost like the dead path detection isn't working properly. I vaguely recall seeing a post somewhere (which I now can't find) that alluded to a change VMware made in 6.7, where when a path fails, it very quickly checks the other path, and if that one is down, declares APD. It occurred to me that maybe there is a small window where StarWind is not responding from the still-up VSA, and ESXi incorrectly declares everything down. If I could find that article, I'd try changing that setting, but I don't seem to see it. Any thoughts or ideas welcome :)
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Wed Jul 10, 2019 5:36 pm

Well, I think I figured out what is happening, and it has nothing to do with StarWind - lol. The two hosts are linked directly via a pair of 50Gb/s Mellanox cards. For maximum performance, I pass an SR-IOV virtual function in to each StarWind VSA. The problem is this: when I shut down host B, the NIC in host A loses link, and apparently the StarWind VSA is unable to talk to the Mellanox NIC on host A, so both paths are down, and vSphere bitches and moans :( Seems like there are 3 possible ways to fix this:

1. Replace the virtual function NIC with a vmxnet3 NIC. Pro: traffic never leaves the switch so this avoids loss of connectivity. Con: reduced performance.

2. Add a 1gb nic to each host's virtual switch and make it a failover NIC only - I don't care about reduced performance in that scenario. Pro: Same max performance in normal case. Con: need to add a 1gb link between the two hosts, and it can't be a direct link, or it would have the same issue.

3. Find some way to instruct vsphere and/or server 2016 to ignore link down state. Pro: simplest configuration. Con: awful kludge (and I don't even know if that is possible!)
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Fri Jul 12, 2019 9:32 am

Yes, you can go these ways.
Regarding VMXnet3 adapters, are you sure that your underlying storage is working faster than the performance of VMXnet3 adapters? :)
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Fri Jul 12, 2019 4:51 pm

It is a Samsung 960 PRO NVMe. Reads can exceed 2GB/sec, so probably yes...
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Sat Jul 13, 2019 8:04 pm

Too many hassles with the teaming approach. Here's what I did instead:

Instead of RR, I use vmware fixed multipathing.

Each host has 3 paths to StarWind storage: 192.168.3.x (50gb) pointing to the local VSAN appliance, 192.168.4.x (local 1gb link) and 192.168.4.y (other node's 1gb link). I created a 2-port VLAN on the switch for the 1gb link, so as to avoid the problematic loss of Ethernet carrier when a vSphere host is powered off.

If the local VSAN appliance is down (Microsoft update?), the fast and slow paths to it are dead, but the other host's slow path is still up, and there is no effect on the other host. If the entire host is down, the other host's fast link is down (due to the aforesaid loss of carrier), as is the down host's slow link, but the up host's slow link is still up. It seems to work (and I can't think of any reason it shouldn't).

The one piece of the puzzle I haven't yet addressed: I had to disable the rescan PowerShell scripts until I figure out how to change them to re-prefer the fast link as appropriate.
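
The failure cases described above can be checked with a small model. This is only a sketch in Python, not anything taken from vSphere or StarWind: the path labels (`fast-local`, `slow-local`, `slow-remote`, `fast-remote`) are hypothetical names, and the "fixed" policy is simplified to "use the preferred path if it is alive, otherwise the first alive path". It assumes the 50gb path depends on carrier on the direct link, while the 1gb paths go through the switch VLAN and keep carrier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str                # hypothetical label, not a real vSphere path name
    target_vsa: str          # "A" or "B": which host's VSA serves this path
    needs_direct_link: bool  # True if the path rides the 50gb point-to-point link


def alive_paths(paths, vsa_up, direct_link_up):
    """Return paths whose target VSA is up and whose link still has carrier."""
    return [
        p for p in paths
        if vsa_up[p.target_vsa] and (direct_link_up or not p.needs_direct_link)
    ]


def pick_fixed(paths, preferred, vsa_up, direct_link_up):
    """Simplified fixed policy: preferred path if alive, else first alive, else None."""
    names = [p.name for p in alive_paths(paths, vsa_up, direct_link_up)]
    if preferred in names:
        return preferred
    return names[0] if names else None


# Host A's view of the three-path layout.
HOST_A_PATHS = [
    Path("fast-local", "A", True),    # 192.168.3.x, 50gb, SR-IOV virtual function
    Path("slow-local", "A", False),   # 192.168.4.x, 1gb via the switch VLAN
    Path("slow-remote", "B", False),  # 192.168.4.y, 1gb via the switch VLAN
]

# The earlier layout, where both paths depended on the direct 50gb link.
OLD_PATHS = [
    Path("fast-local", "A", True),
    Path("fast-remote", "B", True),
]

# Each scenario: (which VSAs are up, whether the direct link has carrier).
SCENARIOS = {
    "normal": ({"A": True, "B": True}, True),
    "local VSA down": ({"A": False, "B": True}, True),
    "host B powered off": ({"A": True, "B": False}, False),
}

for label, (vsa_up, link_up) in SCENARIOS.items():
    new = pick_fixed(HOST_A_PATHS, "fast-local", vsa_up, link_up)
    old = pick_fixed(OLD_PATHS, "fast-local", vsa_up, link_up)
    print(f"{label}: three-path layout -> {new}, old layout -> {old}")
```

Under these assumptions the three-path layout always leaves host A with one usable path, while the old layout has none when host B is powered off, which matches the all-paths-down behavior reported earlier in the thread.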
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Wed Jul 17, 2019 1:08 pm

danswartz wrote: It is samsung 960 PRO NVME. Reads can exceed 2GB/sec, so probably yes...
VMXnet3 adapters can work faster.
In any case, Round Robin can provide better performance than fixed multipathing.
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Wed Jul 17, 2019 1:24 pm

In this case, there is one fast link (50gb) and one slow link (1gb), so round robin is going to be a big loser, no?
anton (staff)
Site Admin
Posts: 4008
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Jul 18, 2019 11:01 am

danswartz wrote: In this case, there is one fast link (50gb) and one slow link (1gb), so round robin is going to be a big loser, no?
The slow one will be used for pings only, and working pings on 1 GbE are faster than a response from an improperly closed TCP connection over the 50 GbE one.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Thu Jul 18, 2019 11:48 am

I'm confused. How does StarWind know the 1gb link is too slow and should only be used for pings? Queue depth?
anton (staff)
Site Admin
Posts: 4008
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Jul 18, 2019 11:59 am

danswartz wrote: I'm confused. How does starwind know the 1gb link is too slow and only use for pings? Queue depth?
We don't. You pick the synchronization channel and the heartbeat network yourself.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Thu Jul 18, 2019 1:09 pm

We are not communicating, I think :( I have a 50gb link and a 1gb link. I don't understand how there is any way round-robin can be performant in that scenario?
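
The arithmetic behind this concern can be sketched as follows. It is an idealized model, and it assumes (this is not stated in the thread) that both links carry iSCSI data and that round robin hands an equal share of equal-sized I/Os to each path. It ignores latency, queue depth, and protocol overhead, and it does not cover a layout where the 1gb link carries only heartbeat traffic. The function names are made up for illustration.

```python
def round_robin_ceiling_gbps(link_speeds_gbps):
    """Upper bound on aggregate throughput when I/O is split evenly across paths.

    Every path carries the same share, so the slowest path saturates first
    and caps the total at (number of paths) x (slowest link speed).
    """
    return len(link_speeds_gbps) * min(link_speeds_gbps)


def fixed_ceiling_gbps(link_speeds_gbps, preferred_index=0):
    """Upper bound with a fixed policy: only the preferred path carries I/O."""
    return link_speeds_gbps[preferred_index]


links = [50.0, 1.0]  # the 50gb direct link and the 1gb switched link
print("even split ceiling (Gb/s):", round_robin_ceiling_gbps(links))
print("fixed, 50gb preferred (Gb/s):", fixed_ceiling_gbps(links))
```

With two equal 50gb paths the same even-split bound would be 100, which is the case where round robin is expected to help; the mismatch in link speeds is what changes the picture in this model.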
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Sun Jul 21, 2019 12:46 pm

It looks like in your scenario the 1Gbps links should not be used to connect ESXi hosts to storage. Feel free to try using just your faster link, i.e. the 50Gbps one.
Also, can you point me to the script you use for storage rescan? Otherwise, just paste it here, but be sure to remove the credentials.
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Sun Jul 21, 2019 1:48 pm

I am using the 1gb link only if the 50gb link is down due to the loss of carrier on point to point links when the other host is powered off. Standard vmware fixed, with 50gb preferred. I am using the recommended starwind powershell script (only I commented out the setting of round-robin).
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Sun Jul 21, 2019 2:25 pm

With VMXNET3 used for the virtual adapter, you will not have to worry about the partner host being down. The local adapter will stay active.
Can you still show me the script? At StarWind Support we have seen several cases where ESXi 6.7 would fail to rescan the storage with the script used for 6.5. That could be your case, too.
danswartz
Posts: 71
Joined: Fri May 03, 2019 7:21 pm

Sun Jul 21, 2019 5:24 pm

Yes, good point about vmxnet3. I had tried that. I rejected it (for now at least) because that loses the ability of your VSA to provide iSER connectivity from ESXi. Here is the script (I deleted the round-robin policy setting line):

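Import-Module VMware.PowerCLI
$counter = 1
# Line 2 above is a first-run flag: the block at the end of this script rewrites it to 1.
# On the first run only ($counter = 0), tell PowerCLI to ignore invalid certificates.
if ($counter -eq 0) {
    Set-PowerCLIConfiguration -InvalidCertificateAction ignore -Confirm:$false | Out-Null
}
# Connection details for the ESXi host whose storage adapters are rescanned.
$ESXiHost = "10.0.0.5"
$ESXiUser = "root"
$ESXiPassword = "XXX"
Connect-VIServer $ESXiHost -User $ESXiUser -Password $ESXiPassword | Out-Null
# Rescan all HBAs on the host.
Get-VMHostStorage $ESXiHost -RescanAllHba | Out-Null
Disconnect-VIServer $ESXiHost -Confirm:$false
# Persist the first-run flag by rewriting line 2 of this script file if needed.
$file = Get-Content "$PSScriptRoot\rescan_script.ps1"
if ($file[1] -ne "`$counter = 1") {
    $file[1] = "`$counter = 1"
    $file > "$PSScriptRoot\rescan_script.ps1"
}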

P.S. Not sure why you are concerned about the rescanning occurring. It does, but since the 50gb link is down, even the host that is up cannot use the 50gb link for the SR-IOV pass-through...