XenServer handling of StarWind HA failure

Posted: Fri Mar 02, 2018 4:47 pm
by arthurr
Hello,

I'm using XenServer 7.0, configured per https://www.starwindsoftware.com/techni ... r-Pool.pdf, with a two-node StarWind HA cluster running 8.0.11456. I plan to update to XenServer 7.2; however, I've checked, and the sm package is basically unchanged across all 7.0+ versions (including all patches). XenServer 7.2 wasn't yet supported by our backup appliance, Unitrends, when the pool was provisioned, which is why it wasn't used.

XenServer iSCSI failover has worked correctly before, such as when a StarWind VM boot volume ran out of disk space. But when StarWind HA itself fails, XenServer doesn't fail over correctly. The XenServer sm logs show an authentication failure, as expected, but no attempt to try the other StarWind host. I can only repair the iSCSI SR by blocking outbound traffic to the faulted StarWind host via iptables and then using XenCenter's SR repair. This is strange because the second path should already be enabled. Unfortunately, I haven't been able to collect the right info (multipath command output, sm log, etc.), so now I have to try to reproduce the issue, which is rather painful.

Worse yet, with XenServer HA enabled, the whole XenServer pool reboots when this happens.

Hoping this is a known issue with some workaround or resolved in XenServer 7.2.

Thanks,
Arthur

Re: XenServer handling of StarWind HA failure

Posted: Fri Mar 02, 2018 4:51 pm
by arthurr
Also, XenServer support has been worthless, as always, in trying to resolve this.

Re: XenServer handling of StarWind HA failure

Posted: Mon Mar 05, 2018 5:18 pm
by Oleg(staff)
Hi arthurr,
StarWind supports the separated compute and storage scenario with XenServer.
Please give us more details about the configuration: how StarWind is configured, how many physical connections you have for iSCSI/Sync, how many channels are configured for iSCSI, how many multipath paths are shown in the XenServer GUI, etc.
Could you please collect the logs from your StarWind VMs and StarWind, and PM me so we can better understand the problem you faced? Also, please specify more details about your system configuration and the actual date the issue occurred, and check the recommendations for VMs that run StarWind VSAN.
You can collect the logs using this tool.

Re: XenServer handling of StarWind HA failure

Posted: Sun Mar 11, 2018 5:40 pm
by arthurr
The issue is that XenServer doesn't log in to down iSCSI targets when the PBD status is attached. I created a XenServer bug on the open source tracker: https://bugs.xenserver.org/browse/XSO-848. You can get more info about my configuration there, but nothing further is needed from the StarWind side.

Re: XenServer handling of StarWind HA failure

Posted: Sun Mar 11, 2018 5:53 pm
by arthurr
To be clear, the difference in behavior had nothing to do with how it failed: what mattered was whether all paths were up before the event.

Here's the worst case scenario:

1. Target 1 goes down.
2. Target 1 is fixed but is not reconnected automatically by XenServer, nor by SR repair / xe pbd-plug.
3. Target 2 goes down. XenServer still doesn't try to reconnect path 1 automatically. SR repair / xe pbd-plug will reconnect path 1, except when dangling RH Device Mapper entries are found.
4. VDI lock files result in some VM boot issues.
5. Target 2 is fixed but is not reconnected automatically by XenServer, nor by SR repair / xe pbd-plug.

You can do an SR detach and then attach to restore all paths, but all affected VMs have to be off. A better workaround is to log in with the iscsiadm command to restore the down paths.

Another poorly handled scenario:
1. StarWind HA is out of sync.
2. XenServer boots.
3. Target 1 returns an authentication failure due to its sync state. XenServer doesn't try target 2, even on SR repair / xe pbd-plug.

The only workaround is to use the iscsiadm command to connect to the good target, or to block the bad target with a firewall.
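As a rough illustration of those two workarounds, here is a minimal dry-run sketch. The portal address, IQN and faulted-host IP are placeholders, not values from this thread, and port 3260 is assumed because it is the standard iSCSI port. The script only prints the commands so they can be reviewed before running them by hand.

```shell
#!/bin/sh
# Sketch of the two manual workarounds. PORTAL, TARGET and FAULTED_IP are
# placeholders (assumptions), not values taken from this thread.
PORTAL="192.0.2.10:3260"
TARGET="iqn.2008-08.com.starwindsoftware:example-target"
FAULTED_IP="192.0.2.11"

# Print the iscsiadm login command for one portal/target pair.
# $1 = portal (ip[:port]), $2 = target IQN
login_cmd() {
    printf 'iscsiadm -m node -T %s -p %s -l\n' "$2" "$1"
}

# Print an iptables rule that rejects outbound iSCSI traffic to one host.
# Port 3260 is the standard iSCSI port; adjust if the target listens elsewhere.
block_cmd() {
    printf 'iptables -I OUTPUT -d %s -p tcp --dport 3260 -j REJECT\n' "$1"
}

# Dry run: print the commands for review instead of executing them.
login_cmd "$PORTAL" "$TARGET"
block_cmd "$FAULTED_IP"
```

Remember to delete the iptables rule again once the faulted node is healthy, otherwise that path can never come back.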

Re: XenServer handling of StarWind HA failure

Posted: Sun Mar 11, 2018 6:04 pm
by arthurr

Re: XenServer handling of StarWind HA failure

Posted: Mon Mar 12, 2018 4:50 am
by Boris (staff)
arthurr,

Thank you for drawing the Xen developers' attention to this. I hope they will fix the issue as soon as possible. Keep us informed of the results of their investigation.

Re: XenServer handling of StarWind HA failure

Posted: Mon Mar 12, 2018 2:28 pm
by arthurr
Well, I see it got assigned to a team, but I have no idea how responsive they'll be. Given their track record, I'm not holding my breath. I'm going to try to reproduce this on a test instance and then contact XenServer support. Anything you guys can do would be helpful, given that you've gone through the certification process and may have some internal contacts.

Is there any workaround from the StarWind side to block access differently to an out-of-sync node?

Re: XenServer handling of StarWind HA failure

Posted: Mon Mar 12, 2018 5:11 pm
by arthurr
Actually, XenServer multipath recovery works fine as long as there is no need to log in to StarWind, so the reason for the failure does matter.

Re: XenServer handling of StarWind HA failure

Posted: Tue Mar 13, 2018 12:12 am
by arthurr
I created the following Perl script, which uses PBD metadata and the iscsiadm command to monitor iSCSI sessions and log in to down targets. It has to be deployed to each XenServer pool member as a frequently run cron job or similar. Use at your own risk.

Code: Select all

#!/usr/bin/env perl
use Getopt::Long;

use strict;
use warnings;

my $debug = 0;
my $report_only = 0;
my $host_uuid;
my $sr_uuid;

GetOptions (
    'sr-uuid=s' => \$sr_uuid,
    'report-only' => \$report_only,
    'debug' => \$debug
);

my $uuid_regex = '[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}';

my $exit_code = 0;

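# --sr-uuid is required; reject a missing or malformed value up front.
if (!defined($sr_uuid) || $sr_uuid !~ /^$uuid_regex$/) {
    print STDERR "Invalid or missing SR UUID (--sr-uuid): " . ($sr_uuid // '') . "\n";
    exit 2;
}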

open(my $xensource_inventory_fh, '<', '/etc/xensource-inventory') or die("Unable to read /etc/xensource-inventory: $!");
while(<$xensource_inventory_fh>) {
	chomp;
	if ($_ =~ /INSTALLATION_UUID='($uuid_regex)'/) {
		$host_uuid = $1;
		last;
	}
}

if (! defined($host_uuid)) {
	print STDERR "Unable to determine host-uuid\n";
	exit 2;
}

my @sessions;
print  "iscsiadm -m session\n" if $debug;
open(my $iscsiadm_session_stdout, '-|', 'iscsiadm', '-m', 'session');
while(<$iscsiadm_session_stdout>) {
	chomp;
	if (my @session = $_ =~ /.*?\]\s+(.*?):[0-9]+,[0-9]+\s+(iqn[^\s]+)/) {
		push(@sessions, \@session);
	}
}
close $iscsiadm_session_stdout;
exit(($? >> 8) || 1) if $? != 0; # $? is a wait status; shift to get the real exit code

my $pbd_uuid;
print "xe pbd-list host-uuid=$host_uuid sr-uuid=$sr_uuid\n" if $debug;
open(my $xe_pbd_list_stdout, '-|', 'xe', 'pbd-list', "host-uuid=$host_uuid", "sr-uuid=$sr_uuid");
while(<$xe_pbd_list_stdout>) {
	chomp;
	if ($_ =~ /^uuid.*($uuid_regex)/) {
		$pbd_uuid = $1;
	}
}
close($xe_pbd_list_stdout);
exit(($? >> 8) || 1) if $? != 0; # $? is a wait status; shift to get the real exit code

if (defined($pbd_uuid)) {
    my $multi_session_config;
    print "xe pbd-param-get param-name=device-config uuid=$pbd_uuid\n" if $debug;
    open(my $xe_pbd_param_get_stdout, '-|', 'xe', 'pbd-param-get', 'param-name=device-config', "uuid=$pbd_uuid");
    while(<$xe_pbd_param_get_stdout>) {
		chomp;
		if ($_ =~ /^multiSession.*: (.*?)\|;/) {
			$multi_session_config = $1;
		}
    }
    close($xe_pbd_param_get_stdout);
    exit(($? >> 8) || 1) if $? != 0; # $? is a wait status; shift to get the real exit code

    if (defined($multi_session_config)) {
		my @paths = split(/\|/, $multi_session_config);
		if (scalar @paths < 1) {
			print STDERR "Unable to parse 'xe pbd-param-get' command output\n";
			exit 2;
		} else {
			foreach my $path (@paths) {
				my $match_found;
				my @path_settings = split(/,/, $path);
				foreach my $session (@sessions) {
					if ($$session[0] eq $path_settings[0] && $$session[1] eq $path_settings[2]) {
				        $match_found = 1;
					}
				}
				if (! $match_found) {
					print STDERR "No session found for $path_settings[2] on $path_settings[0]\n";
					if (! $report_only) {
						print "iscsiadm -m node -T $path_settings[2] -p $path_settings[0] -l\n" if $debug;
						system('iscsiadm', '-m', 'node', '-T', $path_settings[2], '-p', $path_settings[0], '-l');
						$exit_code = 1 if $? != 0;
					}
				}
			}
		}
    } else {
		print STDERR "Unable to parse 'xe pbd-param-get' command output\n";
		exit 2;
    }
} else {
	print STDERR "Unable to locate PBD for host-uuid=$host_uuid and sr-uuid=$sr_uuid. Please check Host and SR UUID then try SR repair.\n";
}

exit $exit_code;
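
For the "frequently run cron job" deployment mentioned above, a minimal sketch of a matching /etc/cron.d entry might look like the following. The script path, lock file and log tag are hypothetical names, the all-zero UUID is a placeholder for the real SR UUID, and the sketch assumes flock and logger are available on the XenServer host. It only prints the cron line so it can be reviewed before being installed.

```shell
#!/bin/sh
# Sketch: print an /etc/cron.d entry that runs the monitor once a minute.
# SCRIPT, the lock file and the logger tag are hypothetical names.
SCRIPT="/usr/local/sbin/iscsi-session-check.pl"

# $1 = SR UUID. "flock -n" skips a run if the previous one is still going,
# and logger sends the script's output to syslog.
cron_line() {
    printf '* * * * * root flock -n /var/lock/iscsi-session-check.lock %s --sr-uuid=%s 2>&1 | logger -t iscsi-session-check\n' "$SCRIPT" "$1"
}

# Placeholder UUID; substitute the SR's real UUID.
cron_line "00000000-0000-0000-0000-000000000000"
```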

Re: XenServer handling of StarWind HA failure

Posted: Wed Mar 14, 2018 9:23 am
by Boris (staff)
I believe this script may be useful to someone.
By the way, is there any news from the Xen developers on the issue?

Re: XenServer handling of StarWind HA failure

Posted: Wed Mar 14, 2018 4:27 pm
by arthurr
No update from Xen developers. I am setting up a test instance with a licensed XenServer 7.4 so I can report to XenServer support too.

Re: XenServer handling of StarWind HA failure

Posted: Fri Mar 16, 2018 3:17 pm
by arthurr
Here's a scenario that just happened.

1. Took down a StarWind node using service shutdown
2. When the StarWind node was restored it had to do a full sync.
3. When the full sync finished Xen restored the path with no admin action.

Does StarWind log out sessions from the out-of-sync instance(s) upon HA failure?

So far, the Xen developers' response has been "working as designed."

Re: XenServer handling of StarWind HA failure

Posted: Fri Mar 16, 2018 4:27 pm
by Boris (staff)
StarWind blocks iSCSI connections to a device that is in a state other than "Synchronized". This is done to prevent data corruption in case of accidental writes in such situations.

Re: XenServer handling of StarWind HA failure

Posted: Fri Mar 16, 2018 5:39 pm
by arthurr
Understood, but I'm trying to figure out when the session is no longer reported by "iscsiadm -m session". I've already tried a ban and disconnect in StarWind, and adding CHAP permissions without updating the client config: neither removes the session from the client. As long as the client thinks it has a session, it will try to re-establish it.
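
One way to watch for the moment a session drops off the client is to compare the expected portal/target pairs against what "iscsiadm -m session" reports. Below is a minimal sketch of that check. The sample session line, addresses and IQNs are made up for illustration, and the parsing assumes the usual "tcp: [N] ip:port,tpgt iqn..." output layout. On a real host the sample would be replaced by the live command output.

```shell
#!/bin/sh
# Reduce "iscsiadm -m session" output (stdin) to "ip iqn" pairs.
# Assumed line layout: tcp: [1] 192.0.2.10:3260,1 iqn.... (non-flash)
parse_sessions() {
    awk '$4 ~ /^iqn/ { sub(/:[0-9]+,[0-9]+$/, "", $3); print $3, $4 }'
}

# Read expected "ip iqn" pairs on stdin and print those with no session.
# $1 = output of parse_sessions
missing_paths() {
    sessions="$1"
    while read -r ip iqn; do
        printf '%s\n' "$sessions" | grep -Fxq "$ip $iqn" || printf '%s %s\n' "$ip" "$iqn"
    done
}

# Illustrative data only; on a host use: active=$(iscsiadm -m session | parse_sessions)
sample='tcp: [1] 192.0.2.10:3260,1 iqn.2008-08.com.example:target1 (non-flash)'
active=$(printf '%s\n' "$sample" | parse_sessions)

# Lists every expected path that currently has no session.
printf '%s\n' \
    '192.0.2.10 iqn.2008-08.com.example:target1' \
    '192.0.2.11 iqn.2008-08.com.example:target2' | missing_paths "$active"
```

Running this in a loop while taking a StarWind node out of sync would show whether, and when, the client actually drops the session.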

May I ask what scenarios were tested during the certification process?