XenServer handling of StarWind HA failure

arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Fri Mar 02, 2018 4:47 pm

Hello,

I'm using XenServer 7.0 configured per https://www.starwindsoftware.com/techni ... r-Pool.pdf with a two-node StarWind HA cluster running 8.0.11456. I plan to update to XenServer 7.2; however, I've checked and the sm package is basically unchanged across all 7.0+ versions (including all patches). XenServer 7.2 wasn't yet supported by our backup appliance, Unitrends, when the pool was provisioned, which is why it wasn't used.

XenServer iSCSI failover has worked correctly before, such as when a StarWind VM boot volume ran out of disk space. When StarWind HA itself fails, however, XenServer doesn't fail over correctly. The XenServer sm logs show an authentication failure, as expected, but no attempt to try the other StarWind host. I can only repair the iSCSI SR by blocking outbound traffic to the faulted StarWind host via iptables and then using XenCenter's SR repair. This is strange because the second path should already be enabled. Unfortunately I haven't been able to collect the right info (multipath command output, sm log, etc.), so now I have to try to reproduce the issue, which is rather painful.

Worse yet, with XenServer HA enabled, the whole XenServer pool reboots when this happens.

I'm hoping this is a known issue with a workaround, or that it is resolved in XenServer 7.2.

Thanks,
Arthur
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Fri Mar 02, 2018 4:51 pm

Also, XenServer support has been worthless as always with trying to resolve this.
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Mon Mar 05, 2018 5:18 pm

Hi arthurr,
StarWind supports the compute and storage separated scenario with XenServer.
Please give us more details about the configuration: how StarWind is configured, how many physical connections you have for iSCSI/Sync, how many channels are configured for iSCSI, how many multipath paths are shown in the XenServer GUI, etc.
Could you please collect the logs from your StarWind VMs and StarWind and PM them to me so we can better understand the problem you faced? Also, please specify more details about your system configuration and the actual date the issue occurred, and check the recommendations for VMs that are running StarWind VSAN.
You can collect the logs using this tool.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Sun Mar 11, 2018 5:40 pm

The issue is that XenServer doesn't log in to down iSCSI targets when the PBD status is attached. I created a XenServer bug on the open source tracker: https://bugs.xenserver.org/browse/XSO-848. You can get more info about my configuration there, but nothing further is needed from the StarWind side.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Sun Mar 11, 2018 5:53 pm

To be clear, the difference in behavior had nothing to do with how it failed: what mattered was whether all paths were up before the event.

Here's the worst-case scenario:

1. Target 1 goes down.
2. Target 1 is fixed but is not reconnected automatically by XenServer, nor even by SR repair / xe pbd-plug.
3. Target 2 goes down. XenServer still doesn't try to reconnect to path 1 automatically. SR repair / xe pbd-plug will reconnect path 1, except when dangling RH Device Mapper entries are found.
4. VDI lock files result in some VM boot issues.
5. Target 2 is fixed but is not reconnected automatically by XenServer, nor even by SR repair / xe pbd-plug.

You can do an SR detach then attach to restore all paths, but all affected VMs have to be off. A better workaround is to log in with the iscsiadm command to restore down paths.

Another poorly handled scenario:

1. StarWind HA is out of sync.
2. XenServer boots.
3. Target 1 returns an authentication failure due to its sync state. XenServer doesn't try target 2, even on SR repair / xe pbd-plug.

The only workaround is to use the iscsiadm command to connect to the good target, or to block the bad target with a firewall rule.
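
The two manual workarounds can be sketched as small shell helpers. This is only a rough sketch under stated assumptions: the function names (`login_target`, `block_target`, `unblock_target`) and the `DRY_RUN` switch are made up for illustration, the portal address and IQN are supplied by the caller, and the firewall rule assumes iSCSI is listening on the default TCP port 3260. By default the helpers only print the command they would run.

```shell
#!/bin/sh
# Sketch of the two manual workarounds. Addresses and IQNs are passed in by
# the caller; none of the values are taken from a real configuration.

# Commands are only printed unless DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Log in to one target on one portal.
# $1 = portal address, $2 = target IQN
login_target() {
    run iscsiadm -m node -T "$2" -p "$1" -l
}

# Reject outbound iSCSI traffic to a faulted StarWind host.
# Assumption: iSCSI uses the default TCP port 3260.
# $1 = address of the faulted host
block_target() {
    run iptables -I OUTPUT -d "$1" -p tcp --dport 3260 -j REJECT
}

# Remove the rule again once the host is healthy and back in sync.
unblock_target() {
    run iptables -D OUTPUT -d "$1" -p tcp --dport 3260 -j REJECT
}
```

After sourcing the file on a pool member, calling `login_target` with the good portal's address and IQN prints the login command; set `DRY_RUN=0` first to actually run it. A rule added with `block_target` does not survive a reboot and should be removed with `unblock_target` afterwards.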
Last edited by arthurr on Mon Mar 12, 2018 2:24 pm, edited 1 time in total.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Sun Mar 11, 2018 6:04 pm

Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Mon Mar 12, 2018 4:50 am

arthurr,

Thank you for drawing the Xen developers' attention to this. We hope they will fix the issue as soon as possible. Please keep us informed of the results of their investigation.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Mon Mar 12, 2018 2:28 pm

Well, I see it got assigned to a team, but I have no idea how responsive they'll be. Given their track record, I'm not holding my breath. I'm going to try to reproduce this on a test instance and then contact XenServer support. Anything you guys can do would be helpful, given that you've gone through the certification process and may have some internal contacts.

Is there any workaround from the StarWind side to block access differently to an out-of-sync node?
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Mon Mar 12, 2018 5:11 pm

Actually, XenServer multipath recovery works fine as long as there is no need to log in to StarWind, so the reason for the failure does matter.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Tue Mar 13, 2018 12:12 am

I created the following Perl script, which uses PBD metadata and the iscsiadm command to monitor iSCSI sessions and log in to down targets. It has to be deployed to each XenServer pool member as a frequently run cron job or similar. Use at your own risk.

Code:

#!/usr/bin/env perl
use Getopt::Long;

use strict;
use warnings;

my $debug = 0;
my $report_only = 0;
my $host_uuid;
my $sr_uuid;

GetOptions (
    'sr-uuid=s' => \$sr_uuid,
    'report-only' => \$report_only,
    'debug' => \$debug
);

my $uuid_regex = '[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}';

my $exit_code = 0;

if (! defined($sr_uuid) || $sr_uuid !~ /^$uuid_regex$/) {
    print STDERR "Missing or invalid SR UUID specified: " . (defined($sr_uuid) ? $sr_uuid : '(none)') . "\n";
    exit 2;
}

open(my $xensource_inventory_fh, '<', '/etc/xensource-inventory') or die("Unable to read /etc/xensource-inventory: $!");
while(<$xensource_inventory_fh>) {
	chomp;
	if ($_ =~ /INSTALLATION_UUID='($uuid_regex)'/) {
		$host_uuid = $1;
		last;
	}
}

if (! defined($host_uuid)) {
	print STDERR "Unable to determine host-uuid\n";
	exit 2;
}

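my @sessions;
print  "iscsiadm -m session\n" if $debug;
open(my $iscsiadm_session_stdout, '-|', 'iscsiadm', '-m', 'session') or die("Unable to run iscsiadm: $!");
while(<$iscsiadm_session_stdout>) {
	chomp;
	# Capture the portal address and target IQN from lines such as
	# "tcp: [1] <address>:<port>,<tpgt> <iqn> ..."
	if (my @session = $_ =~ /.*?\]\s+(.*?):[0-9]+,[0-9]+\s+(iqn[^\s]+)/) {
		push(@sessions, \@session);
	}
}
close $iscsiadm_session_stdout;
# $? holds the raw wait status; shift it to get the command's exit code
exit(($? >> 8) || 1) if $? != 0;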

my $pbd_uuid;
print "xe pbd-list host-uuid=$host_uuid sr-uuid=$sr_uuid\n" if $debug;
open(my $xe_pbd_list_stdout, '-|', 'xe', 'pbd-list', "host-uuid=$host_uuid", "sr-uuid=$sr_uuid");
while(<$xe_pbd_list_stdout>) {
	chomp;
	if ($_ =~ /^uuid.*($uuid_regex)/) {
		$pbd_uuid = $1;
	}
}
close($xe_pbd_list_stdout);
exit(($? >> 8) || 1) if $? != 0;

if (defined($pbd_uuid)) {
    my $multi_session_config;
    print "xe pbd-param-get param-name=device-config uuid=$pbd_uuid\n" if $debug;
    open(my $xe_pbd_param_get_stdout, '-|', 'xe', 'pbd-param-get', 'param-name=device-config', "uuid=$pbd_uuid");
    while(<$xe_pbd_param_get_stdout>) {
		chomp;
		if ($_ =~ /^multiSession.*: (.*?)\|;/) {
			$multi_session_config = $1;
		}
    }
    close($xe_pbd_param_get_stdout);
    exit(($? >> 8) || 1) if $? != 0;

    if (defined($multi_session_config)) {
		my @paths = split(/\|/, $multi_session_config);
		if (scalar @paths < 1) {
			print STDERR "Unable to parse 'xe pbd-param-get' command output\n";
			exit 2;
		} else {
			foreach my $path (@paths) {
				my $match_found;
				my @path_settings = split(/,/, $path);
				foreach my $session (@sessions) {
					if ($$session[0] eq $path_settings[0] && $$session[1] eq $path_settings[2]) {
				        $match_found = 1;
					}
				}
				if (! $match_found) {
					print STDERR "No session found for $path_settings[2] on $path_settings[0]\n";
					if (! $report_only) {
						print "iscsiadm -m node -T $path_settings[2] -p $path_settings[0] -l\n" if $debug;
						system('iscsiadm', '-m', 'node', '-T', $path_settings[2], '-p', $path_settings[0], '-l');
						$exit_code = 1 if $? != 0;
					}
				}
			}
		}
    } else {
		print STDERR "Unable to parse 'xe pbd-param-get' command output\n";
		exit 2;
    }
} else {
	print STDERR "Unable to locate PBD for host-uuid=$host_uuid and sr-uuid=$sr_uuid.  Please check Host and SR UUID then try SR repair.\n";
}

exit $exit_code;
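
For the cron deployment mentioned above, a cron.d entry could be generated along these lines. This is a minimal sketch, not part of the script: the helper name `make_cron_entry`, the install path, and the cron.d file name in the usage note are all assumptions, and the interval is whatever you consider "frequent" for your pool.

```shell
#!/bin/sh
# Print an /etc/cron.d entry that runs the path-check script as root.
# $1 = interval in minutes, $2 = path to the script, $3 = SR UUID
# (all three are supplied by the caller; nothing is detected automatically).
make_cron_entry() {
    # Check the SR UUID shape before emitting anything.
    if ! printf '%s\n' "$3" | grep -Eq '^[a-z0-9]{8}(-[a-z0-9]{4}){3}-[a-z0-9]{12}$'; then
        echo "Invalid SR UUID: $3" >&2
        return 2
    fi
    # cron.d lines carry a user field (root here) between schedule and command.
    printf '*/%s * * * * root %s --sr-uuid=%s\n' "$1" "$2" "$3"
}
```

On each pool member, the output could then be redirected into a file such as `/etc/cron.d/iscsi-path-check` (a hypothetical name), for example `make_cron_entry 2 /usr/local/sbin/check-iscsi-paths.pl "$SR_UUID" > /etc/cron.d/iscsi-path-check`, where the script path is likewise only an example.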
Last edited by arthurr on Wed Mar 14, 2018 4:44 pm, edited 1 time in total.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Wed Mar 14, 2018 9:23 am

I believe this script might be useful to someone.
By the way, is there any news from the Xen developers on the issue?
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Wed Mar 14, 2018 4:27 pm

No update from Xen developers. I am setting up a test instance with a licensed XenServer 7.4 so I can report to XenServer support too.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Fri Mar 16, 2018 3:17 pm

Here's a scenario that just happened.

1. Took down a StarWind node using service shutdown
2. When the StarWind node was restored it had to do a full sync.
3. When the full sync finished Xen restored the path with no admin action.

Does StarWind log out sessions from the out-of-sync instance(s) upon HA failure?

So far the Xen developers' response has been "working as designed".
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Mar 16, 2018 4:27 pm

StarWind blocks iSCSI connections to a device that is in a state other than "Synchronized". This is done to prevent data corruption in case of accidental writes in such situations.
arthurr
Posts: 14
Joined: Fri Mar 02, 2018 4:37 pm

Fri Mar 16, 2018 5:39 pm

Understood, but I'm trying to figure out when the session is no longer reported by "iscsiadm -m session". I've already tried a ban and disconnect in StarWind, and adding CHAP permissions without updating the client config: neither removes the session from the client. As long as the client thinks it has a session, it will try to re-establish it.

May I ask what scenarios were tested during the certification process?