Hyper-V Failover Cluster MPIO Issue

Branin · Mon May 25, 2015 9:19 pm

I'm creating a 2-node cluster as outlined at https://www.starwindsoftware.com/starwi ... -v-cluster, and attempting to follow the directions as exactly as possible. However, I'm running into 2 small issues:

1) Page 13 says to essentially repeat the steps on the second node that were done on the first node and even gives a screenshot. Unfortunately, I don't have the ability to connect to the witness target on my first node through 127.0.0.1 (obviously). I can connect to the witness target on my second node through 127.0.0.1, which I've done. But this doesn't match the screenshot (which apparently has the witness target on the first node connected via the second node). Please see my screenshot from my systems in the Targets.jpg attachment. I'm assuming this is fine and the screenshot is just from the first node (meaning, each node should connect to its "own" witness via 127.0.0.1)?

Attachment: Targets.jpg (108.73 KiB)

2) When I attempt to validate my configuration in the Failover Cluster Manager, everything works great except for one issue, relating to MPIO on my test disk 0 (which I assume is the witness disk). I get a warning that "Test Disk 0 from node ... has 1 usable path(s) to storage target" for each of my nodes. Please see the screenshot in the MPIO.jpg attachment. Is this correct? Will this cluster still be supported by Microsoft? Am I doing something wrong?

Attachment: MPIO.jpg (172.07 KiB)

Thank you all in advance for all your help!

Branin

Wed May 27, 2015 11:25 am

Hi Branin,

The witness should be connected only locally. The configuration shown in your screenshots looks good.

This message in Failover Cluster Validation means that you have to connect to each iSCSI storage target TWICE (assuming you have two different NICs for iSCSI connections). So just click Connect on the already connected targets once again and enter or select the second NIC's addresses for iSCSI.

Branin · Wed May 27, 2015 12:31 pm

I'm confused. How can I connect twice to the witness, but only with localhost?

For example, on my first node (NW-VMHOST01), I'm connected to:

NW-VMHOST01-WITNESS: 127.0.0.1
NW-VMHOST01-CSV1 and NW-VMHOST02-CSV1: 127.0.0.1, 172.16.120.2, 172.16.125.2 (the .2 addresses are on my second node, connected via different NICs)
NW-VMHOST01-CSV2 and NW-VMHOST02-CSV2: 127.0.0.1, 172.16.120.2, 172.16.125.2

My second host is very similar, except with .1 addresses instead of .2.

In Failover Cluster Validation, Test Disk 1 and 2 are just fine, each with 3 usable paths, according to the message. However, Test Disk 0 only has one path (which makes sense, since it is only connecting via 127.0.0.1). How do I either connect to the witness via multiple paths, or make it ok (i.e. no warnings) for only 1 path to exist?
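To make the path counts concrete, here is a minimal Python sketch that tallies the addresses listed above for NW-VMHOST01 and flags any disk with fewer than two paths. The disk labels and the two-path threshold are assumptions chosen for illustration; the validation wizard's actual rule is not documented in this thread.

```python
# Portal addresses each disk is reached through on NW-VMHOST01, as listed above.
# The disk labels and the two-path threshold are illustrative assumptions.
paths = {
    "WITNESS": ["127.0.0.1"],
    "CSV1": ["127.0.0.1", "172.16.120.2", "172.16.125.2"],
    "CSV2": ["127.0.0.1", "172.16.120.2", "172.16.125.2"],
}

MIN_PATHS = 2  # assumed: fewer paths than this triggers the warning


def single_path_disks(paths_by_disk, min_paths=MIN_PATHS):
    """Return the disks that have fewer than min_paths usable paths."""
    return [disk for disk, addrs in paths_by_disk.items() if len(addrs) < min_paths]


for disk, addrs in paths.items():
    print(f"{disk}: {len(addrs)} usable path(s)")
print("Would warn on:", single_path_disks(paths))
```

Only the witness is flagged, which matches the single warning about Test Disk 0.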

Thank you!

Branin

Branin · Wed May 27, 2015 6:22 pm

If I set up my witness with three connections as such:

On NW-VMHOST01:
NW-VMHOST01-WITNESS, connecting to: 127.0.0.1, 172.16.120.1 (connecting via 172.16.120.1 [yes, the same IP]), 172.16.125.1 (connecting via 172.16.125.1 [yes, the same IP])

and on NW-VMHOST02:
NW-VMHOST02-WITNESS, connecting to: 127.0.0.1, 172.16.120.2 (connecting via 172.16.120.2 [yes, the same IP]), 172.16.125.2 (connecting via 172.16.125.2 [yes, the same IP])

then I don't get the warning in the Failover Cluster Validation report. Is this what I'm supposed to do? It doesn't exactly feel right...

Thanks.

Branin

Thu May 28, 2015 10:24 am

Not exactly. You have to connect the witness only locally (127.0.0.1) and only once.
You have to connect CSV1 and CSV2 locally (once) and remotely (twice).

Branin · Thu May 28, 2015 1:44 pm

When I connect it only once, I get the warning message in Failover Cluster Validation about only having a single path to the witness disk. How do I keep Failover Cluster Validation happy with only having one connection to the Witness disk? (The CSV1 and CSV2 disks are correct and working great.)

Thanks!

Branin

Fri May 29, 2015 9:09 am

Connect the witness disk twice. Run validation. Then remove the remote connection.

However, this is only a warning. It can be ignored.

Branin · Mon Jun 01, 2015 12:21 pm

Thank you very much!

Branin

Tue Jun 02, 2015 10:53 am

Please keep us updated about your progress! Thank you!

Branin · Wed Jun 03, 2015 7:53 pm

Everything is mostly going great, except I have a high-availability problem. :-(

Basically, I followed the directions exactly and have 2 nodes with 2 CSVs total and 1 VM on each of the CSVs (for testing purposes). I can move the VMs and CSVs between nodes all day long, without issue. I can shut down one of the nodes and everything moves over to the remaining node and stays up. The problem occurs when I unplug the power cable from one of the servers (I know that the server shouldn't ever just lose power like this, but you never know, right?). Basically, the CSV on the node without power goes to "Online Pending" and then "Failed" (and obviously, the VM that is on the CSV doesn't come back online).

In the Cluster Event Log, I have the following errors, in order (I put both CSVs and VMs on the same node just for testing):

--- 11:31:22
Cluster node 'NW-VMHOST02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
--- 11:31:22
Cluster Shared Volume 'CSV2' ('CSV2') has entered a paused state because of '(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
--- 11:32:07
Cluster resource 'CSV1' of type 'Physical Disk' in clustered role 'c27d43f6-bfd8-4461-976b-bce64eeb549a' failed. The error code was '0x45d' ('The request could not be performed because of an I/O device error.'). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
--- 11:32:12
Cluster resource 'CSV1' of type 'Physical Disk' in clustered role 'c27d43f6-bfd8-4461-976b-bce64eeb549a' failed. The error code was '0x45d' ('The request could not be performed because of an I/O device error.'). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
--- 11:32:12
The Cluster service failed to bring clustered role 'c27d43f6-bfd8-4461-976b-bce64eeb549a' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
--- 11:32:12
Clustered role 'c27d43f6-bfd8-4461-976b-bce64eeb549a' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
--- 11:32:22
Cluster resource 'CSV2' of type 'Physical Disk' in clustered role '2fc8117b-8a0a-4368-8270-a0c01fd03ff1' failed. The error code was '0x45d' ('The request could not be performed because of an I/O device error.'). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
--- 11:32:27
Cluster resource 'CSV2' of type 'Physical Disk' in clustered role '2fc8117b-8a0a-4368-8270-a0c01fd03ff1' failed. The error code was '0x45d' ('The request could not be performed because of an I/O device error.'). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
--- 11:32:27
The Cluster service failed to bring clustered role '2fc8117b-8a0a-4368-8270-a0c01fd03ff1' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
--- 11:32:27
Clustered role '2fc8117b-8a0a-4368-8270-a0c01fd03ff1' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
---
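As a rough illustration of how quickly the roles gave up, the following Python sketch computes each event's offset from the moment the node dropped out. The timestamps are transcribed from the log above; the one-line summaries are paraphrases, not the original event text.

```python
from datetime import datetime

# (time, summary) pairs transcribed from the event log above; summaries are paraphrased.
events = [
    ("11:31:22", "NW-VMHOST02 removed from cluster membership"),
    ("11:31:22", "CSV2 paused (c000020c)"),
    ("11:32:07", "CSV1 failed (0x45d)"),
    ("11:32:12", "CSV1 failed (0x45d); role exceeded failover threshold"),
    ("11:32:22", "CSV2 failed (0x45d)"),
    ("11:32:27", "CSV2 failed (0x45d); role exceeded failover threshold"),
]


def seconds_since_first(entries):
    """Return (seconds after the first event, summary) for each entry."""
    start = datetime.strptime(entries[0][0], "%H:%M:%S")
    return [
        (int((datetime.strptime(t, "%H:%M:%S") - start).total_seconds()), text)
        for t, text in entries
    ]


for offset, text in seconds_since_first(events):
    print(f"+{offset:>3}s  {text}")
```

Both roles exhausted their failover attempts within about 65 seconds of the node being removed from membership.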

After I wait a little while, I'm able to right-click on the CSV in Failover Cluster Manager and select "Bring Online", which works just fine. What can I do to get the CSVs to fail over more quickly (or alternatively, have the cluster keep attempting for a little while longer until everything is connected on the "new" node)?

Thank you!

Branin

Branin · Thu Jun 04, 2015 8:38 pm

If it helps at all, I'm seeing lines like the following in my cluster log (Get-ClusterLog) for one of the CSVs. The other one sometimes transitions fine and other times fails.

00001068.00001e00::2015/06/04-17:42:46.431 INFO [RES] Physical Disk : SetDiskInfo(2)

00001068.00001e00::2015/06/04-17:42:46.431 INFO [RES] Physical Disk : Arbitrate - Node using PR key 9a6f746e0001734d

00001068.00001e00::2015/06/04-17:42:46.432 ERR [RES] Physical Disk : Failed to register key, status 2

00001068.00001e00::2015/06/04-17:42:46.432 ERR [RES] Physical Disk : ResHardDiskArbitrateInternal: PR Arbitration for disk Error: 2.

00001068.00001e00::2015/06/04-17:42:46.432 ERR [RES] Physical Disk : OnlineThread: Unable to arbitrate for the disk. Error: 2.

00001068.00001e00::2015/06/04-17:42:46.432 ERR [RES] Physical Disk : OnlineThread: Error 2 bringing resource online.

00001068.00001e00::2015/06/04-17:42:46.432 INFO [RES] Physical Disk : HardDiskpSetUnsetDiskFlags(mask=0x00000002, SetCluster=0, SetCsv=0, SetMaintenanceMode=0, Notify=1, Update=0) for device=2

00001068.00001e00::2015/06/04-17:42:46.432 ERR [RHS] Online for resource Cluster Disk 3 failed.

And then, later on in the log:

00000b6c.00000d84::2015/06/04-17:42:46.432 INFO [RCM] Will retry online from long delay restart of Cluster Disk 3 in 900000 milliseconds.

which explains why about 15 minutes later, the CSV has come online (although the VM stored on that CSV has long since failed and won't come back online for a while).
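For anyone sifting through a similar log, here is a minimal Python sketch that pulls the ERR entries out of lines like the ones above and converts the retry delay to minutes. The regular expressions reflect an assumed line layout inferred from this excerpt only, not a documented cluster log format.

```python
import re

# Sample lines copied from the cluster log excerpt above.
LOG_LINES = [
    "00001068.00001e00::2015/06/04-17:42:46.432 ERR [RES] Physical Disk : Failed to register key, status 2",
    "00001068.00001e00::2015/06/04-17:42:46.432 ERR [RHS] Online for resource Cluster Disk 3 failed.",
    "00000b6c.00000d84::2015/06/04-17:42:46.432 INFO [RCM] Will retry online from long delay restart of Cluster Disk 3 in 900000 milliseconds.",
]

# Assumed layout: <pid>.<tid>::<timestamp> <LEVEL> [<component>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::(?P<ts>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[A-Z]+)\]\s+(?P<message>.*)$"
)
RETRY_RE = re.compile(r"in (\d+) milliseconds")


def parse(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def retry_delay_minutes(message):
    """Return the retry delay in minutes if the message states one, else None."""
    match = RETRY_RE.search(message)
    return int(match.group(1)) / 60000 if match else None


records = [r for r in map(parse, LOG_LINES) if r]
errors = [r for r in records if r["level"] == "ERR"]

for r in errors:
    print(f"{r['ts']} [{r['component']}] {r['message']}")

for r in records:
    delay = retry_delay_minutes(r["message"])
    if delay is not None:
        print(f"Retry scheduled in {delay:g} minutes")
```

The 900000 milliseconds in the RCM line works out to 15 minutes, which lines up with the delay before the CSV comes back by itself.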

Obviously, waiting 15 minutes after a hard failure for the CSVs to reliably come online (and much longer for the VMs themselves to actually spin up) isn't workable, so I'd like to fix this if at all possible.

Thanks!

Branin

Branin · Thu Jun 04, 2015 11:43 pm

I've saved some logs from my Cluster, if that helps at all. I see a bunch of "MountTargetDll failed with error code 1627!" errors at the end, but am not sure if they are a cause of my problem or a symptom. NW-VMHOST01 is the node that I'm keeping up (and downloading the logs from), while NW-VMHOST02 is the node that I'm pulling power on. Also, CSV2/Cluster Disk 2 seems to come online fairly often after a power cycle, but CSV1/Cluster Disk 3 is the one that fails nearly every time (until the 15 minutes pass and it comes online by itself, or a minute or so passes and I can manually bring it online).

I ran the entire process twice, once saving the ErrorsAndWarnings log (with the log level set to 2) and once saving the Errors log (with the log level set to 1).

Errors.txt is https://www.dropbox.com/s/ertx2gngjz4ji ... s.txt?dl=0 while ErrorsAndWarnings.txt is https://www.dropbox.com/s/o6btfq91aqbli ... s.txt?dl=0.

Thanks in advance!

Branin

bubu · Mon Jun 08, 2015 1:05 pm

Hi,

I encountered a similar issue. Make sure that you completely understand the concept of a failover cluster. What you're trying to achieve should be possible if the CSVs and the witness are owned by different servers.

Branin · Mon Jun 08, 2015 6:46 pm

They are (the witness is typically on NW-VMHOST01 and I'm putting the CSVs on NW-VMHOST02). It doesn't matter. Generally, at least one of the CSV failovers doesn't finish successfully (it times out, leading to a "Failed" status). 15 minutes later, the CSV is good, but then I still need to wait a while for the VM itself to come back online. I'm getting pretty close to giving up.

Branin

bubu · Wed Jun 10, 2015 12:43 pm

Let me show you what I've found:

http://knowledgebase.starwindsoftware.c ... ting-note/

CNIXInitiatorISCSITransport::MountTarget: EXITing with failure, MountTargetDll failed with error code 1627!
11/28 0:54:56.643 1928 Sw: *** ConnectPortal: An attempt to connect timed out without establishing a connection or connect was forcefully rejected

There were loads of these errors when one of the nodes was down. This is normal, since the node did not accept external connections.

It doesn't shed any light on the issue, but it should at least ease your worries about those errors. I hope StarWind support gets back to you with advice soon.