Witness iSCSI targets

enric · Wed Feb 13, 2019 9:16 am

Hi,

In a hyperconverged 3-node scenario, which targets should I connect to the Witness device?

In the tutorial (https://www.starwindsoftware.com/resour ... erver-2016) for a 2-node, there is a text that says:

NOTE: It is recommended to connect Witness Device only by loopback (127.0.0.1) address. Do not connect the target to the Witness Device from the StarWind partner node.

In the 3-node tutorial (https://www.starwindsoftware.com/resour ... -v-cluster) this text is missing, but in the screenshot on how should it looks like, if shows the Witness connected to 2 nodes, which I don't understand.

I think that I should connect Witness only to the local 127.0.0.1 IP, and left the other 2 nodes partner targets disconnected. Is that correct?

Thanks,

Enric

Wed Feb 13, 2019 10:53 am

Hi Enric,
Correct, it is recommended to connect Witness Device only by loopback (127.0.0.1) address. This recommendation is the same for 3-node setup.

Alexey Sergeev · Fri Oct 11, 2019 7:54 pm

Hello, guys.

Could you explain what's the point of connecting witness targets only through the loopback address?
And what exact behavior should we expect during fail over?
I've run a few tests and had some bad results.
Not sure if I've done everything correct, so please could you have a look?

Here's test config for 2-node Hyper-V cluster:

2 x HP DL380 Gen10 with Windows Server 2016 Datacenter
96 GB RAM on each
HPE Smart Array P816i-a SR controller
8 x 1.2 TB 12Gb SAS HDD in RAID 10
4-port 1Gb NIC teamed and connected to Hyper-V switch (1 vNIC for management and VSAN heartbeat, 1 vNIC for cluster heartbeat)
2-port 10Gb NIC (1 port for VSAN synch channel and 1 port for VSAN iSCSI)

I've created and connected targets (one for witness and 2 for CSVs ) according to your guide "https://www.starwindsoftware.com/resour ... erver-2016". Everything seems good, VM's migration works as expected so I decided to break something a little

test 1: node 1 (is not owner of the disk witness) - stop Starwind service
result: all VMs and CSVs are online
iSCSI-targets switch to partner's storage
auto-resynch after service is started again.

test 2: node 2 (owner of the disk witness) - stop Starwind service
result: all VMs and CSVs are online, but disk witness is off
iSCSI-targets switch to partner's storage
auto-resynch after service is started again.

test 3: node 1 (is not owner of the disk witness) - shutdown all network traffic except VSAN
result: node 1 is isolated
all disks are online
all iSCSI-targets on node 2 are connected
VM from node 1 restarted on node 2

test 4: node 2 (owner of the disk witness) - shutdown all network traffic except VSAN
result: cluster is down
need to execute "Start-ClusterNode -ForceQuorum" on node 1
cluster is back online with one node
all CSVs are online, but disk witness is off
all iSCSI-targets on node 1 are connected
all VMs restarted on node 1

test 5: node 1 (is not owner of the disk witness) - shutdown all network
result: node 1 is isolated
all disks are online
all iSCSI-targets on node 2 are connected
VM from node 1 restarted on node 2

As you could see in fourth scenario there's no fail over at all. Cluster is down because quorum is lost even though all local storage connected to the host.
What do you think about it? Did I have any mistakes in the config?

Sat Oct 19, 2019 4:35 am

Alexey,

Thanks for sharing your test results.
In fact, at the stage where all storage is connected and available at the level of StarWind VSAN your question does look related to Microsoft's failover cluster behavior rather than the behavior of StarWind VSAN, as you reported storage was working as expected during the tests. But the general idea here is that the cluster communication network (which is your management network in the majority of cases) is broken and the nodes cannot negotiate witness ownership.

As for the reasons why witness should be connected only locally, the main one is that in such a situation the cluster would not try looking for the dead path(s) to the witness in case one or more paths fail. If you connected the witness disk from both nodes and try breaking paths to the partner node (including the iSCSI link), in certain cases the cluster would shut down even though it has got active loopback session to the cluster just because it could not restore the path to the partner - thus exceeding the failover timeout. To eliminate possibility of such scenarios where the cluster dies all of a sudden we recommend connecting it via loopback only.
So, in a nutshell, connecting witness via loopback with no partner sessions is done to facilitate faster and error-free witness failover.

logitech · Sun Feb 25, 2024 9:57 am

Hi,

does the same configuration for witness work for converged infrastructure (3 compute nodes and 2 vsan storage nodes)?

Sun Feb 25, 2024 11:02 am

Connec the witness from each storage node to each client via one path. Use failover only for simplicity.