vSAN manual failover to avoid split brain

kktwenty
Posts: 17
Joined: Fri Mar 20, 2020 8:33 am

Mon Mar 28, 2022 9:10 am

I am speccing out StarWind vs S2D vs a stretch cluster. We currently use an MS Hyper-V stretch cluster with 4 compute servers and 2 storage SANs split across two physical buildings. I am looking to retire this in favour of 3 storage/compute servers (HCI, single box), be they in an S2D or StarWind configuration - these will be split with two physical boxes in the primary site and one box in the secondary site.

One of the remaining questions is split-brain behaviour. Our current stretch cluster spans two physical buildings connected by a pair of 10Gb links operating in failover mode (only one 10Gb link carries traffic at any one time, so not teamed as such). Primary site nodes have a vote each on the cluster, plus a quorum NAS share. Secondary site nodes have no vote, so they cannot bring the cluster up on their own without manual intervention. This prevents split brain should the pair of links go down (the quorum NAS and internet link are on the primary site network). VMs currently run on the primary site nodes.
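For context, the Microsoft side of that vote/witness arrangement is set with the standard FailoverClusters PowerShell cmdlets - roughly the sketch below (node and share names are placeholders, not our real ones):

    # File share witness on the quorum NAS in the primary building
    Set-ClusterQuorum -FileShareWitness "\\PRIMARY-NAS\ClusterWitness"

    # Primary site nodes keep their cluster vote, secondary site nodes do not
    (Get-ClusterNode -Name "PRI-NODE1").NodeWeight = 1
    (Get-ClusterNode -Name "PRI-NODE2").NodeWeight = 1
    (Get-ClusterNode -Name "SEC-NODE1").NodeWeight = 0
    (Get-ClusterNode -Name "SEC-NODE2").NodeWeight = 0

    # Sanity check of the vote assignment
    Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight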

Does StarWind have the capability to remove a vote from its storage replication node - i.e. I don't want the secondary site taking any "ownership" decisions unless I manually deem so? I know I can set the Microsoft portion of Hyper-V clustering's "vote" for the secondary site, but what about the vSAN storage replication? Should the secondary link go down, I do not want the secondary site to take over automatically in any way. The quorum NAS will remain in the primary site. VMs will only be running on primary site Hyper-V nodes, with no affinity to the secondary node (and no vote on the secondary cluster Hyper-V node).

Incidentally, the primary site compute/storage servers will be directly connected to each other AND to the quorum NAS; this will carry iSCSI, heartbeat, quorum, etc. The third, secondary-site node will connect via the same 10Gb links (via switches) and is thus prone to link/power failure.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Tue Mar 29, 2022 1:28 pm

Here is the document on stretched clustering https://www.starwindsoftware.com/starwi ... clustering and https://www.starwindsoftware.com/resour ... rver-2016/. You might want to set up the stretched cluster with the node majority failover strategy if the system does not satisfy the networking part of the system requirements https://www.starwindsoftware.com/system-requirements.

Split-brain can be avoided by setting up an SMB witness ( https://www.starwindsoftware.com/help/C ... ness.html - not available for VSAN for vSphere for now) OR a witness node as described here https://www.starwindsoftware.com/resour ... -strategy/. Please also note that the witness should be located in a different location to work as intended. If it is located in the same room as any of your servers, there will be a storage outage if the witness and a data node go down together.
The third secondary site node will be via the same 10Gb links (via switches) and thus prone to link power failure.
Please elaborate on the use of the 3rd site.

If you need more assistance with POA tests, please reach out to us at support@starwind.com and use this thread as a reference.
kktwenty
Posts: 17
Joined: Fri Mar 20, 2020 8:33 am

Wed Mar 30, 2022 9:30 am

The Hyper-V clustering side I have sorted. That is already in place and a mature solution that has been tested and used through power failures. This is our CURRENT setup:

We have two physical buildings; the primary building is classed as our "main" building. The internet connection enters here and it has the bulk of our networking. The secondary building is a smaller building connected by a redundant pair of 10Gb lines. To mitigate the risk of loss due to fire, flood or catastrophic loss of the main server room, it was decided to have a resilient system between the two buildings. Currently we use an MS stretch cluster: this has a pair of servers connected to a SAN in the primary building and a pair of servers connected to a SAN in the secondary building. A NAS physically located in the primary building acts as a quorum. The nodes in the secondary building do NOT have a vote in the cluster. This is how the system works now:
Scenario 1 - node in primary site fails. Quorum + second node decide that second node will now host all the VMs until failed node is returned to use.
Scenario 2 - node in secondary site fails. Nothing happens to the running of the cluster other than an error that a secondary node has failed.
Scenario 3 - storage fails in primary site. Redirected I/O to the storage in the secondary site takes place. VMs remain running on primary nodes using redirected IO. Once the primary SAN is back and running there will be heavy network usage as the data is re-synced back from secondary site to the primary site - this is also whilst redirected I/O is taking place (this IS slow!)
Scenario 4 - power failure in secondary site. Nothing happens to the running of the cluster. Heavy IO use when the power is returned and a resync takes place from primary to secondary.
Scenario 5 - power failure in primary site. The cluster shuts down totally and the secondary site needs to be issued with PowerShell commands to come online; the storage is accessible and not marked as "degraded", and the cluster can be brought online with a set of three PowerShell commands. Votes are added manually to the secondary site so that when the primary nodes are brought up they will defer to the secondary site as authoritative. The primary site is then manually started up and a heavy resync from secondary to primary takes place; the primary site acknowledges the secondary site as authoritative since it now has the votes. Communication between secondary and primary MUST work before the primary site is rebooted, to avoid split brain (as quorum + two nodes on the primary would be enough to continue running independently!).
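For reference, the PowerShell involved is essentially the standard force-quorum sequence - roughly the sketch below, with placeholder node names:

    # On a surviving secondary site node: force the cluster service up without quorum
    Start-ClusterNode -Name "SEC-NODE1" -FixQuorum

    # Give the secondary site nodes votes so they act as the authoritative side
    (Get-ClusterNode -Name "SEC-NODE1").NodeWeight = 1
    (Get-ClusterNode -Name "SEC-NODE2").NodeWeight = 1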

There is no internet in the secondary site, so a cloud quorum cannot be used. Since scenario 5 deals with a network failure between the two sites, there would be no common witness visible to both sites. Power is independent at both sites and they are situated 100m apart physically. I have experienced all 5 of these scenarios, both in testing and in practice.

It is scenario 5 that I am most hesitant about with StarWind. My new proposal is for either S2D or StarWind to operate with 3 nodes: two nodes in the primary site and one in the secondary. From my research, S2D can be configured exactly the same way as the Hyper-V cluster: I can add votes to the primary site nodes (+ quorum) and take the vote away from the secondary node. I have not tested this as I am at the planning stage. I would like to see if StarWind can be set up similarly. This is my PROPOSED setup:

Scenario 6 - StarWind+compute nodes 1 and 2 live in the primary site and have visibility of a quorum NAS. The secondary site has StarWind+compute node 3. Network visibility between primary and secondary site is lost - this could be due to power loss, someone cutting through my fibre, or a fire in the primary site. I know already that I can set my cluster to have "no votes", therefore the cluster will need to be brought up manually; however, I don't know whether StarWind node 3 can be brought online so that the hyperconverged cluster on node 3 can be used, or whether StarWind node 3 lists itself as degraded. Once visibility of nodes 1 & 2 (plus the quorum) comes back, how do I let StarWind know that node 3 has been authoritative whilst nodes 1, 2 + quorum have been "down"? I don't want StarWind 1, 2 + quorum deciding that "they" are authoritative whilst node 3 has been running the cluster+storage during the failure.

In a Hyper-V cluster (in this situation) the act of adding votes to my secondary site is enough to convince the cluster that the secondary site is authoritative - even though "technically" the primary site comes back with votes + quorum.

Nodes 1,2,3 will be set as synchronous HA of course. Incidentally backups are stored in primary and secondary sites.

EDIT: Having read https://www.starwindsoftware.com/blog/w ... o-avoid-it what I am looking for is a witness node whereby node 1, node 2 and the witness node each get a vote, and node 3 does not get a vote. Therefore in a normal network partition (the link between primary and secondary is cut, so it is just a communication issue) the primary site will correctly say that the secondary site node 3 is "not synchronised". However, if nodes 1 and 2 are brought down by power failure (along with any suitable witness or quorum), then I would like to use node 3 as the cluster storage (and bring up the cluster manually on node 3).

I did see this caveat at the bottom of the article:
In case if an HA device consists of three storage nodes or two storage nodes and one witness node, only one node can be disabled. In case of failure of two nodes, the third node will also shut down.
which leads me to believe this is not only impossible with StarWind, but also makes a 3-node HA pointless, as it appears StarWind will not operate with 2 out of 3 HA nodes failed? Or have I read this incorrectly?
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Thu Mar 31, 2022 5:14 pm

Greetings,

I think it is better to set up a tech call here :) Please contact us at support@starwind.com

It is important to separate shared storage and the Failover Cluster. With StarWind VSAN, having the witness in the "primary" location will result in downtime for the entire system if that entire room goes dark, as the node in the "secondary" location cannot form a majority, being unable to communicate with the other StarWind servers, and therefore marks itself as not synchronized. This is why I suggest putting the witness in a 3rd location. Also, StarWind VSAN does not allow having 2 witnesses.
From the storage side, your claims are right, but the Failover Cluster also handles compute resource availability. See this article to understand how a cluster handles compute resource availability https://docs.microsoft.com/en-us/window ... clustering. In case of a node failure, VMs will not instantly fail over. This is how Microsoft Failover Cluster itself works. Storage, though, will be instantly available in case of node failure thanks to active-active replication. HINT: this applies only to hyperconverged scenarios.
Scenario 3 - storage fails in primary site. Redirected I/O to the storage in the secondary site takes place. VMs remain running on primary nodes using redirected IO. Once the primary SAN is back and running there will be heavy network usage as the data is re-synced back from secondary site to the primary site - this is also whilst redirected I/O is taking place (this IS slow!)
Performance depends on the link speed. If the connections are a bottleneck, performance will be degraded; otherwise, the VMs notice no difference.
There is no internet in secondary site so a cloud quorum cannot be used. Since scenario 5 deals with a network failure between the two sites there would be no "middle denominator" visible to both sites. Power is independent to both sites and they are situated 100m apart physically. I have experienced all 5 of these scenarios with both testing and in practice
then I would like to use node3 as the cluster storage (and bring up the cluster manually on node3).
This is why I suggest using a node in a different location.
My new proposal is for either S2D or Starwind to operate with 3 node
Do you mean a 3-way mirror?
however I dont know if the starwind node 3 can be brought online for the hyperconverged cluster on node 3 to be used or does starwind node 3 list itself as degraded.
Do you mean a 3-way mirror? If so, it provides 3 storage nodes that carry the same set of data. You cannot set up a 3-way mirror with node majority and a witness, though.
3 node HA pointless as it appears starwind will not operate with 2 HA nodes out of 3 HA nodes failed
It is not pointless. It just does not fit the stretched cluster scenario without fine-tuning. The heartbeat failover strategy allows for having 2 nodes down, while a 3-way replica with node majority and no witness still does not.

I believe there is a slight misunderstanding of the fundamentals of node majority here. The witness is there to help decide which site is to stay online and which is to go down. So, it should be independent of the 2 sites to act properly, as we cannot create 2 witness devices.
kktwenty
Posts: 17
Joined: Fri Mar 20, 2020 8:33 am

Fri Apr 01, 2022 2:55 pm

I have my answer now thank you.

I am happy with my Hyper-V cluster setup. That part is tried and tested and works as intended - indeed I have tested this "in anger" following a failed UPS stress test in my primary building. I am only looking at replacing the underlying storage technology. My new storage and compute will be hyperconverged, be that S2D or StarWind.

From what I can see now, StarWind will perform a 3-way mirror but cannot survive a 2-node failure unless a 4th witness (non-storage) node is used. All 3 storage nodes + the 4th witness node must be in communication with each other for this to take effect.

StarWind will partially support failed-communication/partitioned-network failover. I.e. for a 3-way mirror with a 4th witness: if a network partition occurs whereby nodes 1, 2 + witness are partitioned from node 3, node 3 will be marked as out of sync - the data living on node 3 cannot be used under any circumstances.

For a 3-way mirror with a 4th witness: if nodes 1, 2 + witness completely fail and node 3 continues, node 3 will be marked as out of sync - the data living on node 3 cannot be used under any circumstances - the data is lost.

Microsoft works differently with their HA nodes and can operate with 1 node remaining out of 3 nodes + quorum, even if the quorum has died; this is the MS active-passive model of a stretch cluster. The remaining passive node must be activated manually, but the data is not lost. When resuming from this scenario, the returning nodes + quorum accept the single 3rd node as authoritative and resync back from it. After the resync, the 3rd node can be demoted again to passive. I see that StarWind cannot operate in this fashion, so my answer is given.
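As far as I understand it, the Microsoft mechanism behind this is forcing quorum on the surviving node and then starting the returning nodes with the prevent-quorum switch, so that they join as non-authoritative and resync from the forced node - roughly (placeholder node names):

    # On the surviving node 3: force the cluster up without quorum
    Start-ClusterNode -Name "NODE3" -FixQuorum

    # On the returning nodes 1 and 2: start with PreventQuorum so they defer to node 3
    Start-ClusterNode -Name "NODE1" -PreventQuorum
    Start-ClusterNode -Name "NODE2" -PreventQuorum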

The only way I can see a 3-way mirror + witness working as HA is to have an infallible witness, which (by design) can only happen with a cloud witness in a normal disaster mitigation plan of separate buildings. Take this diagram:
[Attachment: starwind.jpg]
Nodes 1 + 3 could fail, or nodes 1 + 2 could fail - both of these are fine, as the remaining node(s) + witness will keep the data up. However, losing the witness + nodes 1 + 2 (i.e. a fire destroys building one) means the data on node 3 is useless. StarWind would not suit me, as the solution cannot be made to work without me finding a way to have a resilient cloud connection in both buildings for a witness - and not just a witness, but a full computational witness node.

If I move the witness to building two, then I risk split brain if I lose the interconnect (which has happened once, when a truck drove through the link). In this scenario, if the building interlink drops, there is enough quorum on both sides to operate: we have nodes 1 + 2 on one side and node 3 + witness on the other. However, I COULD survive a fire in building ONE.
[Attachment: starwind2.jpg]
Ironically, in this scenario I would be relying on the fact that my Hyper-V cluster WOULD NOT start on node 3, since node 3 is passive. However, once the link is resumed I have no idea what StarWind will do, as nodes 1 and 2 will have been running in mirror and node 3's StarWind will also have been running (albeit with no data change, since there would have been no cluster running on node 3, as I would not have manually started the cluster on 3).

I am not familiar with VMware vSAN, but a quick Google would seem to suggest that VMware stretched vSAN supports the manual start of a node. This is via active-passive, the same as Hyper-V/S2D. Again, if the active node(s) is (are) "down", you need to provide the logic as to what to do (as per the Hyper-V/S2D solution). In my case I would say node 1 and node 2 are "active" and node 3 is "passive", thus meaning the data on node 3 is not lost; it is simply not used until I designate it "active". I have no intention of going to VMware; I was merely curious how they support a nodes+witness failure in a 3-way stretch.

I am content with active-passive downtime for the loss of building one, and active-active for the loss of a single node. The only part I cannot fathom is what happens when a StarWind cluster goes down. Can you manually bring up a remaining node and have it function? What happens when the OTHER nodes + witness are brought online after a manual start?

Thank you for your time explaining the StarWind side. It won't be worth exploring your tech support as I was only looking at the "free" version, not the trial or paid.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Thu Apr 07, 2022 8:43 pm

Hi,

As suggested, please let us have a tech call. Log a case with support@starwind.com. I guess the discussion will go faster and more smoothly during a call.
Starwind will perform a 3 way mirror but cannot survive 2 node failure
Not exactly. In this particular configuration this is correct, but it does not apply to 3-way mirror setups in general.

What you could try is a small redesign of the system to make it more stable. Say, why not use a VM in the cloud or a PC in another room?
The only way I can see a 3 way node + witness working as HA is to have an infallible witness which (by design) can only happen with a cloud witness in a regular disaster mitigation plan of separate buildings
This is not the only way. We can look at what split brain fundamentally is. What is possible (BUT TOTALLY CANNOT BE RECOMMENDED) is strict data locality. Say there are 2 devices replicated with the heartbeat failover strategy in a 2-node system. One device has 1st priority in building one, the other device has 1st priority in building two, so VMs always run on the device that has the highest priority. When a network incident happens, split brain happens too, but no data arrives at the device with the lower priority.

What you can also do is set up a 2-way mirror in building one and set up Veeam Backup and Replication in building two: you will then have primary and DR sites.
If I move the witness to building two then I risk split brain if I lose the interconnect
The witness helps to form the majority. It puts the storage in an "all or nothing" situation: the storage is either available or it is not. It should not cause a split brain with Node Majority.