2 node HA split brain

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

muhfugen
Posts: 10
Joined: Wed Nov 23, 2016 11:10 pm

Thu Aug 03, 2017 10:41 pm

From what I've read on the forums (https://forums.starwindsoftware.com/vie ... f=5&t=3440), split brain can be an issue in a 2-node HA configuration when a total network failure occurs between the two nodes. I was wondering if anything has been done to solve this issue in the past three years, such as being able to have a witness?
anton (staff)
Site Admin
Posts: 4008
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Fri Aug 04, 2017 12:48 pm

a) StarWind has redundant heartbeat networks and a dynamic witness (the same way a Windows Server 2012/2016 cluster works), which eliminates the need for an external witness.

b) You can run 3 nodes now.

c) You can have a witness with an upcoming R6 update (end of August 2017). You might want to play with a beta right away.

Any of a), b) and c) solves what you're afraid of.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

muhfugen
Posts: 10
Joined: Wed Nov 23, 2016 11:10 pm

Fri Aug 04, 2017 4:27 pm

Thanks a lot Anton.
muhfugen
Posts: 10
Joined: Wed Nov 23, 2016 11:10 pm

Fri Aug 04, 2017 4:50 pm

anton (staff) wrote:dynamic witness (same way Windows Server 2012/2016 cluster works)
Do you know where I could find more information about configuring a dynamic witness in VSAN? I can't seem to find much documentation on Google beyond forum posts and the Geo Clustering PDF, which mentions it in passing.
anton (staff)
Site Admin
Posts: 4008
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Aug 05, 2017 9:52 am

That's beta functionality. Drop a line to anton AT starwind DOT com and I'll put you in touch with the techies for preview builds and early documentation. Just mention in the subject what it's all about ;)
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Wed Nov 22, 2017 4:26 pm

Let me see if I understand this correctly.

Are you saying that a StarWind Free HA 2-node VSAN is currently susceptible to split-brain failure -- having both hosts coming up and running independently -- and there's nothing we can currently do about it? I.e., there's no witness/quorum functionality at the VSAN level in StarWind Free?

And are you saying that this functionality does exist in the paid-license version?

Is there a document somewhere to which you would refer me to clarify this? Any recommendations on how to prevent it, beyond simple redundancy?

Thanks for any clarification you could provide.

-- Ken
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
anton (staff)
Site Admin
Posts: 4008
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Nov 22, 2017 4:50 pm

I never said anything like that! StarWind vSAN Free is 100% identical to the commercial version in terms of functionality; it's the thick UI and the various support plans that make the difference.

https://www.starwindsoftware.com/whitep ... s-paid.pdf

If you want to play it completely paranoid (only the paranoid survive, according to A. Grove), you can combine redundant heartbeat networks and an external witness. I think we'll release it with our next update. You can apply for the RC right now.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Thu Nov 23, 2017 4:44 pm

Thank you, Anton; pardon my misunderstanding.

But I'm still looking for a better understanding: HOW does StarWind prevent split-brain operation in a two-host HA environment?

For instance, what would happen if all network connectivity between the two physical hosts were suddenly lost, but the hosts were otherwise unaffected, and still reachable by other systems?

Is there, for example, some sort of status on each host that tells them which one is allowed to run independently, and the other not, unless a human intervenes? Like a "quorum stick" that is owned by one of them at any given time? And if the owning host has failed, the other still cannot start without human intervention?

If you could explain, or refer me to documentation, I would appreciate it.

You mentioned an external witness functionality that is due to be released. I would like to know more details about how that works, too.

Some background:
Years ago, we used a different two-host HA storage virtualization product called VM6 VMEX, whose vendor has gone out of business. There was a storm that caused abrupt power failure and partial network hardware failure. When power came back on, the cluster was unable to resume operation at all because of the lack of network functionality between the hosts, or external quorum, until we came on site to resolve things.

As a result of that incident, I've given a lot of thought to the question of two-host quorum: in principle, it doesn't take much to have somewhere to store a status flag, a lock, something giving one host quorum. I've even thought about using ARP cache, cloud-based storage or a symbolic DNS alias.

So I'm really interested in the details.

-- Ken
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Nov 24, 2017 9:34 am

wallewek,

Hope this Knowledge Base article can answer your question.
https://knowledgebase.starwindsoftware. ... planation/
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Fri Nov 24, 2017 3:45 pm

Thank you Boris,

I would say that KB article is incomplete, but it does appear to confirm one of the failure modes I described.
It says:
"If data can't be transferred through the synchronization channel StarWind checks the availability of the second node through the alternate network interface, and shuts down the secondary node in case of synchronization channel failure."
Which implies a primary/secondary or "quorum stick" (my term) approach.

Therefore, if the "primary" host in a 2-host HA cluster abruptly fails, the cluster as a whole will fail as well, because the heartbeat and sync will have stopped, and there will not have been an opportunity for the cluster software to automatically "fail over" to the other host, as would occur in a controlled shutdown.

Thus human intervention will be required for cluster recovery from abrupt failure of the primary host.

I presume that is what the beta witness functionality, described earlier in this thread, is intended to address.

Please provide some information on how that witness will work, and on its infrastructure requirements.

-- Ken
Michael (staff)
Staff
Posts: 317
Joined: Thu Jul 21, 2016 10:16 am

Mon Nov 27, 2017 4:58 pm

Ken,
Let me explain a little bit how it works.
With the heartbeat failover strategy, HA devices have an assigned priority number. If the synchronization channel is down, the StarWind services talk to each other via the heartbeat channel, and the device with the highest priority number is marked as not synchronized by design. If the synchronization and heartbeat channels disappear simultaneously, both devices stay synchronized, which will certainly lead to split-brain. That is why we recommend assigning more independent heartbeat channels during replica creation, to make sure that one of the nodes becomes not synchronized. In summary, with the heartbeat failover strategy, the storage cluster will continue working with only one StarWind node available.
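The heartbeat rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic as stated in this post, not StarWind's actual implementation; the `Node` class, the `resolve_sync_failure` function and the returned status strings are hypothetical names invented for the example.

```python
class Node:
    """Hypothetical model of one side of a two-node HA device."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority      # assigned priority number
        self.synchronized = True      # both sides start in sync


def resolve_sync_failure(a, b, sync_up, heartbeat_up):
    """Apply the heartbeat failover rule to nodes a and b."""
    if sync_up:
        # Synchronization channel is fine, nothing to decide.
        return "ok"
    if heartbeat_up:
        # The nodes can still talk over the heartbeat channel, so the one
        # with the highest priority number is marked as not synchronized.
        loser = a if a.priority > b.priority else b
        loser.synchronized = False
        return "degraded"
    # Neither channel works: no node learns that it should step down,
    # so both stay synchronized and keep serving clients.
    return "split-brain"
```

The last branch is the reason for the recommendation to add more independent heartbeat channels: the rule only works while at least one of them is still alive.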
With the node majority failover strategy, HA devices have only a synchronization channel, and each of them should have a connection to a Witness node, which is part of the HA device but contains no data. In this scenario, the main requirement for keeping nodes operational is an active connection with more than half of the HA device's nodes. Nodes that can communicate with more than half of the device's nodes (including themselves) remain operational. In summary, with the node majority failover strategy with 2 storage nodes and one Witness node, if one node does not see the others, it will mark itself as not synchronized and will reject client connections.
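The majority rule itself is simple arithmetic, and a short Python sketch may make it easier to follow. It assumes only what is stated in this post (a node counts itself plus every node it can reach, and needs more than half of the device's nodes); the function names and state strings are hypothetical and are not taken from StarWind.

```python
def has_majority(reachable_peers, total_nodes):
    """True if this node, counting itself, sees more than half of the nodes."""
    return (reachable_peers + 1) > total_nodes / 2


def node_state(reachable_peers, total_nodes=3):
    """State of one node in a device of 2 storage nodes plus 1 Witness node."""
    if has_majority(reachable_peers, total_nodes):
        return "synchronized"
    # An isolated node steps down and rejects client connections.
    return "not synchronized"
```

With three nodes, a node that still reaches one other node (the partner or the Witness) has 2 of 3 and stays up, while a fully isolated node has 1 of 3 and steps down. With only two nodes and no Witness, `has_majority(0, 2)` is false on both sides, which is why this strategy needs the third, data-less node.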
Once the Witness node feature is released, we will publish documentation about it. Please let us know if you have other questions.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Nov 28, 2017 2:43 am

Thank you very much Michael, that really helps.

One thing: I thought StarWind 3-node clusters already had some sort of quorum function. Do they not?

-- Ken
Michael (staff)
Staff
Posts: 317
Joined: Thu Jul 21, 2016 10:16 am

Tue Nov 28, 2017 10:05 am

In the current build, the StarWind 3-node configuration has only the heartbeat failover strategy.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Nov 28, 2017 9:08 pm

Thank you Michael, that's very helpful.

-- Ken
Michael (staff)
Staff
Posts: 317
Joined: Thu Jul 21, 2016 10:16 am

Tue Nov 28, 2017 11:06 pm

You are welcome :)
Please let us know if you have other questions.