Hi,
OK, first I'm sorry that I've taken this thread off on a tangent from the OT. But it's all about HA so at least it's related! Secondly, sorry for the really long post...
Just to clarify: I knew that Starwind 5 was going to be susceptible to split brain failure because it was one of the first scenarios I tested during the beta. I've planned around it, but I still think it needs to be addressed in the product. I don't claim that it's easy to do.
The actual show stopper for me is that I can't rely on resync. Since I joined this thread, I found out about 5.2 and repeated my tests with that, and since then I have also received an unreleased (so far) patch from support that wasn't designed to address my specific problem, but does seem to have helped a little bit.
BTW, I've been a Starwind customer since 4.1, I currently run 4.2 in production, and I have bought the upgrade to 5 HA. I have no problems with Starwind 4.2. But I need HA, and have invested in a lot of kit to deliver it!
OK, I've been asked about my setup, so here goes...
1) I run Starwind to provide iSCSI storage for a Hyper-V cluster. That cluster has gone from Windows 2008 to 2008 R2 with live migration and CSV
2) Originally I ran 4.1 on a physical server running Windows 2008
3) Because of problems with that server (nothing to do with Starwind), and also to get ready for Starwind 5, I ditched it and migrated to running 4.2 inside a VM on one of my two servers that will be running Starwind 5. I intend to continue to virtualize Starwind, as there are lots of benefits and no real drawbacks in my experience.
4) So I've got two identical nodes. One has Starwind 4.2 and Starwind 5 (now 5.2) each in their own VMs, and the other just has Starwind 5. Because I'm still testing Starwind 5, I've only given each VM a boot disk and a data disk (actually each is a RAID-1 pair of disks).
hardware specs... both nodes are identical...
Chassis: 4U Supermicro, with 8x 3.5" hotswap and 3x 5.25" bays each with 4x 2.5" mobile racks - so I have capacity for 8x 3.5" disks and 12x 2.5" disks internally. Redundant power supplies, plus my own UPS and the datacentre's UPS.
Motherboard: Supermicro X8DTH-6F
http://www.supermicro.com/products/moth ... DTH-6F.cfm - this is dual-CPU capable, has 7x PCIe 2.0 slots (and enough lanes to run them all at 8x) and 12 DDR3 slots
CPU: Only one Xeon L5520 @ 2.2GHz. This is a quad-core Nehalem, low voltage
RAM: 6GB DDR3
NIC: Intel dual port 10GbE (CX4 connectors)
RAID: the motherboard has LSI SAS-2 RAID, which is being used for the 3.5" drives. Also an Areca 1680ix-24, which is a 28-port RAID card with 4GB of battery-backed cache. 12 ports are for the 2.5" drives; the remaining 14 are routed to external ports for future JBOD expansion
Drives: I only use RAID-1, and have examples of 320GB, 500GB, 2TB SATA, and 73GB, 300GB SAS as well as some Intel X25-M SSDs.
Software:
Physical server is running Windows 2008 R2, with the Hyper-V role installed
Starwind VMs are Windows 2008 R2 booting off VHDs on the physical server, but targets are on pass-through drives. I almost always use IMG targets
The physical servers are not domain-joined, but the VMs are
Network:
Onboard LAN is used to manage the physical servers. Hyper-V virtual switches are bound to the 10Gbit ports.
One 10Gbit port goes directly to the other node, no switch involved. Cheap and low latency!
The other port goes to a Dell 6224 (24x 1G + 4x 10G), one switch per node.
Hyper-V nodes have 2x 1G ports in a team, with one port going to each switch, and a single 10G port going to one switch.
The two switches are connected by a 6x 1G LAG that carries all VLANs. Because there is only one production Starwind at the moment, this means that half my Hyper-V nodes are limited by having to traverse that LAG, but they don't produce enough I/O for this to be a problem yet. When I deploy Starwind 5 HA, I intend to set up the initiators so that they prefer the Starwind node they can reach by pure 10G.
The switches have redundant power supplies, and are on the same UPS as the Starwind servers, and the whole building has a UPS too.
VLANs: due to a port shortage, I unfortunately have to have my iSCSI and management traffic in the same VLAN. However, the management traffic is tiny, and I've got 10G to play with. The VMs that use Starwind for VHD storage are themselves each in their own VLANs and are also firewalled off, and their traffic never touches the iSCSI network.
Performance: I've given my production Starwind VM just one core and 1GB RAM, and it is hardly taxed even though I've got about 40 VMs running off it. It's only by doing benchmarks that I can really stress it! Generally I see network spikes of up to 25% on the 10G connection, but the average is below 3%. Doing a benchmark against SSD storage I can get to 50% before the CPU maxes out, so given that I can quadruple the CPU allocation, I could easily cope with 100% of 10G given fast enough drives. With more/faster CPUs, more RAID cards, SSDs, and more NICs, I think I could push 3-4GB/sec before needing more iSCSI servers. I've not done performance testing with 5 yet, though - I'm still trying to get it to work reliably!
Problems I've had with sync:
1) The Starwind service on the node that is out of sync will sometimes spontaneously shut down (this still happens in 5.2 with patches)
2) With 5.0 (not repeated yet in 5.2) I had a situation where Starwind on the OK node crashed while syncing with the previously failed node - corrupting data on both nodes
3) If Starwind is spontaneously shutting down, this can get into a repetitive cycle where I can't recover that node. In this situation, I have sometimes been able to keep it running long enough to remove the target, then delete the img and sync file, remove the target from the other node, and recreate it using the img on the non-failed node. But this obviously means downtime, and will be extremely complicated and time-consuming with multiple targets
I'm heartened that with 5.2 I found I can force a full resync, which has helped once.
Jlaay: I had a face-slapping moment when you said to check the time on the system clocks; sure enough, I'd stupidly left the clock sync (with the physical server) option on in the Hyper-V settings and they'd got out of sync. Turning this off ensured that both nodes had the same time from the domain controller, which I verified, but unfortunately I still get the same issues. There aren't any networking issues that I can find.
How do I know I got data corruption? Well, firstly, both nodes are shown as out of sync by Starwind. With 5.0 I couldn't force a resync from one to the other; with 5.2 I've found that you can. But that's only useful if you know that one of your nodes has a good copy of the img. The other signs: apps running on the initiator machine fail - e.g. a continuous ATTO benchmark reports an error, a VHD creation fails, Windows Explorer hangs, Disk Manager hangs, or for a really cool demo - you set up a VM on your Hyper-V cluster and start installing Windows on it, and although the VM doesn't crash, the Windows installation inside it complains that data it thought it had written isn't there any more and you have to start again!
Since 5.2 I haven't got as far as data corruption, but I've not pushed it too hard yet - I'm being hit by the Starwind service shutting down before I can resync.
OK, now on to split brain...
Why I think I'm unlikely to be affected by a split brain failure....
1) Although I have a separate network for sync, it's on the same NIC as the network used to talk to initiators. Therefore, unless just one port fails, a NIC failure would also take out the initiator network, meaning no data would be written to that node
2) CX4 cables are hard to detach accidentally, and the servers are right at the top of the rack! It's a short (50cm) cable too, not dangling anywhere
3) When I run Starwind 5 HA in production, I won't be using round-robin MPIO, as I've found performance to be lower than when using MPIO active/passive failover - probably the overhead of the MPIO load-balancing algorithm, or due to one path being 10G and the other 1G. With failover only, I won't be writing to both nodes simultaneously, which hopefully lessens the risk of data corruption in a split brain scenario...
However...
As I expand I will have to put more NICs in the servers. I will then have to use teaming for the sync channel. NIC team failover hasn't been as robust as MPIO in my experience, so I'm not sure if I can trust it...
At some point I want to run HA between datacenters (fast - 100Mbit, 1Gbit and even 10Gbit metro Ethernet within central London is almost affordable) so that I can survive a disaster at one DC, and live migrate between DCs. Now, even though they will claim I am getting diverse fiber, chances are that there will still be a point where the runs are physically close to each other, and some idiot with a pickaxe will find a way to break them both. I don't want to end up with corrupt data in both nodes!
Switches and NICs can fail. Drivers can fail. Cables can fail. Lots of things can go wrong, no matter what you do with teaming, MPIO and UPSes, which could lead to the situation where your sync network is dead, but your initiators can still see both nodes and are writing to both of them. In this situation, you want your high availability cluster to shut down a node so that your data isn't corrupted.
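To make that failure mode concrete, here's a toy Python sketch (nothing to do with Starwind's internals - the node dicts and the write helper are purely made up for illustration) showing how the two copies silently diverge once the sync link is cut but both nodes keep accepting writes:

[code]
# Toy illustration only: each "node" is just a dict of block -> contents.
node_a = {}
node_b = {}

def write(block, value, sync_link_up):
    """Write to node A, and mirror to node B only while the sync link is up."""
    node_a[block] = value
    if sync_link_up:
        node_b[block] = value

# Normal operation: both copies stay identical.
write("block-1", "payroll v1", sync_link_up=True)

# Sync link dies, but initiators can still reach both nodes and keep writing;
# some writes land on A, some only on B, depending on which path gets used.
write("block-1", "payroll v2", sync_link_up=False)
node_b["block-2"] = "invoice v1"   # a write that only ever reached node B

print(node_a)   # {'block-1': 'payroll v2'}
print(node_b)   # {'block-1': 'payroll v1', 'block-2': 'invoice v1'}
# Neither node now holds a complete, consistent image - which is why an
# isolated node should stop serving writes rather than carry on.
[/code]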
Suggestions:
I've been doing clustering for about 10 years but I am definitely not an expert on the subject, so this is just a brain dump...
- the primary node for an HA target needs to take it over in the event of a split brain, and the secondary needs to shut down
- all networks that connect the two nodes should be used for heartbeats
- actual sync data (i.e. changed data that one node is sending to the other) should be able to go over more than one network (MPIO, not teaming) and the user should be able to specify which (e.g. if you have 4x 10G and 2x 1G, you might want 2x 10G for talking to initiators and 2x 10G for sync, but you want all six for heartbeat)
- witness server - perhaps a lightweight (Starwind free?) instance that is connected to both HA nodes via the heartbeat/sync networks - if a node can't see the witness, it knows that it should shut down. This is how SQL Server mirroring works (as opposed to SQL on a Windows failover cluster). There's a rough sketch of this idea after the list.
- mandate that round-robin MPIO NOT be used for HA, to reduce the risk of data being written to a node that should have shut down
- only confirm a write to the initiator once you've confirmed that both nodes have written it to disk
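To show what I mean by the heartbeat/witness idea, here's a rough Python sketch of the decision logic. This is not Starwind code - the addresses, the port, and the can_reach/decide_role helpers are all made up for illustration - but it captures the rule I'd like: heartbeat over every shared network, and a node that loses both its partner and the witness stops serving, so at most one side keeps accepting writes.

[code]
# Sketch only: hypothetical witness/heartbeat arbitration, not Starwind's mechanism.
import socket

# Assumed addresses - in my setup these would be the direct 10G sync link,
# the switched path, and a small third "witness" box on the LAN.
HEARTBEAT_PEERS = {
    "partner-sync": ("10.0.0.2", 3260),
    "partner-lan":  ("192.168.1.2", 3260),
    "witness":      ("192.168.1.10", 3260),
}

def can_reach(addr, timeout=1.0):
    """True if a TCP connection to addr succeeds within the timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def decide_role(am_primary):
    """Decide whether this node should keep serving its HA targets."""
    # Heartbeat the partner over *every* network that connects the nodes,
    # not just the dedicated sync channel.
    partner_alive = any(can_reach(addr)
                        for name, addr in HEARTBEAT_PEERS.items()
                        if name.startswith("partner"))
    witness_alive = can_reach(HEARTBEAT_PEERS["witness"])

    if partner_alive:
        return "serve"      # normal operation, keep syncing
    if witness_alive and am_primary:
        return "serve"      # partner gone, but we hold the tie-breaker
    return "shutdown"       # isolated: stop serving to avoid split brain

if __name__ == "__main__":
    print(decide_role(am_primary=True))
[/code]

Obviously in a real product the heartbeat would be a proper protocol rather than a bare TCP connect, and the primary/secondary roles would be per-target, but hopefully it makes the intent clearer.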
Finally, a question: I've never tried enabling the CRC option in the MS initiator - would that protect against data corruption in split-brain scenarios?
cheers,
Aitor