Hi,
OK, first I'm sorry that I've taken this thread off on a tangent from the OT. But it's all about HA so at least it's related! Secondly, sorry for the really long post...
Just to clarify: I knew that Starwind 5 was going to be susceptible to split brain failure because it was one of the first scenarios I tested during the beta. I've planned around it, but I still think it needs to be addressed in the product. I don't claim that it's easy to do.
The actual show stopper for me is that I can't rely on resync. Since I joined this thread, I found out about 5.2 and repeated my tests with that, and since then I have also received an unreleased (so far) patch from support that wasn't designed to address my specific problem, but does seem to have helped a little bit.
BTW, I've been a Starwind customer since 4.1, I currently run 4.2 in production, and I have bought the upgrade to 5 HA. I have no problems with Starwind 4.2. But I need HA, and have invested in a lot of kit to deliver it!
OK, I've been asked about my setup, so here goes...
1) I run Starwind to provide iSCSI storage for a Hyper-V cluster. That cluster has gone from Windows 2008 to 2008 R2 with live migration and CSV
2) Originally I ran 4.1 on a physical server running Windows 2008
3) Because of problems with that server (nothing to do with Starwind), and also to get ready for Starwind 5, I ditched it and migrated to running 4.2 inside a VM on one of my two servers that will be running Starwind 5. I intend to continue to virtualize Starwind, as there are lots of benefits and no real drawbacks in my experience.
4) So I've got two identical nodes. One has Starwind 4.2 and Starwind 5 (now 5.2) each in their own VMs, and the other just has Starwind 5. Because I'm still testing Starwind 5, I've only given each VM a boot disk and a data disk (actually each is a RAID-1 pair of disks).
hardware specs... both nodes are identical...
Chassis: 4U Supermicro, with 8x 3.5" hotswap and 3x 5.25" bays each with 4x 2.5" mobile racks - so I have capacity for 8x 3.5" disks and 12x 2.5" disks internally. Redundant power supplies, plus my own UPS and the datacentre's UPS.
Motherboard: Supermicro X8DTH-6F
http://www.supermicro.com/products/moth ... DTH-6F.cfm - this is dual-CPU capable, has 7x PCIe 2.0 slots (and enough lanes to run them all at 8x) and 12 DDR3 slots
CPU: Only one Xeon L5520 @ 2.2GHz. This is a quad-core Nehalem, low voltage
RAM: 6GB DDR3
NIC: Intel dual port 10GbE (CX4 connectors)
RAID: the motherboard has LSI SAS-2 RAID, which is being used for the 3.5" drives. Also an Areca 1680ix-24, which is a 28-port RAID card with 4GB of battery-backed cache. 12 ports are for the 2.5" drives; the remaining 14 are routed to external ports for future JBOD expansion
Drives: I only use RAID-1, and have examples of 320GB, 500GB, 2TB SATA, and 73GB, 300GB SAS as well as some Intel X25-M SSDs.
Software:
Physical server is running Windows 2008 R2, with the Hyper-V role installed
Starwind VMs are Windows 2008 R2 booting off VHDs on the physical server, but targets are on pass-through drives. I almost always use IMG targets
The physical servers are not domain-joined, but the VMs are
Network:
Onboard LAN is used to manage the physical servers. Hyper-V virtual switches are bound to the 10Gbit ports.
One 10Gbit port goes directly to the other node, no switch involved. Cheap and low latency!
The other port goes to a Dell 6224 (24x 1G + 4x 10G), one switch per node.
Hyper-V nodes have 2x 1G ports in a team, with one port going to each switch, and a single 10G port going to one switch.
The two switches are connected by a 6x 1G LAG that carries all VLANs. Because there is only one production Starwind at the moment, this means that half my Hyper-V nodes are limited by having to traverse that LAG, but they don't produce enough I/O for this to be a problem yet. When I deploy Starwind 5 HA, I intend to set up the initiators so that they prefer the Starwind node they can reach by pure 10G.
The switches have redundant power supplies, and are on the same UPS as the Starwind servers, and the whole building has a UPS too.
VLANs: due to a port shortage, I unfortunately have to have my iSCSI and management traffic in the same VLAN. However, the management traffic is tiny, and I've got 10G to play with. The VMs that use Starwind for VHD storage are themselves each in their own VLANs and are also firewalled off, and their traffic never touches the iSCSI network.
Performance: I've given my production Starwind VM just one core and 1GB RAM, and it is hardly taxed even though I've got about 40 VMs running off it. It's only by doing benchmarks that I can really stress it! Generally I see network spikes of up to 25% on the 10G connection, but the average is below 3%. Doing a benchmark against SSD storage I can get to 50% before the CPU maxes out, so given that I can quadruple the CPU allocation, I could easily cope with 100% of 10G given fast enough drives. With more/faster CPUs, more RAID cards, SSDs, and more NICs, I think I could push 3-4GB/sec before needing more iSCSI servers. I've not done performance testing with 5 yet, though - I'm still trying to get it to work reliably!
Problems I've had with sync:
1) The Starwind service on the node that is out of sync will sometimes spontaneously shut down (this still happens in 5.2 with patches)
2) With 5.0 (not repeated yet in 5.2) I had a situation where Starwind on the OK node crashed while syncing with the previously failed node - corrupting data on both nodes
3) If Starwind is spontaneously shutting down, this can get into a repetitive cycle where I can't recover that node. In this situation, I have sometimes been able to keep it running long enough to remove the target, then delete the img and sync file, remove the target from the other node, and recreate it using the img on the non-failed node. But this obviously means downtime, and will be extremely complicated and time-consuming with multiple targets
I'm heartened that with 5.2 I found I can force a full resync, which has helped once.
Jlaay: I had a face-slapping moment when you said to check the time on the system clocks; sure enough, I'd stupidly left the clock sync (with the physical server) option on in the Hyper-V settings and they'd got out of sync. Turning this off ensured that both nodes had the same time from the domain controller, which I verified, but unfortunately I still get the same issues. There aren't any networking issues that I can find.
How do I know I got data corruption? Well, firstly, both nodes are shown as out of sync by Starwind. With 5.0 I couldn't force a resync from one to the other; with 5.2 I've found that you can. But that's only useful if you know that one of your nodes has a good copy of the img. The other signs: apps running on the initiator machine fail - e.g. a continuous ATTO benchmark reports an error, a VHD creation fails, Windows Explorer hangs, Disk Manager hangs, or for a really cool demo - you set up a VM on your Hyper-V cluster and start installing Windows on it, and although the VM doesn't crash, the Windows installation inside it complains that data it thought it had written isn't there any more and you have to start again!
Since 5.2 I haven't got as far as data corruption, but I've not pushed it too hard yet - I'm being hit by the Starwind service shutting down before I can resync.
OK, now on to split brain...
Why I think I'm unlikely to be affected by a split brain failure....
1) Although I have a separate network for sync, it's on the same NIC as the network used to talk to initiators. Therefore, unless just one port fails, a NIC failure would also take out the initiator network, meaning no data would be written to that node
2) CX4 cables are hard to detach accidentally, and the servers are right at the top of the rack! It's a short (50cm) cable too, not dangling anywhere
3) When I run Starwind 5 HA in production, I won't be using round-robin MPIO, as I've found performance to be lower than when using MPIO active/passive failover - probably the overhead of the MPIO load-balancing algorithm, or due to one path being 10G and the other 1G. With failover only, I won't be writing to both nodes simultaneously, which hopefully lessens the risk of data corruption in a split brain scenario...
However...
As I expand I will have to put more NICs in the servers. I will then have to use teaming for the sync channel. NIC team failover hasn't been as robust as MPIO in my experience, so I'm not sure if I can trust it...
At some point I want to run HA between datacenters (fast - 100Mbit, 1Gbit and even 10Gbit metro Ethernet within central London is almost affordable) so that I can survive a disaster at one DC, and live migrate between DCs. Now, even though they will claim I am getting diverse fiber, chances are that there will still be a point where the runs are physically close to each other, and some idiot with a pickaxe will find a way to break them both. I don't want to end up with corrupt data in both nodes!
Switches and NICs can fail. Drivers can fail. Cables can fail. Lots of things can go wrong, no matter what you do with teaming, MPIO and UPSes, which could lead to the situation where your sync network is dead, but your initiators can still see both nodes and are writing to both of them. In this situation, you want your high availability cluster to shut down a node so that your data isn't corrupted.
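To make that failure mode concrete, here's a toy Python sketch (nothing to do with Starwind's internals - the node dicts and the write helper are purely made up for illustration) showing how the two copies silently diverge once the sync link is cut but both nodes keep accepting writes:

[code]
# Toy illustration only: each "node" is just a dict of block -> contents.
node_a = {}
node_b = {}

def write(block, value, sync_link_up):
    """Write to node A, and mirror to node B only while the sync link is up."""
    node_a[block] = value
    if sync_link_up:
        node_b[block] = value

# Normal operation: both copies stay identical.
write("block-1", "payroll v1", sync_link_up=True)

# Sync link dies, but initiators can still reach both nodes and keep writing;
# some writes land on A, some only on B, depending on which path gets used.
write("block-1", "payroll v2", sync_link_up=False)
node_b["block-2"] = "invoice v1"   # a write that only ever reached node B

print(node_a)   # {'block-1': 'payroll v2'}
print(node_b)   # {'block-1': 'payroll v1', 'block-2': 'invoice v1'}
# Neither node now holds a complete, consistent image - which is why an
# isolated node should stop serving writes rather than carry on.
[/code]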
Suggestions:
I've been doing clustering for about 10 years but I am definitely not an expert on the subject, so this is just a brain dump...
- the primary node for an HA target needs to take it over in the event of a split brain, and the secondary needs to shut down
- all networks that connect the two nodes should be used for heartbeats
- actual sync data (i.e. changed data that one node is sending to the other) should be able to go over more than one network (MPIO, not teaming) and the user should be able to specify which (e.g. if you have 4x 10G and 2x 1G, you might want 2x 10G for talking to initiators and 2x 10G for sync, but you want all six for heartbeat)
- witness server - perhaps a lightweight (Starwind free?) instance that is connected to both HA nodes via the heartbeat/sync networks - if a node can't see the witness, it knows that it should shut down. This is how SQL Server mirroring works (as opposed to SQL on a Windows failover cluster). There's a rough sketch of this idea after the list.
- mandate that round-robin MPIO NOT be used for HA, to reduce the risk of data being written to a node that should have shut down
- only confirm a write to the initiator once you've confirmed that both nodes have written it to disk
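To show what I mean by the heartbeat/witness idea, here's a rough Python sketch of the decision logic. This is not Starwind code - the addresses, the port, and the can_reach/decide_role helpers are all made up for illustration - but it captures the rule I'd like: heartbeat over every shared network, and a node that loses both its partner and the witness stops serving, so at most one side keeps accepting writes.

[code]
# Sketch only: hypothetical witness/heartbeat arbitration, not Starwind's mechanism.
import socket

# Assumed addresses - in my setup these would be the direct 10G sync link,
# the switched path, and a small third "witness" box on the LAN.
HEARTBEAT_PEERS = {
    "partner-sync": ("10.0.0.2", 3260),
    "partner-lan":  ("192.168.1.2", 3260),
    "witness":      ("192.168.1.10", 3260),
}

def can_reach(addr, timeout=1.0):
    """True if a TCP connection to addr succeeds within the timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def decide_role(am_primary):
    """Decide whether this node should keep serving its HA targets."""
    # Heartbeat the partner over *every* network that connects the nodes,
    # not just the dedicated sync channel.
    partner_alive = any(can_reach(addr)
                        for name, addr in HEARTBEAT_PEERS.items()
                        if name.startswith("partner"))
    witness_alive = can_reach(HEARTBEAT_PEERS["witness"])

    if partner_alive:
        return "serve"      # normal operation, keep syncing
    if witness_alive and am_primary:
        return "serve"      # partner gone, but we hold the tie-breaker
    return "shutdown"       # isolated: stop serving to avoid split brain

if __name__ == "__main__":
    print(decide_role(am_primary=True))
[/code]

Obviously in a real product the heartbeat would be a proper protocol rather than a bare TCP connect, and the primary/secondary roles would be per-target, but hopefully it makes the intent clearer.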
Finally, a question: I've never tried enabling the CRC option in the MS initiator - would that protect against data corruption in split-brain scenarios?
cheers,
Aitor