Fri May 06, 2011 6:16 pm
Simplistic risk calculation... I am not very good at maths, so please someone jump in and correct this if I make a mistake.
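RAID0 probability of failure = number of drives x probability of single drive failure (roughly; that's a fair approximation while the numbers are small, the exact figure being 1 - (1 - p)^n for n drives that fail independently)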
Then Starwind HA is effectively RAID1, which reduces your risk. Effectively you have RAID 0 + 1, which is going to be a little more risky than a two drive RAID 1, but nowhere near as risky as a pure RAID 0.
So... let's say you go mad with cheap 2TB SATA drives, let's assume a 5% annual failure rate, and stick 8 of them into a RAID 0. The probability of that RAID going down within a year is 40% (strictly 1 - 0.95^8, which is nearer 34%, but 40% is close enough for a rough estimate).
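To put the rule of thumb next to the exact figure, here is a minimal Python sketch. It assumes drive failures are independent and that the 5% annual rate is the same for every drive; the function name is just made up for illustration.

```python
def raid0_annual_failure(n_drives: int, p_drive: float) -> tuple[float, float]:
    """Return (rule-of-thumb, exact) chance that a RAID 0 loses a drive in a year.

    Assumes each drive fails independently with the same annual probability.
    """
    approx = n_drives * p_drive                # the simple "n x p" estimate
    exact = 1 - (1 - p_drive) ** n_drives      # 1 - chance that every drive survives
    return approx, exact


approx, exact = raid0_annual_failure(8, 0.05)
print(f"approx {approx:.1%}, exact {exact:.1%}")  # approx 40.0%, exact 33.7%
```

The two figures drift further apart as you add drives or as the per-drive failure rate goes up, so the simple multiplication is only good for a ballpark.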
Say a single drive fails - there's a 40% chance it could happen in a year. That means about a 0.1% chance of it happening on any given day. Or, roughly 5.69% that it could happen on a Saturday, sometime over a year. The chance that a drive could fail in the other RAID0 before Monday, when someone can replace the failed drive, is 2 x 0.1% = 0.2%. Multiply the two together (5.69% x 0.2%) and you have roughly a 0.011% chance that in a year you will get two drive failures over a weekend, hosing all your data.
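A quick check of the weekend arithmetic, as a rough Python sketch. It takes the 40% per-array annual figure as given, spreads it evenly over 365 days, counts 52 Saturdays, and treats the two arrays as independent; all of those are simplifying assumptions, and the function name is made up for illustration.

```python
def weekend_double_failure(p_array_year: float, days_exposed: int = 2) -> float:
    """Rough yearly chance that one array dies on a Saturday and the
    other dies before Monday (both assumed independent)."""
    p_day = p_array_year / 365              # chance an array dies on a given day
    p_saturday = p_day * 52                 # dies on some Saturday during the year
    p_other = days_exposed * p_day          # partner array dies before Monday
    return p_saturday * p_other


p_both = weekend_double_failure(0.40)
print(f"{p_both:.4%}")  # 0.0125%
```

Working with unrounded figures gives a touch over 0.012%; rounding the daily rate to 0.1% along the way brings it down to about 0.011%. Either way it is on the order of one in ten thousand per year, not one in a hundred.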
If a volume dies, then the HA volume on that Starwind node will die too, but it will be OK on the other node. Starwind itself will stay running (unless Windows/Starwind are running off the same volume!). You will have to recreate the volume from scratch, format it, shut your working HA partner down gracefully, shut down all your iSCSI clients, bring back the working partner, delete the target (but not the img), copy the img across to the other server, and recreate the HA target in Starwind, choosing not to sync.
As you went for 8 x 2TB drives, and you've got a 10G connection (let's assume) between servers, and it's RAID0 where each drive can do 100MB/sec, the array can stream around 800MB/sec. That gets you most of the way to filling the 10G connection, and the 16TB img should copy across in about 6 hours. That's pretty much a day's downtime. What will that cost you, and is it worth the roughly 40% chance that it will happen once in a year?
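The copy time can be estimated the same back-of-envelope way. This Python sketch assumes the img fills the whole 16TB, uses decimal units (1TB = 1,000,000MB), ignores protocol and filesystem overhead, and takes the slower of the array and the link as the bottleneck; the function name and parameters are made up for illustration.

```python
def copy_hours(capacity_tb: float, n_drives: int,
               drive_mb_s: float, link_gbit_s: float) -> float:
    """Estimate hours to copy a full-size img between servers."""
    array_mb_s = n_drives * drive_mb_s          # RAID 0 sequential throughput
    link_mb_s = link_gbit_s * 1000 / 8          # 10 Gbit/s is about 1250 MB/s
    throughput = min(array_mb_s, link_mb_s)     # whichever is slower wins
    seconds = capacity_tb * 1_000_000 / throughput
    return seconds / 3600


print(f"{copy_hours(16, 8, 100, 10):.1f} hours")  # 5.6 hours
```

With these numbers the drives (800MB/sec) are the limit rather than the 10G link, and real-world overhead will only push the figure up from there.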
If you go for any form of rebuildable RAID (1, 5, 10 or 6) with a hotspare, not only do you reduce risk, but you don't have to worry so much about someone swapping the drive... and it would take at least two drive failures on the same array before you have to worry about a full resync.
If Starwind goes down but your RAIDs are OK? It depends on how long you are down: if it's within the fast sync window then you will be up again with no downtime. I would say that this is far less risky to your data than RAID0, but the probability of it happening is pretty much 100% (at least for planned downtime) because at some point you will need to update Windows and reboot.
Anyway, it all hinges on the number of drives and their reliability. Oh, also, with those large capacity drives, silent data corruption is a big risk too, which increases with capacity.