VM Performance during full sync

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

jimbul
Posts: 23
Joined: Tue Mar 01, 2011 10:35 am

Wed Apr 06, 2011 2:10 pm

Dear all,

At present I have one test VM on our eventual, intended production VM cluster, which uses HA 16 TB StarWind licenses.

I have StarWind 5.6 configured as follows:

2x SAN machines (HP DL370 G7, 12 GB RAM, 8x 2.5" 500 GB 7200 rpm disks in RAID 10), 6 NICs on each machine, configured with jumbo frames and the StarWind-recommended IP enhancements as follows:

- Storage NICs x3 on three separate subnets (my plan is to present different CSVs over different NICs to avoid saturation - is this dumb?), all 1 GbE, no default gateway or DNS. VLAN 500, with IPs in the ranges 192.168.240.0/27, 192.168.240.32/27 and 192.168.240.64/27; the SAN storage NICs correspond and sit at the other end of these /27 ranges (there's a quick sketch of the /27 split after the Hyper-V NIC list below).
- Management NIC x1 on the LAN (VLAN 1), 1 GbE, 10.0.1.235/.236
- Heartbeat NIC x1, crossover to the other SAN server, 1 GbE, no VLAN, 192.168.241.3/.4
- Sync NIC x1, crossover to the other SAN server, 1 GbE, no VLAN, 192.168.242.3/.4


2x Hyper-V R2 servers, 6 NICs on each, configured with jumbo frames and the StarWind-recommended IP enhancements as follows:
- VM switch on 2x adapters, teamed: 1 GbE each, 2 GbE teamed, no IP.
- Storage NICs x2 on separate subnets, corresponding to the first two storage NICs on the SAN servers, 1 GbE on VLAN 500 (they sit in the same subnets as the corresponding SAN NICs).
- Heartbeat NIC x1 on a crossover to the other VM host, 1 GbE, no VLAN.
- Management NIC x1, 1 GbE (VLAN 1), 10.0.1.250/.251
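
To sanity-check that /27 split, here is a minimal sketch in Python (only the three ranges themselves are from the config above; the individual host and SAN addresses are illustrative picks, not my actual assignments):

Code:
import ipaddress

subnets = [ipaddress.ip_network(n) for n in
           ("192.168.240.0/27", "192.168.240.32/27", "192.168.240.64/27")]

# None of the three /27 storage ranges should overlap.
for a in subnets:
    for b in subnets:
        assert a is b or not a.overlaps(b), f"{a} overlaps {b}"

# Hypothetical placement: Hyper-V host NICs low in each range, SAN NICs at the top end.
examples = {
    "192.168.240.1":  subnets[0],   # hypothetical Hyper-V storage NIC
    "192.168.240.30": subnets[0],   # hypothetical SAN storage NIC
    "192.168.240.33": subnets[1],   # hypothetical Hyper-V storage NIC, second subnet
}
for ip, net in examples.items():
    print(f"{ip} in {net}: {ipaddress.ip_address(ip) in net}")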


All targets are configured as HA; the quorum has 64 MB of write-back cache and the 100 GB CSV has 1024 MB. Sync and heartbeat channels are configured, and the Windows initiator uses MPIO for each target. All cluster validation tests pass green, live migration works well and performance seems great under load.

I have no performance problems in terms of speed to/from the SAN; under normal circumstances I see a constant 115 MB/s on every interface carrying iSCSI traffic, so it seems to work great.

When I do a gentle shutdown of a SAN node it syncs back again fine, and there are no performance issues during a fast sync. After a harsh shutdown, however, the VMs continue to run but any operation beyond a ping is next to impossible while they are connected during a full sync. If we lose one side of the SAN, things need to keep running during the slow sync. My question is: what am I doing wrong? Connectivity clearly remains, but it's too clogged up to be useful.

While the slow sync occurs, the sync NIC is only 60% utilised (despite managing TCP wire speed when not handling sync traffic).

I read elsewhere on the forum (http://www.starwindsoftware.com/forums/ ... tml#p12694) that you should put the heartbeat on a VLAN on the client connection NIC. By this do you mean the management connection, or the iSCSI connections that the VM hosts connect to? (See the quoted section from that URL below.)
1. On each server's client connection NIC - deploy 2 VLANs, resulting in separation into 2 IP subnets (e.g. 10.10.10.1 and 10.10.20.1)
10.1 will be used for heartbeat, 20.1 will be used for client connections.
Your config should now have 3 addresses to connect to, e.g.:
10.10.100.1 - Server1 Sync
10.10.20.1 - Server1 Client
10.10.10.1 - Server1 Heartbeat

10.10.100.2 - Server2 Sync
10.10.20.2 - Server2 Client
10.10.10.2 - Server2 Heartbeat
Any advice appreciated - I've been banging my head against a wall on this for four days now. My only remaining issue is VM performance during a full sync; given everything else I'm reading I was expecting it to be viable for production use, so I'm keen to know what I've configured incorrectly.

Thanks,

Jim
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Wed Apr 06, 2011 3:23 pm

I read elsewhere on the forum (http://www.starwindsoftware.com/forums/ ... tml#p12694) that you should put the heartbeat on a VLAN on the client connection NIC. By this do you mean the management connection, or the iSCSI connections that the VM hosts connect to? (See the quoted section from that URL below.)
The heartbeat connection should be on the iSCSI connection NIC; the heartbeat does not generate any traffic, so there is no need to dedicate a physical port to it.

As for the performance you're getting - there are a few possible reasons for this:
1. The RAID array is not capable of delivering the desired random access performance level.
2. In your case StarWind is not saturating the sync channel at 100% - this is going to be fixed on our side within the 5.x branch.
3. The same as 2, but for the client iSCSI sessions; we're working on an MCS implementation which will intelligently balance the load between the active sessions, thereby increasing access speed.
Max Kolomyeytsev
StarWind Software
jimbul
Posts: 23
Joined: Tue Mar 01, 2011 10:35 am

Thu Apr 07, 2011 1:05 am

Hi Max,

- Thanks for the clarification.

1. I hear you, but the desired performance level here is just to keep one VM running while doing nothing - all I've tried to do is open the Start menu, and it takes 30 seconds. With a sync that isn't even saturating a 1 GbE connection, and an 8-disk RAID 10 array on a 512 MB BBWC card (which I confess isn't exactly electric), I find it hard to believe the array is overloaded. Aside from that, performance from volumes on the same RAID set that aren't syncing is fine. I know that's bad practice and I wouldn't do it in production, but I wanted to see whether the RAID set truly was overloaded - it isn't, far from it. How can I/we look into this further? I can't implement this in HA if it can't offer a minimum of 20% of its usual performance during a full sync.

2. That's good to hear, when will this be released?

3. Ok, sounds good.

On a separate note, are there plans to update the documentation? I've found the forums very useful (essential, in fact) but got hamstrung by the NIC config, and there is no mention of a heartbeat interface in the online manual for 5.6 - not that I could see, anyway.
@ziz (staff)
Posts: 57
Joined: Wed Aug 18, 2010 3:44 pm

Thu Apr 07, 2011 8:20 am

jimbul wrote:Hi Max,

- Thanks for the clarification.

1. I hear you, but the desired performance level here is just to keep one VM running while doing nothing - all I've tried to do is open the Start menu, and it takes 30 seconds. With a sync that isn't even saturating a 1 GbE connection, and an 8-disk RAID 10 array on a 512 MB BBWC card (which I confess isn't exactly electric), I find it hard to believe the array is overloaded. Aside from that, performance from volumes on the same RAID set that aren't syncing is fine. I know that's bad practice and I wouldn't do it in production, but I wanted to see whether the RAID set truly was overloaded - it isn't, far from it. How can I/we look into this further? I can't implement this in HA if it can't offer a minimum of 20% of its usual performance during a full sync.

2. That's good to hear, when will this be released?

3. Ok, sounds good.

On a separate note, are there plans to update the documentation? I've found the forums very useful (essential, in fact) but got hamstrung by the NIC config, and there is no mention of a heartbeat interface in the online manual for 5.6 - not that I could see, anyway.
1. The RAID array may not be overloaded, but it performs worse for random access than for sequential.
During a full sync most resources are used by the sync process, and clients get much less chance to use the resources they need. In v5.7 we will introduce priorities to balance between the sync and client connections; this will allow clients (VMs) to continue running normally during a full sync (although the full sync will take longer to complete).
2. It will be released in May.
Regarding the documentation, we are working on it now; some documents are already updated, and a complete update of all documentation should be finished by mid-summer. We have a dedicated document for the heartbeat; it is not available on our site yet, but we can provide it on demand.
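
To put point 1 in rough numbers, here is a back-of-envelope spindle estimate for an 8x 2.5" 7200 rpm RAID 10 array like the one described above (the per-disk figures are typical textbook assumptions, not measurements of that hardware):

Code:
# Very rough spindle math: plenty of sequential throughput for a GbE sync,
# but a modest random-I/O budget that client VMs and the sync must share.
DISKS = 8
RAID10_WRITE_PENALTY = 2        # each write lands on both halves of a mirror
IOPS_PER_DISK = 90              # assumed typical random IOPS for a 7200 rpm drive
SEQ_MB_S_PER_DISK = 100         # assumed typical streaming rate per drive

random_read_iops = DISKS * IOPS_PER_DISK
random_write_iops = DISKS * IOPS_PER_DISK / RAID10_WRITE_PENALTY
sequential_mb_s = DISKS / 2 * SEQ_MB_S_PER_DISK   # RAID 10 stripes over 4 mirror pairs

print(f"random reads : ~{random_read_iops} IOPS")
print(f"random writes: ~{random_write_iops:.0f} IOPS")
print(f"sequential   : ~{sequential_mb_s:.0f} MB/s")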
Aziz Keissi
Technical Engineer
StarWind Software
jimbul
Posts: 23
Joined: Tue Mar 01, 2011 10:35 am

Thu Apr 07, 2011 8:56 am

1. The RAID array may not be overloaded, but it performs worse for random access than for sequential.
During a full sync most resources are used by the sync process, and clients get much less chance to use the resources they need. In v5.7 we will introduce priorities to balance between the sync and client connections; this will allow clients (VMs) to continue running normally during a full sync (although the full sync will take longer to complete).
2. It will be released in May.
Regarding the documentation, we are working on it now; some documents are already updated, and a complete update of all documentation should be finished by mid-summer. We have a dedicated document for the heartbeat; it is not available on our site yet, but we can provide it on demand.
I'd be very grateful if you could pass me the heartbeat documentation update.

On the other points, if that's the case I'll await 5.7 before trying to implement HA.

Thanks,

jim
@ziz (staff)
Posts: 57
Joined: Wed Aug 18, 2010 3:44 pm

Thu Apr 07, 2011 9:11 am

jimbul wrote:
I'd be very grateful if you could pass me the heartbeat documentation update.

On the other points, if that's the case I'll await 5.7 before trying to implement HA.

Thanks,

jim
OK, you will receive the document at the email address you use on the forum.
Personally I think you can start implementing HA even before the release of v5.7, for a very simple reason: you will not have to run a full sync frequently; in most cases a fast sync (which doesn't decrease client performance) is launched.
Aziz Keissi
Technical Engineer
StarWind Software
jimbul
Posts: 23
Joined: Tue Mar 01, 2011 10:35 am

Fri Apr 08, 2011 2:37 am

OK, you will receive the document at the email address you use on the forum.
Personally I think you can start implementing HA even before the release of v5.7, for a very simple reason: you will not have to run a full sync frequently; in most cases a fast sync (which doesn't decrease client performance) is launched.
Thanks for the document, and I accept your point completely - it's unlikely to happen frequently. The problem I have, though, is that if I run in HA and, on the off-chance (and I know how unlikely this is, but it happens to us all every so often), one of the SAN servers falls over and blue-screens, I have to wait for a full sync to finish before my servers are usable. At least on my choice of hardware that might take all day at half GbE for some volumes, and sadly my users won't tolerate losing mail or file access, so it's not a risk I can take, just in case. I will implement with 5.7, when we can throttle the wire-speed sync so we have half- or one-third-speed access during a full sync. For now, it's too risky for me.
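
As a rough sanity check on "all day at half GbE" (the volume sizes below are hypothetical; only the ~115 MB/s wire speed and the ~60% sync utilisation are what I've actually observed):

Code:
GBE_WIRE_MB_S = 115          # practical GbE throughput observed on these links
SYNC_UTILISATION = 0.6       # sync NIC utilisation reported during a full sync

def resync_hours(volume_gb, rate_mb_s):
    """Hours to copy volume_gb gigabytes at rate_mb_s megabytes per second."""
    return volume_gb * 1024 / rate_mb_s / 3600

for volume_gb in (100, 2000, 8000):   # the 100 GB CSV up to hypothetical multi-TB volumes
    rate = GBE_WIRE_MB_S * SYNC_UTILISATION
    print(f"{volume_gb:>5} GB volume -> ~{resync_hours(volume_gb, rate):.1f} h at {rate:.0f} MB/s")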

Just out of interest, can you give me an example of the kind of hardware you would currently need (on 5.6) for one VM to remain accessible during a full sync of the image it resides on?

Cheers,

Jim
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Fri Apr 08, 2011 10:51 am

Just out of interest, can you give me an example of the kind of hardware you would currently need (on 5.6) for one VM to remain accessible during a full sync of the image it resides on?
You need to throttle down the bandwidth on the sync network, and there are several ways to do this. If you are familiar with Windows QoS you can use it to set bandwidth limits on the sync NIC, which gives you quite precise control. If you have a switch between the servers there may be QoS features on the switch, or perhaps you can force the NICs to a lower link speed. If you are using 10 GbE like me you can't set a lower speed, but what you could do is connect a 10 GbE switch to each server and link the switches using 1 GbE.
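
As an illustration of what a cap costs you, here is a small sketch of resync time versus the bandwidth a throttle leaves free (the 1000 GB volume is hypothetical, and treating freed sync bandwidth as directly available to client I/O is a simplification):

Code:
GBE_WIRE_MB_S = 115.0   # practical GbE wire speed
VOLUME_GB = 1000.0      # hypothetical volume that needs a full resync

print(f"{'sync cap':>12}  {'resync time':>11}  {'left for clients':>16}")
for cap in (1.0, 0.66, 0.5, 0.33):
    sync_rate = GBE_WIRE_MB_S * cap
    hours = VOLUME_GB * 1024 / sync_rate / 3600
    print(f"{sync_rate:9.0f} MB/s  {hours:9.1f} h  {GBE_WIRE_MB_S - sync_rate:12.0f} MB/s")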

However, although you are keeping your VMs running, you are prolonging the time before HA is restored, so it's a trade-off. What if you want to restore HA quickly AND not interrupt the VMs? There are several possibilities, but you need to rehearse them - they are not something you want to learn how to do during a crisis...

If you have another HA StarWind array with enough capacity, you can use Quick Storage Migration in SCVMM to move the VHDs from one array to another without shutting down the VMs (there will be a couple of brief interruptions while each VM's state is saved and then restored). You can also do this between HA targets on the same StarWind setup (e.g. if you have spare drives, just create a new HA target and use QSM to move VMs from one to the other).

Now for the really crazy solution: If you are using the same RAID cards in each Starwind node, and you are using RAID1 or RAID10, and you are 100% sure which disks are which, you can:

- Shut down all VMs and the Hyper-V nodes
- Remove the HA target (but don't delete the IMG!)
- Shut down the StarWind boxes
- Move half the disks (RAID 1 and RAID 10 use mirroring) from one server to the other, so you have a complete set of data on both servers
- Import the RAID into the other server's RAID card
- Give both servers spare drives so the RAIDs can rebuild
- Get the volume up in Windows
- Create a new HA target with the same name as the old one, point it at the IMGs, and do not sync
This would be the fastest way if you had lots of data. The trick is that it involves no copying of data, because RAID 1/10 already gives you a copy. But you really have to know what you are doing.
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sun Apr 10, 2011 10:13 pm

1) Playing with QoS on the local NIC or directly on the switch should work. But it's not the kind of tuning that a) everybody could do and b) we expect to work everywhere... so we'll introduce custom sync channel bandwidth throttling even sooner than we promised.

2) Really crazy idea :) But it should work. We called such an approach "FloppyNet" a few decades ago :)
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jimbul
Posts: 23
Joined: Tue Mar 01, 2011 10:35 am

Mon Apr 11, 2011 4:07 am

All very useful - thank you.
michal
Posts: 7
Joined: Wed Feb 23, 2011 4:04 am

Mon Apr 11, 2011 7:41 am

anton (staff) wrote:2) Really crazy idea :) But it should work. We called such an approach "FloppyNet" a few decades ago :)
OT: Crazy as it is, it's still very much in use - not typically in SANs, mind you, but... :lol: I believe people call it "sneakernet" nowadays. :)
http://en.wikipedia.org/wiki/Sneakernet

People even surf the net or send emails by "floppynet" or "sneakernet"! http://en.wikipedia.org/wiki/Wizzy_Digital_Courier :shock:
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Mon Apr 11, 2011 9:54 am

Hmm, I think I've got an idea for a killer solution which will sync terabytes in the blink of an eye...
Max Kolomyeytsev
StarWind Software
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Apr 11, 2011 10:35 am

"Old Harleys Never Die!" (c) ...

LOL
michal wrote:
anton (staff) wrote:2) Really crazy idea :) But it should work. We called such an approach "FloppyNet" a few decades ago :)
OT: Crazy as it is, it's still very much in use - not typically in SANs, mind you, but... :lol: I believe people call it "sneakernet" nowadays. :)
http://en.wikipedia.org/wiki/Sneakernet

People even surf the net or send emails by "floppynet" or "sneakernet"! http://en.wikipedia.org/wiki/Wizzy_Digital_Courier :shock:
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
