HA Sync Performance


anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Thu Sep 02, 2010 6:56 pm

Hold on for a second... I've re-read Constantin's post and I don't see any place where he suggested using "el cheapo" hardware for production. He only suggested trying OTHER hardware to see whether it would make a difference or not. This is number one...

Yes, of course you're 200% correct on the HA performance thing. And if our customers (at least some of them) do have issues, we consider this OUR problem and not our customers'. And we'll take care of it! We actually do :) What we kindly ask for is some assistance from your side in helping to pinpoint, isolate and finally fix the issue... That's all :) This is number two...

Thank you!
matty-ct wrote:Interesting idea there. Use a crossover cable to eliminate the switch? However, HP and Dell comprise the overwhelming majority of the world's servers. Claiming that HP, Dell, Broadcom and Intel all have it wrong is hard to swallow. I can't imagine any enterprise IT department having much confidence in the suggestion to use RealTek adapters in lieu of server class offerings from the big server vendors.

I don't claim to have the answer but if I sold any of my enterprise clients an expensive iSCSI solution and I then told them that we needed to add non-Intel, non-Broadcom, non-Dell, or non-HP NIC's to their servers, they'd seriously question my judgment and expertise. This answer gives me great pause. I find the Starwind solution an excellent one but questions regarding HA performance are pertinent. Good luck, guys!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Constantin (staff)

Fri Sep 03, 2010 7:52 am

Well, I'm not saying anything against Broadcom. It works great! Yesterday on one of the forums I found a lot of reports about the Intel 1000 and the MS iSCSI Initiator - it has problems with any target!
Currently, with the help of one of our customers who recently ran into a similar problem, we have captured a dump of all TCP traffic along with the StarWind logs, so today our R&D department will analyze it to see where the bug is!
DavidMcKnight
Posts: 39
Joined: Mon Sep 06, 2010 2:59 pm

Mon Sep 06, 2010 3:15 pm

For what it's worth, we're having the same problem.

I have two datastores running Server 2008 R2, each with two quad-core Intel processors and an Intel 10GigE card. The RAID is on an Areca 1680i PCI Express RAID card with 15x 1TB Seagate SAS drives. If I clone a VM on a non-HA target I can get between 2 and 4 gigabit transfer speeds, which to me is pretty fast. Once I enable HA, the speed drops below 1Gb to around that magic 700Mb I see in other posts. The really bad thing is that it slows down even the NON-HA targets on my boxes.

I worked with tech support starting back in April for a couple of months with no solution; back then they were telling me no one else was having this problem. I have all but given up on HA from StarWind. I hope enough people complain that StarWind really tries to fix this problem.
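
For reference, those rates work out roughly as follows (plain bits-to-bytes conversion, no protocol overhead assumed, so real payload throughput will be a bit lower):

# Quick conversion of the rates quoted above (no protocol overhead assumed).
rates_gbit = {"non-HA clone, low end": 2.0,
              "non-HA clone, high end": 4.0,
              "with HA enabled": 0.7}
for label, gbit in rates_gbit.items():
    print(f"{label}: {gbit * 1000 / 8:.0f} MB/s")
# non-HA clone, low end: 250 MB/s
# non-HA clone, high end: 500 MB/s
# with HA enabled: 88 MB/s
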
Constantin (staff)

Mon Sep 06, 2010 3:39 pm

We are working hard on it! I would like to set up a few screen-sharing sessions to take network dumps and StarWind logs to analyze in our R&D lab.
P.S. I recently did exactly that; as a result, the server vendor's support confirmed it was a hardware fault and replaced the hardware. :)
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Mon Sep 06, 2010 4:50 pm

DavidMcKnight,

Are both paths between your clients and your StarWind boxes 10Gbit end-to-end, and are you using a separate 10Gbit connection for sync?

I'm doing some heavy failover testing at the moment, and I see three different performance levels when my "A" node is running by itself, when my "B" node is running by itself, and when both are running normally - and this is totally explainable by looking at my network:

client box: MS iSCSI initiator, single Intel 10Gbit nic (actually lom!)
Switch A: 10Gbit / 1Gbit - 10Gbit to client box, 10Gbit to Starwind node A, 1Gbit to Switch B
Switch B: 10Gbit to Starwind node B, 1Gbit to Switch A
Starwind Node A: 10Gbit to Switch A, another 10Gbit to Starwind Node B
Starwind Node B: 10Gbit to Switch B, another 10Gbit to Starwind Node A

Both Starwind boxes: Starwind running inside Hyper-V VM (Win2k8 R2), Intel NICs, Areca 1680ix-24 RAID, and for this test, each has 2x 7200rpm SATA drives (2.5") in RAID 1. These are *not* great performers. The HA target is using Write Through caching, and with my test, most of the reads will fit in the cache, so the drives are getting mostly writes. I've written about 3TB of randomly generated data to my test target over five days with no issues, using random and sequential access patterns.
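
For anyone wanting to reproduce that kind of test, here's a minimal sketch of a seeded write/verify integrity check; it is not the actual tool I used, and the drive letter, block size and file size are placeholder assumptions:

# Minimal write-then-verify integrity test: write seeded pseudo-random blocks
# to a file on the HA volume, then re-read them in random order and compare.
# NOT the tool used above; path and sizes are placeholder assumptions.
import random

TEST_FILE = r"X:\integrity_test.bin"   # hypothetical drive letter of the HA volume
BLOCK = 64 * 1024                      # 64 KB per I/O
BLOCKS = 1024                          # 64 MB total for this small example
SEED = 42

def block_pattern(seed: int, size: int) -> bytes:
    """Deterministic pseudo-random block so the verify pass can recompute it."""
    return bytes(random.Random(seed).getrandbits(8) for _ in range(size))

with open(TEST_FILE, "wb") as f:                       # sequential write pass
    for i in range(BLOCKS):
        f.write(block_pattern(SEED + i, BLOCK))

with open(TEST_FILE, "rb") as f:                       # random-order verify pass
    for i in random.sample(range(BLOCKS), BLOCKS):
        f.seek(i * BLOCK)
        if f.read(BLOCK) != block_pattern(SEED + i, BLOCK):
            raise SystemExit(f"corruption detected at block {i}")

print("verify OK")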

When both nodes are operating, the client box peaks at about 15% utilisation.
When node A is operating by itself, the client box peaks at about 30% utilisation.
When node B is operating by itself, the client box peaks at about 8% utilisation.

Why the weird scores? MS MPIO is attempting to balance the I/O across the two paths, but they are not equal: to reach node B, traffic has to go through a 1Gbit connection. So it scores best when 100% of the I/O goes to the one server it can reach over a pure 10Gbit network.
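
To put rough numbers on that (purely illustrative, using the nominal link rates above rather than measurements):

# Toy model: with 1:1 round-robin MPIO, each path carries half the I/O,
# so the aggregate is capped at roughly twice the slowest path.
def round_robin_cap(path_gbps):
    """Aggregate Gbit/s when I/O is split evenly across all paths."""
    return len(path_gbps) * min(path_gbps)

paths = [10.0, 1.0]  # node A reachable at 10GbE, node B behind a 1GbE inter-switch hop
print("both nodes, 1:1 round robin:", round_robin_cap(paths), "Gbit/s")  # capped around 2 Gbit/s
print("node A only                :", max(paths), "Gbit/s")              # full 10GbE path
print("node B only                :", min(paths), "Gbit/s")              # limited by the 1GbE hop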

This doesn't explain the behaviour you are seeing, but I would look at the following:

1) Your network topology. Do any clients have to go through a 1Gbit connection to get to your Starwind boxes?
2) Your MPIO policies - how are these distributing i/o between your HA targets?
3) Is HA sync (between Starwind boxes) going over a 10Gbit path? Is this a DIFFERENT path to the one used to talk to clients?
4) Starwind HA uses the Windows iSCSI initiator for sync. There is an on-request hotfix for it which has cured some BSOD issues I was having - you may want to try it: http://support.microsoft.com/kb/979711/en-gb
5) Check your intel NIC drivers... not fun, although I've found them orders of magnitude better than their RAID drivers...

cheers,

Aitor
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Mon Sep 06, 2010 7:54 pm

Aitor,
Thanks a lot for the detailed explanation.
We're investigating these performance problems, digging into nearly all of the system logs available.
In the meantime, the most useful thing you can do is review every setting and cable one more time; IT is all about the details.
Maybe you're one of the lucky ones and have just missed a setting, which would be much easier to solve than reinstalling drivers.
DavidMcKnight
Posts: 39
Joined: Mon Sep 06, 2010 2:59 pm

Mon Sep 06, 2010 11:37 pm

Aitor_Ibarra wrote: Are both paths between your clients and your StarWind boxes 10Gbit end-to-end, and are you using a separate 10Gbit connection for sync?
Aitor
Yes and no. My Intel NIC is a dual-port CX4 card, and I have grouped the ports together as a virtual NIC: port 1 goes to switch A as the active primary and port 2 goes to switch B as the standby secondary, with switches A and B redundantly connected to each other via 10gig.
Aitor_Ibarra wrote: 1) Your network topology. Do any clients have to go through a 1Gbit connection to get to your Starwind boxes?
No.
Aitor_Ibarra wrote: 2) Your MPIO policies - how are these distributing i/o between your HA targets?
Because of the virtual NICs I've created, as far as the StarWind software knows there is only one path for HA.
Aitor_Ibarra wrote: 3) Is HA sync (between Starwind boxes) going over a 10Gbit path? Is this a DIFFERENT path to the one used to talk to clients?
Same path/same virtual 10gig nic for HA and general iSCSI traffic. I have tried to split it out and had the same slow speeds with HA enabled.
Aitor_Ibarra wrote: 4) Starwind HA uses the Windows iSCSI initiator for sync. There is an on-request hotfix for it which has cured some BSOD issues I was having - you may want to try it: http://support.microsoft.com/kb/979711/en-gb
I have had no issues with the stability of Windows or Starwind.
Aitor_Ibarra wrote: 5) Check your intel NIC drivers... not fun, although I've found them orders of magnitude better than their RAID drivers...
Last time I checked, I had the latest drivers/firmware for my Areca RAID card and the latest drivers for my Intel NIC.

Again, let me be clear: I couldn't be happier with StarWind performance and reliability on my datastores when I don't have HA running. For me to get 2Gb to 4Gb when cloning a VM (theoretically half of that is writes) is great. So what does HA add to the equation that would cause StarWind to slow down so much, and not just for me but for others as well?

I'd love to hear from someone who is as happy with HA as I am with non-HA.
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Tue Sep 07, 2010 10:11 am

Hi David,

I'm pretty happy with HA, although I'm at the final testing stages of the latest beta; I'm still running 4.2 (non-HA) in production.
DavidMcKnight wrote: Yes and no. My Intel NIC is a dual-port CX4 card, and I have grouped the ports together as a virtual NIC: port 1 goes to switch A as the active primary and port 2 goes to switch B as the standby secondary, with switches A and B redundantly connected to each other via 10gig.
I run the same card (Intel dual CX4). I've had issues with Intel's NIC teaming before, so I don't use it on my StarWind boxes. With a failover mode like the one you are using, only one of the 10GbE paths will be in use at a time, and that path will be used both for talking to clients and for sync. So you would expect a drop of about 50% off peak bandwidth: e.g. if an initiator is writing to one StarWind node, that same data is being written over the same wire to the other StarWind node, so they have to share bandwidth. I can't see this explaining the massive drop you see, but it would explain something going from, say, 1GB/sec to 500MB/sec...
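
Back-of-the-envelope illustration of that shared-wire effect (the 4Gbit client write rate is just an example figure, not a measurement):

# With a failover team, one 10GbE port carries BOTH the client writes and the
# sync copy of those same writes, so the port load is roughly doubled.
link_gbps = 10.0            # single active 10GbE port in the team
client_writes_gbps = 4.0    # example client write stream arriving at this node

port_load = client_writes_gbps * 2   # client stream + equal-sized sync stream
print("load on the active port:", port_load, "Gbit/s")                           # 8.0
print("max client writes before the port saturates:", link_gbps / 2, "Gbit/s")   # 5.0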

I really recommend you drop the team, and use one port for starwind <-> initiator and the other for sync. You can also then bypass your switches for sync, which means you gain an extra port on each switch. And you can then more easily monitor bandwidth on each port, which may help you troubleshoot if you still see a throughput drop.
Aitor_Ibarra wrote: 2) Your MPIO policies - how are these distributing i/o between your HA targets?
DavidMcKnight wrote: Because of the virtual NICs I've created, as far as the StarWind software knows there is only one path for HA.
I meant MPIO on the clients. You are going to have at least two paths, one to each StarWind node, for each HA target. In the MS initiator, for instance, the default is to load balance across paths, but there are lots of other options, e.g. preferred path with failover, weighted paths etc. Sometimes you get better performance by keeping just one path active and the other as standby. In my case weighted paths work best, so I can split the paths 10:1 instead of 1:1.

I am probably running a more recent build of StarWind than you, and I am running StarWind as a Hyper-V VM, so our setups have a few differences which could affect performance.

cheers,

Aitor
DavidMcKnight
Posts: 39
Joined: Mon Sep 06, 2010 2:59 pm

Tue Sep 07, 2010 4:16 pm

Aitor_Ibarra wrote: I'm pretty happy with HA, although I'm at the final testing stages of the latest beta; I'm still running 4.2 (non-HA) in production.
What kind of speeds are you getting? On what kind of hardware?

Now I'm not trying to argue just to argue, I'm trying to debate this to become informed, so bear with me. The suggestion that there is a problem with my teamed virtual NIC doesn't make sense to me. I can get 2 to 4Gb on it without HA, so what has changed to make a perfectly good teamed virtual NIC become such a problem? Also, my dilemma is that if I have a hardware issue with a cable, a switch, or one of the ports on the NIC, there is no way to set up a redundant path for the HA traffic without a team.

Sorry if I'm hijacking this thread; originally I was just posting to say I was having this problem too.
darrellchapman
Posts: 10
Joined: Mon Aug 23, 2010 8:28 pm

Tue Sep 07, 2010 4:26 pm

I had this same problem, except I was seeing more like 45% utilization on the sync channel. In one of my attempts to isolate the issue, I replaced the Cisco gigabit switch I was using for HA sync with a cheap D-Link desktop gigabit switch. Although that switch does not support jumbo frames, network utilization went up to 80%, even with none of the registry tweaks applied.
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Tue Sep 07, 2010 5:50 pm

Hi David,

OK, here's a brief spec:

Each StarWind box is a single Xeon 54xx (upgradeable to dual CPU), quad core, with 6GB of DDR3. RAID duties are handled by both the on-mobo LSI SAS2 controller and an Areca 1680ix-24. Drives are a real mix - from slow 5400rpm 2TB up to Intel X25-M, but for production I mostly use Seagate Savvio (2.5", 10K, SAS). Each box has an Intel dual CX4 10GbE NIC. The boxes are connected to each other using one port, and to Dell 64xx switches with the other. All clients are Supermicro twins, which have a single Intel CX4 port each, and run the previous-gen Xeons and DDR2 RAM.

I'm running Windows 2008 R2 Hyper-V, and my production StarWind 4.2 is a VM. One of my 5.5 HA nodes is a VM on the same box; the other is a lone VM on the other box. Each StarWind VM has 1GB of RAM at the moment and 4 virtual processors. I use IMG-based targets; these are used as CSV volumes by my Supermicro twins (a Hyper-V cluster).

My current test is just against a 7200rpm SATA RAID-1. Because of HA + RAID-1, all writes are going to four drives. I get up to 128MB/sec writes and 141MB/sec reads, but a) I'm using write-through cache in StarWind, and b) I'm using write-back cache on the Arecas (4GB cache) - the drives alone are not capable of those speeds. My average speeds are much lower, as I'm doing a lot of random tests as well as sequential (my current test is for data integrity after re-sync, NOT speed).
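
Quick arithmetic on where each client write ends up in this HA + RAID-1 layout (using the 128MB/sec peak above; purely illustrative):

# Each client write is mirrored to the HA partner, and each node's RAID-1
# mirrors it again, so one logical write becomes four physical drive writes.
client_write_mb_s = 128      # peak write rate quoted above
nodes = 2                    # HA partners
mirrors_per_node = 2         # RAID-1 inside each node

physical_copies = nodes * mirrors_per_node
print("physical drive writes per client write:", physical_copies)                  # 4
print("total bytes hitting drives:", client_write_mb_s * physical_copies, "MB/s")  # 512
print("extra sync-link traffic   :", client_write_mb_s, "MB/s")                    # one full copy to the partner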

My production Hyper-V boxes rarely go above 30% utilisation of their 10GbE ports. I can benchmark at over 100MB/sec from within a VM, but it does depend on what the target is and what other I/O is going on on that target.

I can understand why you don't want to mess with your NIC teams, but I would say to look at the MPIO policy on the iSCSI initiators. In Windows the default is round robin, which will read/write from both nodes simultaneously. This can actually mean slower I/O with hard drives, because it will be *much* less sequential: both StarWind boxes will have to read and write to their drives at the same time, although cache (both StarWind and Areca) should help here. If you change the MPIO policy to failover, so only one path is active at a time, this should make the I/O more sequential, although I would still expect a drop if it's going over the same NIC.
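
A simplified illustration of the read side of that, assuming 64K requests are alternated strictly between the two paths (example offsets only):

# With 1:1 round-robin, alternate 64KB requests of a sequential stream go to
# alternate nodes, so each node's drive sees a strided (non-sequential) pattern.
offsets_kb = [i * 64 for i in range(8)]      # one sequential client read stream
node_a = offsets_kb[0::2]
node_b = offsets_kb[1::2]
print("node A serves offsets:", node_a)      # [0, 128, 256, 384] -- 128KB gaps
print("node B serves offsets:", node_b)      # [64, 192, 320, 448]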

As you are able to hit high speeds in non-HA, it could be that what you're seeing there is the max sequential speed your RAID set can do. When you go HA, this drops because of the disk thrashing caused by sync. That's my theory, anyway!
awedio
Posts: 89
Joined: Sat Sep 04, 2010 5:49 pm

Tue Sep 07, 2010 7:18 pm

Aitor,

Is the Intel X25-M your boot disk?
If it is, is it a single drive or 2x in RAID 1?

Femi
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Tue Sep 07, 2010 8:04 pm

awedio wrote: Aitor,

Is the Intel X25-M your boot disk?
If it is, is it a single drive or 2x in RAID 1?

Femi
No, I boot off 10K SAS. X25-Ms are very nice boot disks for PCs, but for servers, boot time isn't so critical, as you (hopefully!) don't reboot very often. I use the X25-Ms in RAID-1 - in fact I pretty much only ever use RAID-1 or RAID-10. SSDs are great for VMs where you have multiple VMs running off the same disk - the random i/o performance really helps.

My servers can take 8x 3.5" disks and 12x 2.5" disks. I also have external chassis ready for when I need them.
ChrisB
Posts: 4
Joined: Tue Aug 31, 2010 1:00 am

Tue Sep 07, 2010 8:10 pm

Well, we got in some different GbE NICs to try out for the sync channel. When first running the sync test with Wireshark, I noticed some CRC errors being generated. After disabling TCP offload and re-running it, the packet trace looked perfectly normal - no oddities. Unfortunately, we were still hitting a bottleneck at 45-50 MB/s.

The NICs used were Marvell Yukons, which gave the same result as our HP/Intel quad-port.

I also did a brief test of the onboard Broadcom NICs which yielded the same low performance.
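
For comparison, a rough sanity check of that 45-50 MB/s figure against what a single gigabit sync link can carry (the 6% header overhead is an assumption, not a measurement):

# Rough ceiling for iSCSI sync over a single GbE link, assuming ~6% combined
# Ethernet/IP/TCP/iSCSI header overhead.
line_rate_mbit = 1000
overhead = 0.06

usable_mb_s = line_rate_mbit * (1 - overhead) / 8
print(f"theoretical usable rate: ~{usable_mb_s:.0f} MB/s")              # ~118 MB/s
print(f"observed 50 MB/s is ~{50 / usable_mb_s:.0%} of that ceiling")   # ~43%
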
awedio
Posts: 89
Joined: Sat Sep 04, 2010 5:49 pm

Tue Sep 07, 2010 8:21 pm

....excuse my response, not trying to hijack this thread...
Aitor,

Using the X25-Ms in RAID-1, are you not worried about the lack of TRIM?