HA Performance Issues - MPIO Round Robin

jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Wed Mar 14, 2012 6:00 pm

I'm having an HA performance issue involving MPIO on my Windows 2008 R2 servers connecting to the back-end StarWind storage. We have a 2-node HA cluster running version 5.8.1964. The issue is that when we specify all 4 paths to our storage in MPIO, performance is only about 20% of what we get otherwise. In other test scenarios we can fully saturate the Gigabit links (99%), but with HA and Round Robin across all 4 paths we top out at 20%.
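For context, here's a quick back-of-the-envelope calculation (a rough Python sketch, assuming two dedicated Gigabit iSCSI links and roughly 90% of wire speed usable after Ethernet/TCP/iSCSI overhead) of what those percentages work out to in MB/s:

Code:
# Rough ceiling for two dedicated 1 Gbit/s iSCSI links.
# Assumes ~90% of wire speed is usable after Ethernet/TCP/iSCSI overhead.
LINKS = 2
LINK_GBIT = 1.0          # Gigabit Ethernet, per link
USABLE = 0.90            # rough allowance for protocol overhead

ceiling_mb_s = LINKS * LINK_GBIT * 1000 / 8 * USABLE
print(f"Practical ceiling: ~{ceiling_mb_s:.0f} MB/s")          # ~225 MB/s
print(f"20% utilization:   ~{0.20 * ceiling_mb_s:.0f} MB/s")   # ~45 MB/s
print(f"99% utilization:   ~{0.99 * ceiling_mb_s:.0f} MB/s")   # ~223 MB/s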

We have been testing with HA targets created on RAM disks on the StarWind nodes to take the underlying disk I/O subsystem out of the equation. All testing has been done with IOMeter, using metrics mentioned in other posts (4 workers / 64 outstanding I/Os / 32K 100% read, etc.).

Here is our basic setup on each StarWind Node:
- Single 10 Gigabit crossover between the two nodes for Sync Channel
- 2 Gigabit Ethernet NICs on separate subnets for ISCSI traffic

Here is our basic setup on the Windows 2008 R2 servers connecting to the Starwind Nodes:
- 2 Gigabit Ethernet NICs dedicated for Storage Traffic

The issue seems to be related to using more than one path per physical NIC in MPIO. If I hard-code the Round Robin policy to use only one active path per physical NIC, then my HA performance jumps to fully saturating both Gigabit NIC connections at 99% utilization. It does not seem to matter whether I send all the traffic to only one of the two nodes or to both nodes. So the issue does not appear to be HA-related or node-related, but simply MPIO-related:
[Attachment: MPIO Screen Shot.png]
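For illustration only, here is a small Python sketch (with made-up path data) of the "one active path per physical NIC" rule described above; the real policy is of course configured in the Windows MPIO path dialog (or with mpclaim), not in code:

Code:
# Hypothetical path list as (initiator NIC, target portal) pairs seen in MPIO.
# This only models the "one Active path per physical NIC" subset rule;
# it does not talk to MPIO itself.
paths = [
    ("NIC-A", "node1-subnetA"),
    ("NIC-A", "node2-subnetA"),
    ("NIC-B", "node1-subnetB"),
    ("NIC-B", "node2-subnetB"),
]

active, standby, seen_nics = [], [], set()
for nic, portal in paths:
    if nic not in seen_nics:      # first path on each NIC becomes Active/Optimized
        active.append((nic, portal))
        seen_nics.add(nic)
    else:                         # extra paths on the same NIC are left as Standby
        standby.append((nic, portal))

print("Active/Optimized:", active)
print("Standby:         ", standby)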
Any ideas?
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Thu Mar 15, 2012 12:41 pm

Do you have one or two switches? If two, are they interlinked? And if they are interlinked, how? Are your two iSCSI subnets on separate VLANs with separate LAGs between the switches?

Do you get better performance if you use one of the other MPIO modes like Least Blocks?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Fri Mar 16, 2012 4:06 pm

Dear jeffhamm,

Have you tested your synchronization channel (NTttcp and IOmeter)?
If not, could you please do so and share the results with us?
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Sun Mar 18, 2012 8:19 pm

Aitor,

I have two HP ProCurve Gigabit switches. They are interconnected via two 10 Gigabit Ethernet ports that are trunked together as HP trunk ports. The iSCSI subnets are on separate VLANs. The trunk ports carry the traffic for both VLANs, but as currently cabled, all of the NICs for one subnet are plugged into switch "A" and all of the NICs for the other subnet are plugged into switch "B". So no iSCSI traffic travels between the two switches.

When I switch the MPIO mode to Least Blocks, utilization increases to 75%, but not to 100% as it does when I configure the policy as Round Robin with Subset.
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Sun Mar 18, 2012 8:19 pm

Anatoly - here are my results for the 10 Gigabit Sync channel:

NTttcp - 67% - 6,661.305 Mbit/s

IOMeter - 67% - 775 MB/s
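(The percentages are just measured throughput divided by the 10 Gbit/s line rate; for anyone following along, a quick Python sketch of the conversion:)

Code:
# Convert measured throughput on the 10 Gbit/s sync channel
# to utilization and MB/s.
LINK_MBIT = 10_000

def report(mbit_s: float) -> None:
    print(f"{mbit_s:,.1f} Mbit/s = {mbit_s / LINK_MBIT:.0%} of 10 GbE "
          f"(~{mbit_s / 8:.0f} MB/s)")

report(6661.305)   # the NTttcp result above: ~67% of line rate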
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Mon Mar 19, 2012 4:21 pm

Well, to me the NTttcp result points to some networking issue.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Mon Mar 19, 2012 6:39 pm

Anatoly - you advised that it looked like a network issue to you. What in the results I sent points to that? What numbers should I expect to see using NTttcp?

Thanks!
Jeff
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Tue Mar 20, 2012 11:24 am

You should see at least 90% network utilization. Try changing the request size you used. If that doesn't help, I'd recommend experimenting with Jumbo Frames.
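One quick way to confirm Jumbo Frames actually pass end to end (assuming a 9000-byte MTU) is a don't-fragment ping with an 8972-byte payload, i.e. 9000 minus 20 bytes of IP header and 8 bytes of ICMP header. A small Windows-only Python sketch; the target address is just a placeholder:

Code:
# Check that jumbo frames pass end-to-end with a don't-fragment ping (Windows).
#   -f  set the Don't Fragment bit
#   -l  payload size in bytes (8972 = 9000 MTU - 20 IP - 8 ICMP)
#   -n  number of echo requests
import subprocess

TARGET = "10.0.0.2"   # placeholder: the iSCSI or sync interface to test

result = subprocess.run(
    ["ping", "-f", "-l", "8972", "-n", "4", TARGET],
    capture_output=True, text=True,
)
print(result.stdout)
if "needs to be fragmented" in result.stdout:
    print("Jumbo frames are NOT making it end-to-end on this path.")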
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Thu Mar 22, 2012 4:28 pm

Digging around the Windows event logs, it looks like my 10 Gigabit cards are installed in an x4 rather than an x8 PCI Express slot. I hope to get them moved over, and hopefully that will fix it.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Fri Mar 23, 2012 1:22 pm

I think it'll help.
Alrighty, let's wait for the update then.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Fri Mar 23, 2012 3:08 pm

The "x4 connection in x8 slot" event is an interesting one. I have seen this before with my Intel 10GbE NICs (even though my motherboard has only x8 slots), and also with Areca RAID cards, where sometimes at boot they would negotiate an x4 connection even though they should have been x8.

There are several possibilities here, assuming that the NIC is in an x8 slot:

1) It's a false positive. The NIC driver thinks the card has an x4 connection even though it has x8. This could be a bug in the driver or the NIC firmware, and might not be affecting actual throughput.

2) There's a bug in the firmware: at POST time, before the OS (and therefore the driver) has loaded, the NIC firmware is doing something wrong when it negotiates PCIe speed with the system. It's worth checking whether there are firmware updates for your NIC.

3) The system BIOS is causing the negotiation problem. See if there's an update for your motherboard.

4) Too many PCIe lanes in use. If you have lots of PCIe cards, check how many PCIe lanes your chipset actually has. For example, your system may have five x8 slots; if your chipset has fewer than 40 lanes available for expansion cards and you have several x8 cards, there may not be enough lanes for every card to get its full allocation.

Having said all that, I'd have thought that x4 PCIe 2.0 is enough for two 10GbE ports, provided they are not both at 100% full duplex simultaneously. But that assumes good load balancing; it could be that balancing is suboptimal without a full x8. There is one NIC I know of that is 6x 10GbE on PCIe 2.0 x16 and claims wire speed (i.e. 120 Gbit/s FDX) when used in an x16 slot. That would imply you need about 2.66 lanes per 10GbE port to run full duplex.
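Rough lane math to back that up (a Python sketch, taking PCIe 2.0 at about 4 Gbit/s of usable bandwidth per lane per direction after 8b/10b encoding; figures are approximate and ignore protocol overhead):

Code:
# Approximate PCIe 2.0 bandwidth: 5 GT/s per lane, 8b/10b encoding ->
# ~4 Gbit/s usable per lane, per direction. Protocol overhead ignored.
USABLE_GBIT_PER_LANE = 4.0

for lanes in (4, 8, 16):
    each_way = lanes * USABLE_GBIT_PER_LANE
    print(f"x{lanes:<2}: ~{each_way:.0f} Gbit/s each way, "
          f"~{2 * each_way:.0f} Gbit/s full duplex")

# The 6-port, x16 example above: lanes needed per 10GbE port at full duplex.
print(f"~{16 / 6:.2f} lanes per 10GbE port")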
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Mon Mar 26, 2012 2:14 am

There's another case... it may simply not work. We have a collection of mainboards here @ our lab working with single, dual or triple 10 GbE NICs and doing wire speed. And also a collection of (surprisingly, server!) mainboards not willing to do wire speed even with a single 10 GbE NIC. So... ALWAYS TEST WHAT YOU WANT TO PUT INTO PRODUCTION. That's the pain and the gain of being a software SAN/NAS - you can use the hardware of your choice, but you can also easily assemble something non-working from scratch.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Wed Mar 28, 2012 9:25 am

Could you name and shame those motherboards? Or, more usefully, see if there is some common factor, e.g. manufacturer or chipset. It could be that even if they have proper x8 slots, the slots go through some kind of hub, so the overall system bandwidth is limited.

A 10GbE card is going to be pushing a lot less data than a decent graphics card, and server boards are supposed to be built for I/O throughput, so it is a bit shocking if a server motherboard can't sustain the data rates.

Have you had a chance to look at any of the PCIe 3.0 stuff on a new server (the new Xeons) with 40GbE? Desktop systems (essentially gaming rigs) have had PCIe 3.0 for a while now, so they were the only option until recently.
Bohdan (staff)
Staff
Posts: 435
Joined: Wed May 23, 2007 12:58 pm

Wed Mar 28, 2012 3:36 pm

Hi Aitor,
Intel S5520HC
The description is here:
http://communities.intel.com/thread/20591?tstart=0
http://communities.intel.com/message/120136#120136

The problem was with Intel S5520HC motherboards. Using Intel 10 Gigabit AT cards that are on the tested hardware list, they did not show the full 10 Gb/s value.

Motherboards that show 10 Gb/s performance:
Gigabyte P55-UD3L
ASUS Maximus IV Extreme-Z
Gigabyte Z68X-UD7-B3
Gigabyte X79-UD3
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Wed Mar 28, 2012 8:43 pm

OK, after moving my 10 Gigabit cards to the x8 slots and updating the BIOS on both servers, I can now get 99% utilization on my 10 Gigabit sync channel using NTttcp (99% - 9,831.570 Mbit/s).

But I'm still getting poor results using IOMeter with full Round Robin across all 4 paths to the storage (30-35 MB/s, 10-15% utilization on both Gigabit links). I'm only able to get good performance when I select "Round Robin with Subset" and manually select 2 of the paths (240-250 MB/s, 99% utilization on both Gigabit links).

Any other ideas on why Full Round Robin is still getting such bad numbers?

Thanks!
Jeff