Best practices for min latency / max throughput?

Initiator (iSCSI, FCoE, AoE, iSER and NVMe over Fabrics), iSCSI accelerator and RAM disk

Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sat May 03, 2025 11:10 pm

Hi Folks, I am connected to NVMe-oF devices from a remote enterprise array via RoCE fabric switches (Arista), have applied what I believe are Mellanox best practices for lossless RoCEv2 on Windows, and have configured MPIO.

However, the minimum latency I am able to achieve (1T/QD1 4k reads) is in the >250us range, which is >4x higher than the same server/same adapters running Linux. Similarly, the maximum 4k read IOPS I can achieve is ~2.6M, vs ~5.4M on Linux.

I know not to expect identical performance, but a 2-4x gap is too wide to accept. I have seen reviews of the StarWind initiator reporting sub-20us latencies, so I have to assume there is a set of known best-practice settings/configs I haven't applied yet that could narrow the gap.

Windows settings, NIC config, StarWind options...I'm all ears for any best-practice recipes you folks have followed. Let me know, thanks!
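
For reference, this is roughly how I sanity-check the RDMA/PFC side on the Windows host (just a sketch in Python that shells out to the in-box DCB and RDMA cmdlets; the assumption that priority 3 is the lossless class comes from my fabric config, so adjust as needed):

import subprocess

def ps(command: str) -> str:
    # Run a PowerShell command and return its text output.
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# RDMA must be enabled on the data NICs.
print(ps("Get-NetAdapterRdma | Format-Table Name, Enabled"))

# PFC should be enabled only on the lossless priority (assumed to be 3 here).
print(ps("Get-NetQosFlowControl | Format-Table Priority, Enabled"))

# Confirm the NICs expose DCB/QoS state to the OS.
print(ps("Get-NetAdapterQos | Format-List Name, Enabled, OperationalFlowControl"))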
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sun May 04, 2025 12:18 am

additional info:

I'm leveraging 16 ports on the array side and 2 x dual-port NICs on the host side for connectivity...so I wonder about the connection defaults:

I noticed that by default, NUMA node 0 is selected for all connections, with firstcore 0. Does this pin the connection to that NUMA node? If so, I would have assumed the NUMA node that owns the adapter would be auto-selected, but the documentation says nothing about selecting NUMA nodes. Do I have to manually identify which NUMA node the cards are attached to and force it?

How does it select ioqdepth and numioqs by default? Should those match something else I need to manually check?

Not sure if this will impact performance or not, but figured I would ask.
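
In case it's relevant, this is the quick sketch I use to see which NUMA node owns each adapter; it just wraps Get-NetAdapterHardwareInfo, which reports a NumaNode column, so nothing here is StarWind-specific:

import subprocess

command = (
    "Get-NetAdapterHardwareInfo | "
    "Select-Object Name, NumaNode, Bus, Device, Function | "
    "Format-Table -AutoSize"
)
# Prints one row per adapter with the NUMA node that owns its PCIe slot.
print(subprocess.run(
    ["powershell", "-NoProfile", "-Command", command],
    capture_output=True, text=True, check=True,
).stdout)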
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sun May 04, 2025 3:58 am

Update: I tested setting the NUMA node per connection to match the NUMA node Windows reports for each adapter, and I don't see any appreciable difference.
yaroslav (staff)
Staff
Posts: 3459
Joined: Mon Nov 18, 2019 11:11 am

Sun May 04, 2025 2:59 pm

Thanks for your question!
Don't use MPIO unless you need it (i.e., you are using multiple paths): my colleagues reported a performance hit when MPIO is in the stack. The StarWind NVMe-oF Initiator works well with one session per link, so if you have one link, just go with a single NVMe-oF session. If you have two or more links going to that array, MPIO might be useful (still use one session per link).
As a side note, 1T/QD1 is not an optimal performance testing pattern for NVMe-oF.
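
For example, something closer to a saturation test would be a diskspd run with several threads and a deeper queue per thread. The sketch below is only illustrative; the target disk, thread count, and queue depth are assumptions you would tune to your core count and link speed:

import subprocess

target = r"\\.\PhysicalDrive2"   # assumption: the NVMe-oF disk under test
cmd = [
    "diskspd.exe",
    "-b4K",   # 4 KiB blocks
    "-r",     # random access
    "-w0",    # 100% reads
    "-Sh",    # disable software and hardware buffering
    "-t8",    # 8 worker threads (assumption: scale to physical cores)
    "-o32",   # 32 outstanding I/Os per thread
    "-d30",   # 30-second measurement
    "-L",     # collect latency statistics
    target,
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)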
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sun May 04, 2025 5:18 pm

I don't see a way to avoid multipathing if I want to ensure high availability. There will always be more than one path to a device for reliability reasons.

With that in mind, when you suggest one session per link, do you mean choosing a single path, out of the many available, for each physical port on the host?

My environment:
Host ports: a, b, c, d
Primary controller ports: 1, 2, 3, 4, 5, 6, 7, 8
Secondary controller ports: 9, 10, 11, 12, 13, 14, 15, 16

Are you suggesting I limit myself to one session from each of ports a, b, c, and d? That would mean I have to choose which array port each one is limited to (i.e., a>1, b>9, c>5, d>13), and it would cut potential performance in half should a controller go down...assuming I want the solution to be highly available. If you mean something else, please be specific.
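
To spell out that trade-off with a quick illustrative sketch (using only the example mapping above):

# Illustrative only: one session per host port, mapped as in the example above.
sessions = {"a": 1, "b": 9, "c": 5, "d": 13}      # host port -> array port
primary_ports = set(range(1, 9))                  # ports 1-8 live on the primary controller

# Simulate losing the primary controller: only sessions to ports 9-16 survive.
surviving = {host: port for host, port in sessions.items() if port not in primary_ports}
print(surviving)   # {'b': 9, 'd': 13}: 2 of 4 host ports still have a path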
yaroslav (staff)
Staff
Posts: 3459
Joined: Mon Nov 18, 2019 11:11 am

Sun May 04, 2025 6:26 pm

I don't see a way to avoid Multipath if I want to ensure high availability.
Fair enough. You can't do HA without MPIO. The problem is that any form of replication implies a performance hit that depends on the number of mirrors.
To start with, what is the storage configuration? How does the cluster look? A diagram could help.
Also, we strongly recommend avoiding any VLANs or teaming (they can sometimes kill performance).

Try peeling off the layers: start with a single host and one session to your target, and see what the performance looks like.

P.S. If you are using the StarWind NVMe-oF highly available target, you might be part of the beta. Let me know if that's the case.
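
If you want a quick way to confirm nothing like that is configured on the data ports, a sketch along these lines works; it only calls the in-box LBFO and adapter-property cmdlets, and the 'VLAN ID' display name is an assumption since it varies by driver:

import subprocess

def ps(command: str) -> str:
    # Run a PowerShell command and return its text output.
    return subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    ).stdout

# Any LBFO teams defined on the host? (Server SKUs only; empty output means none.)
print(ps("Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm"))

# Per-adapter VLAN ID advanced property; the display name varies by driver vendor.
print(ps("Get-NetAdapterAdvancedProperty -DisplayName 'VLAN ID' | "
         "Format-Table Name, DisplayValue"))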
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Tue May 06, 2025 6:16 am

No replication, VLANs, or teaming are involved. The HA array controllers present as a single array from the host's perspective. Attached is a full-mesh network diagram of my environment.

Each array port has a unique IP, so I am making 16 connections to the array. Without MPIO, this results in 16 LUNs for EACH LUN presented to the host. A single session is not an option, as it would leave the host with no path to storage after just a single fabric failure. You could argue that I don't need all 16 ports, but I need at LEAST 4 connections to be able to max out the throughput of the available host-side NICs.

I have tested with 4 connections and the performance is essentially unchanged.
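
For what it's worth, this is how I'm confirming the MPIO disk and path counts on the host (a sketch that just calls mpclaim; disk numbering is whatever MPIO reports on this system):

import subprocess

# "mpclaim -s -d" lists every MPIO-claimed disk with its load-balance policy;
# "mpclaim -s -d <n>" shows the individual paths for disk n.
print(subprocess.run(["mpclaim", "-s", "-d"], capture_output=True, text=True).stdout)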
Attachments
Screenshot 2025-05-05 at 11.30.19 PM.png (184.4 KiB)
yaroslav (staff)
Staff
Posts: 3459
Joined: Mon Nov 18, 2019 11:11 am

Tue May 06, 2025 8:04 am

Thanks for your really helpful explanation.
I understand that "no MPIO" is not an option. Still, purely for test purposes, I would like to try a single connection to see whether it is MPIO eating the performance.