Best practices for min latency / max throughput?

Initiator (iSCSI, FCoE, AoE, iSER and NVMe over Fabrics), iSCSI accelerator and RAM disk

Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sat May 03, 2025 11:10 pm

Hi Folks, I am connected to NVMe-oF devices from a remote enterprise array via RoCE fabric switches (Arista), have applied what I believe are Mellanox best practices for lossless RoCEv2 on Windows, and have configured MPIO.

However, the minimum latency I am able to achieve (1T/QD1 4k reads) is in the >250us range, which is >4x higher than the same server/same adapters running Linux. Similarly, the maximum 4k read IOPS I can achieve is ~2.6M, vs ~5.4M on Linux.

I know not to expect identical performance, but a 2-4x gap is too wide to accept. I have seen reviews of the StarWind initiator reporting sub-20us latencies, so I have to assume there is a set of known best-practice settings/configs I haven't applied yet that could narrow the gap.

Windows settings, NIC config, StarWind options...I'm all ears for any best-practice recipes you folks have followed. Let me know, thanks!
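
For reference, this is roughly how I sanity-check the RDMA/PFC side on the Windows host (just a sketch in Python that shells out to the in-box DCB and RDMA cmdlets; the assumption that priority 3 is the lossless class comes from my fabric config, so adjust as needed):

import subprocess

def ps(command: str) -> str:
    # Run a PowerShell command and return its text output.
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# RDMA must be enabled on the data NICs.
print(ps("Get-NetAdapterRdma | Format-Table Name, Enabled"))

# PFC should be enabled only on the lossless priority (assumed to be 3 here).
print(ps("Get-NetQosFlowControl | Format-Table Priority, Enabled"))

# Confirm the NICs expose DCB/QoS state to the OS.
print(ps("Get-NetAdapterQos | Format-List Name, Enabled, OperationalFlowControl"))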
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sun May 04, 2025 12:18 am

additional info:

I'm leveraging 16 ports on the array side and 2 x dual-port NICs on the host side for connectivity...so I wonder about the connection defaults:

I noticed that by default, NUMA node 0 is selected for all connections, with firstcore 0. Does this pin the connection to that NUMA node? If so, I would have assumed the NUMA node that owns the adapter would be auto-selected, but the documentation says nothing about selecting NUMA nodes. Do I have to manually identify which NUMA node the cards are attached to and force it?

How does it select ioqdepth and numioqs by default? Should those match something else I need to manually check?

Not sure if this will impact performance or not, but figured I would ask.
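
In case it's relevant, this is the quick sketch I use to see which NUMA node owns each adapter; it just wraps Get-NetAdapterHardwareInfo, which reports a NumaNode column, so nothing here is StarWind-specific:

import subprocess

command = (
    "Get-NetAdapterHardwareInfo | "
    "Select-Object Name, NumaNode, Bus, Device, Function | "
    "Format-Table -AutoSize"
)
# Prints one row per adapter with the NUMA node that owns its PCIe slot.
print(subprocess.run(
    ["powershell", "-NoProfile", "-Command", command],
    capture_output=True, text=True, check=True,
).stdout)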
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sun May 04, 2025 3:58 am

Update: I tested setting the NUMA node per connection to match the NUMA node Windows reports for each adapter, and I don't see any appreciable difference.
yaroslav (staff)
Staff
Posts: 3459
Joined: Mon Nov 18, 2019 11:11 am

Sun May 04, 2025 2:59 pm

Thanks for your question!
Don't use MPIO unless you need it (i.e., you are using multiple paths): my colleagues reported a performance hit when MPIO is in the stack. The StarWind NVMe-oF Initiator works well with one session per link, so if you have one link, just go with a single NVMe-oF session. If you have two or more links going to that array, MPIO might be useful (still use one session per link).
As a side note, 1T/QD1 is not an optimal performance testing pattern for NVMe-oF.
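
For example, something closer to a saturation test would be a diskspd run with several threads and a deeper queue per thread. The sketch below is only illustrative; the target disk, thread count, and queue depth are assumptions you would tune to your core count and link speed:

import subprocess

target = r"\\.\PhysicalDrive2"   # assumption: the NVMe-oF disk under test
cmd = [
    "diskspd.exe",
    "-b4K",   # 4 KiB blocks
    "-r",     # random access
    "-w0",    # 100% reads
    "-Sh",    # disable software and hardware buffering
    "-t8",    # 8 worker threads (assumption: scale to physical cores)
    "-o32",   # 32 outstanding I/Os per thread
    "-d30",   # 30-second measurement
    "-L",     # collect latency statistics
    target,
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)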
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Sun May 04, 2025 5:18 pm

I don't see a way to avoid multipathing if I want to ensure high availability. There will always be more than one path to a device for reliability reasons.

With that in mind, when you suggest one session per link, do you mean choosing a single path, out of the many available, for each physical port on the host?

My environment:
Host ports: a, b, c, d
Primary controller ports: 1, 2, 3, 4, 5, 6, 7, 8
Secondary controller ports: 9, 10, 11, 12, 13, 14, 15, 16

Are you suggesting I limit myself to one session from each of ports a, b, c, and d? That would mean I have to choose which array port each one is limited to (i.e., a>1, b>9, c>5, d>13), and it would cut potential performance in half should a controller go down...assuming I want the solution to be highly available. If you mean something else, please be specific.
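
To spell out that trade-off with a quick illustrative sketch (using only the example mapping above):

# Illustrative only: one session per host port, mapped as in the example above.
sessions = {"a": 1, "b": 9, "c": 5, "d": 13}      # host port -> array port
primary_ports = set(range(1, 9))                  # ports 1-8 live on the primary controller

# Simulate losing the primary controller: only sessions to ports 9-16 survive.
surviving = {host: port for host, port in sessions.items() if port not in primary_ports}
print(surviving)   # {'b': 9, 'd': 13}: 2 of 4 host ports still have a path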
yaroslav (staff)
Staff
Posts: 3459
Joined: Mon Nov 18, 2019 11:11 am

Sun May 04, 2025 6:26 pm

I don't see a way to avoid Multipath if I want to ensure high availability.
Fair enough. You can't do HA without MPIO. The problem is that any form of replication implies a performance hit that depends on the number of mirrors.
To start with, what is the storage configuration? How does the cluster look? A diagram could help.
Also, we strongly recommend avoiding any VLANs or teaming (they can sometimes kill performance).

Try peeling off the layers: start with a single host and one session to your target, and see what the performance looks like.

P.S. If you are using the StarWind NVMe-oF highly available target, you might be part of the beta. Let me know if that's the case.
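
If you want a quick way to confirm nothing like that is configured on the data ports, a sketch along these lines works; it only calls the in-box LBFO and adapter-property cmdlets, and the 'VLAN ID' display name is an assumption since it varies by driver:

import subprocess

def ps(command: str) -> str:
    # Run a PowerShell command and return its text output.
    return subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    ).stdout

# Any LBFO teams defined on the host? (Server SKUs only; empty output means none.)
print(ps("Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm"))

# Per-adapter VLAN ID advanced property; the display name varies by driver vendor.
print(ps("Get-NetAdapterAdvancedProperty -DisplayName 'VLAN ID' | "
         "Format-Table Name, DisplayValue"))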
Lex
Posts: 12
Joined: Wed Apr 30, 2025 1:37 am

Tue May 06, 2025 6:16 am

No replication, VLANs, or teaming are involved. The HA array controllers present as a single array from the host's perspective. Attached is a full-mesh network diagram of my environment.

Each array port has a unique IP, so I am making 16 connections to the array. Without MPIO, this results in 16 LUNs for EACH LUN presented to the host. A single session is not an option, as it would leave the host with no path to storage after just a single fabric failure. You could argue that I don't need all 16 ports, but I need at LEAST 4 connections to be able to max out the throughput of the available host-side NICs.

I have tested with 4 connections and the performance is essentially unchanged.
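
For what it's worth, this is how I'm confirming the MPIO disk and path counts on the host (a sketch that just calls mpclaim; disk numbering is whatever MPIO reports on this system):

import subprocess

# "mpclaim -s -d" lists every MPIO-claimed disk with its load-balance policy;
# "mpclaim -s -d <n>" shows the individual paths for disk n.
print(subprocess.run(["mpclaim", "-s", "-d"], capture_output=True, text=True).stdout)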
Attachments
Screenshot 2025-05-05 at 11.30.19 PM.png (184.4 KiB)
yaroslav (staff)
Staff
Posts: 3459
Joined: Mon Nov 18, 2019 11:11 am

Tue May 06, 2025 8:04 am

Thanks for your really helpful explanation.
I understand that "no MPIO" is not an option. Still, purely for test purposes, I would like to try a single connection to see whether it is MPIO eating the performance.