Questions about the current HCI industry performance record: 26M IOPS

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Tue Jul 20, 2021 7:45 pm

Hello support guys!
I read your paper https://www.starwindsoftware.com/hyperc ... high-score and have a few questions:

Part of your hardware specification:
Platform: Supermicro SuperServer 2029UZ-TR4+
Networking: 2x Mellanox ConnectX-5 MCX516A-CCAT 100GbE Dual-Port NIC
Switch: 2x Mellanox SN2700 32 Spectrum ports 100GbE Ethernet Switch

The interconnection sheet:
sheet1.png
How many ports do you use for sync?
Is the Sync NIC port (1 in the sheet above) connected directly (as shown in the picture below) or through a switch?
sheet2.png
According to your description, we should place one NIC per CPU, as shown in the figure below:
sheet3.jpg
How should I connect P11-P22?

Can you please give me more explanation?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Thu Jul 22, 2021 4:09 pm

Thank you for your question. Each part of the study showcases the interconnections at the very beginning. And yes, iSCSI and Sync traffic goes through switches.
The GRID architecture implies having 2x Sync and 2x iSCSI connections https://www.starwindsoftware.com/resour ... hitecture/. If you build a large cluster, you cannot provide enough ports for direct connections; that is why we used 2 Sync and 2 iSCSI links.
The 2nd interconnect screenshot you have shared here relates to HCA interconnects, which is a 2-node configuration with 1 Sync and 1 iSCSI. That is the bare minimum configuration. There, we recommend direct connections over switched ones to ensure maximum redundancy.
The standard 2-node configuration is a "building block" of a GRID setup.
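The port math behind that recommendation can be sketched roughly. This is an illustrative sketch under my own assumptions (a switchless design needs a full mesh of links, and the function names are mine, not StarWind's):

```python
# Rough port math for direct (switchless) links, assuming a full mesh:
# every node pairs with every other node on each network (Sync, iSCSI).

def direct_ports_per_node(nodes: int, networks: int = 2) -> int:
    """Ports each node needs for switchless full-mesh links."""
    return (nodes - 1) * networks

def switched_ports_per_node(networks: int = 2, uplinks: int = 2) -> int:
    """Ports each node needs with redundant switches: uplinks per network."""
    return networks * uplinks

# 2 nodes: direct links are cheap (2 ports per node).
# 6 nodes: 10 ports per node, which no longer fits a pair of dual-port NICs.
# Switched stays constant at 4 ports (2x Sync + 2x iSCSI) regardless of size.
print(direct_ports_per_node(2), direct_ports_per_node(6),
      switched_ports_per_node())
```

This is why direct connections are fine for the 2-node "building block" but do not scale to a larger GRID.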
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 7:01 am

Hello Yaroslav!
Thank you very much for the information!
I have read about the GRID technology and it is impressive, but I cannot find any detailed information about the interconnection scheme :(.
I suppose there are two possible interconnection variants (please see the pictures below):
base0.jpg
and
crossnuma.jpg
In the first case, there is no cross-NUMA traffic, but if a switch goes down (for example, for a reboot), a full sync is required after it comes back up. So, as I understand it, this scheme can be used only for a simple (two-way) mirror.
In the second case, we have full redundancy, but cross-NUMA traffic as well.
Which scheme is correct?
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 7:28 am

Each article showcases a diagram like that.
You can see how NVMe and NICs are bound to different CPU sockets. Also, these diagrams demonstrate how servers are connected.
Attachments
Screenshot_1.png
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 8:21 am

Dear Yaroslav,
I remember your picture.
So, in your picture we have two switches and two dual-port cards.
Could you please answer a few questions:
1. Is the NIC on a given NUMA node (for example, NUMA node 0) used for both Sync and iSCSI (for example, port 1 for Sync and port 2 for iSCSI)?
2. Are the NICs on a given NUMA node connected to different switches?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 8:45 am

Hello,

On every server, each NUMA node has 1x Intel® SSD D3-S4510, 1x Intel® Optane™ SSD DC P4800X Series, and 1x Mellanox ConnectX-5 100GbE Dual-Port NIC. Each NIC carries one iSCSI link and one Sync link.
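If it helps, the NUMA placement of a NIC can be verified from Linux sysfs. A minimal sketch (the helper name and the sysfs-root parameter are mine, added for illustration and testability):

```python
from pathlib import Path

def nic_numa_node(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the NUMA node a NIC's PCIe device is attached to.

    Returns -1 when the kernel does not expose the attribute
    (e.g. virtual interfaces, or NUMA-unaware platforms).
    """
    node_file = Path(sysfs_root) / iface / "device" / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        return -1

# Example: nic_numa_node("ens1f0") returning 0 means the NIC hangs off
# socket 0, so Sync/iSCSI traffic on it avoids cross-NUMA hops.
```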
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 8:51 am

OK, so each NIC is used for both Sync and iSCSI (one port for Sync and one for iSCSI).

What about the second question?
2. Are the NICs on a given NUMA node connected to different switches?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 9:20 am

Oops, looks like I missed that :/ My sincere apologies.
Yes, they go to different switches.
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 9:39 am

Like in the picture below?
parallel.jpg
Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 10:22 am

Yes, I believe so.
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 10:36 am

OK.
If one of the switches goes down (for example, rebooting after a firmware update), will the whole cluster go down too?

With best regards,
Yury
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 10:46 am

Dear Yaroslav,
Additionally, I don't understand the following:
You use dual-port 100 Gbit NICs and 100 Gbit switches, but the server you use has only one PCIe 3.0 x16 slot; the other slots are PCIe 3.0 x8.
The throughput of an x16 slot is about 128 Gbit/s, but with a dual-port NIC we can get 100 Gbit/s per port only in half duplex.
NICs in the other slots can give us only about 64 Gbit/s of throughput.
Why then do you use 100 Gbit NICs?
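The arithmetic behind this concern can be sketched as follows (the constants are the standard PCIe 3.0 figures; the helper function is illustrative, and note that PCIe links are full duplex, so the limit applies per direction):

```python
# PCIe 3.0 usable bandwidth: 8 GT/s per lane with 128b/130b line encoding.
# PCIe links are full duplex, so this figure applies to each direction.

def pcie3_gbit_per_s(lanes: int) -> float:
    """Usable PCIe 3.0 bandwidth in Gbit/s, per direction."""
    return 8.0 * (128 / 130) * lanes

x16 = pcie3_gbit_per_s(16)   # ~126 Gbit/s per direction
x8 = pcie3_gbit_per_s(8)     # ~63 Gbit/s per direction

# One 100 GbE port fits within x16 at full duplex; two ports fully
# loaded would need 200 Gbit/s per direction but are capped at ~126.
print(f"x16: {x16:.0f} Gbit/s, x8: {x8:.0f} Gbit/s")
```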

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 11:31 am

There are 3 switches in each diagram. If one or even 2 of the 100 GbE switches go down, there is still one management/heartbeat switch that helps to avoid StarWind VSAN and cluster split-brain.
Regarding your question about adapters: we used 100 GbE NICs to avoid networking becoming a bottleneck. We also do not recommend teaming for iSCSI and Sync, so 100 GbE adapters were the only option.
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 12:19 pm

There are 3 switches in each diagram. If one or even 2 100 GBE switches go down, there is still one Management-heartbeat switch that helps to avoid StarWind VSAN and Cluster split-brain.
But that is not a solution, because the cluster will still go down when one of the switches goes down :(.
We used 100 GBE NICs to avoid networking being a bottleneck
I think you will hit a bottleneck in your PCIe lanes, so you cannot get 100 Gbit/s throughput.

I suspect it's not that simple :)
Could you please consult the people who ran this test?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2340
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 12:48 pm

But it is not a solution, because the cluster will goes down when one of the switches goes down :(.
We set up redundant cluster communication links in each of our deployments. One link, e.g. Management, is for cluster communication and clients. Another one, Sync, is for cluster-only communication. So, a cluster split-brain is unlikely to happen.
I think, you will get bottleneck in your PCIe lanes, so you cannot get 100 Gbit/s throughput.
Sure, I will chat with them about this.
Locked