Questions about the current HCI industry highest performance: 26M IOPS

Software-based VM-centric and flash-friendly VM storage + free version


jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Tue Jul 20, 2021 7:45 pm

Hello support guys!
I read your paper https://www.starwindsoftware.com/hyperc ... high-score and have a few questions:

Part of your hardware specification:
Platform: Supermicro SuperServer 2029UZ-TR4+
Networking: 2x Mellanox ConnectX-5 MCX516A-CCAT 100GbE Dual-Port NIC
Switch: 2x Mellanox SN2700 32 Spectrum ports 100GbE Ethernet Switch

The interconnection diagram:
[Attachment: sheet1.png]
How many ports do you use for sync?
Are the sync NIC port(s) (1 in the diagram above) connected directly (as shown in the picture below) or through a switch?
[Attachment: sheet2.png]
According to your description, we should place one NIC per CPU, as shown in the figure below:
[Attachment: sheet3.jpg]
How should I connect P11-P22?

Could you please explain this in more detail?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Thu Jul 22, 2021 4:09 pm

Thank you for your question. Each part of the study shows the interconnections at the very beginning. And yes, iSCSI and sync go through switches.
The GRID architecture implies having 2x Sync and 2x iSCSI connections (https://www.starwindsoftware.com/resour ... hitecture/). If you build a large cluster, you cannot provide enough ports for direct connections; that is why we used 2 Sync and 2 iSCSI links.
The 2nd interconnect screenshot you shared here is related to HCA interconnects, which is a 2-node configuration with 1 Sync and 1 iSCSI. That is the bare minimum configuration. There, we recommend direct connections over switched ones to ensure maximum redundancy.
The standard 2-node configuration is a "building block" of a GRID setup.
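To put the two layouts side by side, here is a small illustrative summary (my own sketch, not an exact copy of the study's diagrams):
[code]
# Illustrative summary of the two interconnect layouts discussed above.
# Only the link counts come from the architecture description; the names
# and the printing are just for clarity.
LAYOUTS = {
    "2-node HCA (bare minimum)": {"sync": 1, "iscsi": 1, "switched": False},
    "GRID building block":       {"sync": 2, "iscsi": 2, "switched": True},
}

for name, layout in LAYOUTS.items():
    wiring = "via switches" if layout["switched"] else "direct"
    print(f"{name}: {layout['sync']}x Sync + {layout['iscsi']}x iSCSI, {wiring}")
[/code]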
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 7:01 am

Hello Yaroslav!
Thank you a lot for your information!
I have read the information about the GRID technology and I am impressed, but I cannot find any detailed information about the interconnection diagram :(.
I suppose there are two possible interconnection variants (please see the pictures below):
[Attachment: base0.jpg]
and
[Attachment: crossnuma.jpg]
In the first case, we do not have cross-NUMA traffic, but if a switch goes down (maybe for a reboot), then a full sync is required after the switch comes back up. So, as I understand, this diagram can be used only for a simple (two-way) mirror.
In the second case, we have full redundancy, but cross-NUMA traffic too.
Which diagram is the right one?
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 7:28 am

Each article shows a diagram like that.
You can see how the NVMe drives and NICs are bound to different CPU sockets. These diagrams also demonstrate how the servers are connected.
[Attachment: Screenshot_1.png (example diagram)]
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 8:21 am

Dear Yaroslav,
I remember your picture.
So, in your picture we have two switches and two dual-port cards.
Could you please answer a few questions:
1. Is the NIC on a given NUMA node (for example, NUMA node 0) used for both sync and iSCSI (for example, port 1 for sync and port 2 for iSCSI)?
2. Are the NICs on different NUMA nodes connected to different switches?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 8:45 am

Hello,

On every server, each NUMA node has 1x Intel® SSD D3-S4510, 1x Intel® Optane™ SSD DC P4800X Series, and 1x Mellanox ConnectX-5 100GbE Dual-Port NIC. Each NIC carries a set of iSCSI and Sync links.
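If you want to double-check that kind of NUMA locality yourself, here is a minimal sketch (assuming a Linux host with PCIe NICs and NVMe drives; the benchmark nodes used their own OS and tooling, so treat this only as an illustration):
[code]
# Print the NUMA node each NIC and NVMe controller is attached to by
# reading the sysfs attributes of the underlying PCI devices (Linux only).
# "n/a" is printed for virtual interfaces that have no PCI device.
from pathlib import Path

def numa_of(class_dev: Path) -> str:
    node_file = class_dev / "device" / "numa_node"
    return node_file.read_text().strip() if node_file.exists() else "n/a"

for nic in sorted(Path("/sys/class/net").iterdir()):
    print(f"NIC  {nic.name:<12} NUMA node {numa_of(nic)}")

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    print(f"NVMe {ctrl.name:<12} NUMA node {numa_of(ctrl)}")
[/code]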
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 8:51 am

OK, so each NIC is used for both sync and iSCSI (one port for sync and one for iSCSI).

What about the second question?
2. Are the NICs on different NUMA nodes connected to different switches?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 9:20 am

Oops, looks like I missed that :/ My sincere apologies.
Yes, they go to different switches.
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 9:39 am

Like in the picture below?
[Attachment: parallel.jpg]
Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 10:22 am

Yes, I believe so.
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 10:36 am

OK.
If one of the switches goes down (for example, a reboot after a firmware update), will the whole cluster go down too?

With best regards,
Yury
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 10:46 am

Dear Yaroslav,
Additionally, I don't understand the following:
You use dual-port 100 Gbit NICs and 100 Gbit switches, but the server that you use has only one PCIe 3.0 x16 slot, and the other slots are PCIe 3.0 x8.
The throughput of an x16 slot is 128 Gbit/s, but the NIC is dual-port, so we can get 100 Gbit/s per NIC port only in half duplex.
The other slots can give us only 64 Gbit/s of throughput.
Why, then, do you use 100 Gbit NICs?
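Here is the back-of-the-envelope math behind my numbers (raw PCIe 3.0 line rate; real usable throughput is a bit lower after 128b/130b encoding and protocol overhead):
[code]
# Back-of-the-envelope PCIe 3.0 bandwidth per slot width, per direction.
# 8 GT/s per lane raw; the usable payload rate is lower after 128b/130b
# encoding (TLP/DLLP protocol overhead is ignored here).
RAW_GT_PER_LANE = 8.0      # PCIe 3.0 transfer rate per lane
ENCODING = 128 / 130       # 128b/130b line encoding

for lanes in (8, 16):
    raw = RAW_GT_PER_LANE * lanes      # Gbit/s per direction, raw
    usable = raw * ENCODING            # Gbit/s per direction after encoding
    print(f"PCIe 3.0 x{lanes}: {raw:.0f} Gbit/s raw, ~{usable:.0f} Gbit/s usable")

print("A dual-port 100 GbE NIC at full line rate needs 200 Gbit/s per direction.")
[/code]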

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 11:31 am

There are 3 switches in each diagram. If one or even two 100 GbE switches go down, there is still one management/heartbeat switch that helps avoid StarWind VSAN and cluster split-brain.
Regarding your question about adapters: we used 100 GbE NICs to avoid networking becoming a bottleneck. We also do not recommend teaming for iSCSI and Sync, so 100 GbE adapters were the only option.
jdeshin
Posts: 63
Joined: Tue Sep 08, 2020 11:34 am

Sat Jul 24, 2021 12:19 pm

> There are 3 switches in each diagram. If one or even two 100 GbE switches go down, there is still one management/heartbeat switch that helps avoid StarWind VSAN and cluster split-brain.
But that does not solve it, because the cluster will still go down when one of the switches goes down :(.
> We used 100 GbE NICs to avoid networking becoming a bottleneck.
I think you will hit a bottleneck in your PCIe lanes, so you cannot get the full 100 Gbit/s throughput.

I suspect it's not that simple :)
Could you please consult with the people who ran this test?

Best regards,
Yury
yaroslav (staff)
Staff
Posts: 2356
Joined: Mon Nov 18, 2019 11:11 am

Sat Jul 24, 2021 12:48 pm

> But that does not solve it, because the cluster will still go down when one of the switches goes down :(.
We set up redundant cluster communication links in each of our deployments. One link, e.g., Management, is for cluster communication and clients. Another one, Sync, is for cluster-only communication. So, cluster split-brain is not likely to happen.
> I think you will hit a bottleneck in your PCIe lanes, so you cannot get the full 100 Gbit/s throughput.
Sure, I will check with them on this.