HCI Lab ng: All NVMe, 3x R740 with 10G and 100G

Software-based VM-centric and flash-friendly VM storage + free version


qwertz
Posts: 38
Joined: Wed Dec 12, 2012 3:47 pm

Mon Dec 30, 2024 11:29 am

Good morning Gents!
It's been a while :) The last StarWind-powered cluster I created/ran is probably more than 10 years in the past.
(StarWind v8 back in 2014, I was one of the beta testers.)
I "played" a lot with MS S2D and VMware vSAN and went down all (...okay, a lot of...) rabbit holes.
Now I am at a point where I have the liberty to set up the "next generation" cluster for my home lab, which is very exciting for me.
It took me years to get my hands on the right gear and put together a powerful selection of hardware to push the limits a bit further.
My old cluster (S2D) was based on 2x Dell R720s with three-tiered storage: NVMe, SSD and spinning rust, synced over two 40G ConnectX-2 cards.
That worked pretty well. It ran for over 5 years and I never encountered unexpected outages while enjoying blazing-fast performance.
Since I am a techie at heart and doing the same thing over and over again is kinda boring, I am now searching for a "new" way to do things and to learn along the way.

For the new HCI-Cluster I have three identical R740 Servers, each with:
- 2x Xeon(R) Gold 6134
- 256 GB DDR4 Memory
- 2x Mellanox ConnectX-3 VPI: each with 2x25GBit -> Total of 4x 25GBit
- 1x Mellanox ConnectX-3: with a single 100G Port
- 3x Samsung PM983 NVMe 1.92 TB
- 1x Intel DC P4608 NVMe 6.4 TB (gets exported as 2x 3.2 TB to the OS)

My network around the cluster is "quite" redundant, but not fully redundant yet:
L2 and L3 are redundant (MLAG and VRRP) for 10G (two MikroTik CCR2004 and two CRS317).
100G, on the other hand: I only have one MikroTik CRS504 at the moment, so my 100G links are NOT redundant yet.
(The CRS520 is quite expensive and NOT yet considered WAT: Wife-Accepted Technology.) :lol:

In my free time over the last couple of weeks I played with Proxmox VE and Ceph, and I learned a lot.
It's fun to play with and I'd like to push the solution further, but Ceph isn't there yet.
EC pools are inefficient and there is A LOT of overhead going on.
Mirroring is faster, but still not AS FAST as I imagine things could be.
My testing isn't very thorough yet; currently I am hitting the limits of this solution at around 2600-2800 MB/s writes.
(That is using plain Ethernet instead of RoCE.) RoCE support is terrible, from my perspective and with my current understanding.
It makes things slower instead of faster. ¯\_(ツ)_/¯
I set up PFC and DCB, configured lossless Ethernet on the CRS504 and made sure the traffic is in class 3... but still: "horrible performance". (RoCE v1 and v2 perform more or less equally badly in my lab.)
My gut feeling is that driver support for those Mellanox cards could/should improve by an order of magnitude. Also, current Ceph development is focused on improving EC performance and efficiency... I think it will be very interesting to see what Ceph can do in maybe a year.

About my goals:
I'd like to be able to push the hardware in my home lab as close to the edge (performance-wise) as possible.
For the 5 NVMe disks per node: some sort of parity instead of mirroring would be really cool (to increase storage efficiency).
I'd love to use RoCE v2 for the storage sync.
It would be totally sufficient for my lab to run on two of the three servers; I only have three because the air gets very "thin" when it comes to solutions that work great with 2 nodes. So the third node is there for the sake of testing and being able to evaluate different solutions, but it is really not my end goal. (The power bill is already very, very *ouchy*.)
I'd like to use the 100G network as "first choice" for storage sync, but want to have it fail over to my 10G infrastructure in case the 100G network goes down due to... well... the SPOF. (Rough bonding sketch below.)
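Something along these lines on each host is what I have in mind, as a minimal sketch (interface names and addresses are assumptions on my side; whether active-backup across mismatched link speeds behaves well for sync traffic is exactly what I want to find out):

Code: Select all

# /etc/network/interfaces excerpt for the storage sync network
auto bond1
iface bond1 inet static
    address 10.10.10.11/24
    bond-slaves enp59s0 enp94s0f0   # enp59s0 = 100G port, enp94s0f0 = 10G fallback (names are examples)
    bond-mode active-backup
    bond-primary enp59s0            # prefer the 100G link while it is up
    bond-miimon 100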

Let me know what you think. I would be very interested to read what kind of setup you would recommend with your software products and what you would recommend to AVOID doing.

Have a great time!

Kind regards
yaroslav (staff)
Staff
Posts: 3577
Joined: Mon Nov 18, 2019 11:11 am

Tue Dec 31, 2024 5:44 pm

Hi,

Thanks for your thorough description of the system.
I see the ConnectX-3 driver and RoCE support as bottlenecks. iSCSI might be another possible pain point for NVMe disks or SSDs. Parity is good for storage efficiency but will limit performance to that of a single disk, minus the parity penalty.
I'd suggest trying NVMe-oF in the CVM once it is available. You will need an initiator, though. Proxmox and VMware have their own initiators; the Windows Server 2025 one is TCP-based and might not be quite stable (so, take ours :D).

Again, I am mostly worried about OS support for those ConnectX-3 cards.
qwertz
Posts: 38
Joined: Wed Dec 12, 2012 3:47 pm

Thu Jan 02, 2025 5:32 pm

Hi Yaroslav!
Thanks for your response.
"I see ConnectX-3 driver and RoCE support as bottlenecks."
- That's exactly what I fear as well, but I don't intend to give up yet.

Today I finally found some time to get more familiar with your HCI appliance:
- redeployed Proxmox VE 8.3.2 on all three cluster nodes and wiped the remains of Ceph from those 15 NVMe disks
- created a new cluster for this test
- installed 3 CVMs with software version 1.6.578.7343
-- passed through 5 NVMe disks per node to each CVM
-- connected and configured 3 NICs per CVM: management, iSCSI, replication
-- connected the appliances with each other
-- set up two storage pools per CVM: one RAID-Z with the 3 Samsung PM983 and one mirror using the Intel P4608:
[screenshot: Starwind_Storage_pools.png]
-- set up a bunch of volumes to play with
-- set up a LUN and exported it via NVMe-oF... over TCP...
- connected to the LUN from one of the Proxmox hosts (rough commands further below):
[screenshot: nvme_discover.png]
and after connecting to it:
[screenshot: nvme_list.png]
- ran a few tests to see if it's working at all: it is
<I had to paste the results as code instead of screenshots; not sure why the screenshots are causing problems.>
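Roughly what the discovery/connect looked like from the Proxmox host (the IP address and the subsystem NQN below are placeholders, not the real values from my lab):

Code: Select all

# discover the NVMe-oF/TCP target exposed by the CVM
nvme discover -t tcp -a 172.16.20.11 -s 4420
# connect to the exported subsystem (NQN is just an example)
nvme connect -t tcp -a 172.16.20.11 -s 4420 -n nqn.2014-08.org.example:lun1
# the namespace then shows up as a local NVMe device
nvme list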
Write-Test:

Code: Select all

fio --name=write_test --filename=/dev/nvme0n1 --direct=1 --rw=write --bs=1M --size=1G --numjobs=1 --time_based --runtime=30 --group_reporting
write_test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=741MiB/s][w=741 IOPS][eta 00m:00s]
write_test: (groupid=0, jobs=1): err= 0: pid=20619: Thu Jan  2 17:23:58 2025
  write: IOPS=702, BW=703MiB/s (737MB/s)(20.6GiB/30001msec); 0 zone resets
    clat (usec): min=741, max=48823, avg=1328.32, stdev=1535.49
     lat (usec): min=809, max=48954, avg=1421.88, stdev=1535.92
    clat percentiles (usec):
     |  1.00th=[  914],  5.00th=[  979], 10.00th=[ 1004], 20.00th=[ 1037],
     | 30.00th=[ 1074], 40.00th=[ 1090], 50.00th=[ 1123], 60.00th=[ 1156],
     | 70.00th=[ 1188], 80.00th=[ 1254], 90.00th=[ 1418], 95.00th=[ 1713],
     | 99.00th=[ 6521], 99.50th=[10552], 99.90th=[24511], 99.95th=[32900],
     | 99.99th=[41681]
   bw (  KiB/s): min=632832, max=837632, per=99.84%, avg=718605.02, stdev=39688.67, samples=59
   iops        : min=  618, max=  818, avg=701.76, stdev=38.76, samples=59
  lat (usec)   : 750=0.01%, 1000=9.41%
  lat (msec)   : 2=87.32%, 4=1.70%, 10=1.01%, 20=0.41%, 50=0.14%
  cpu          : usr=6.93%, sys=4.25%, ctx=21099, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,21087,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=703MiB/s (737MB/s), 703MiB/s-703MiB/s (737MB/s-737MB/s), io=20.6GiB (22.1GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=91/168075, merge=0/0, ticks=44/147676, in_queue=147720, util=89.44%
Read-Test:

Code: Select all

fio --name=read_test --filename=/dev/nvme0n1 --direct=1 --rw=read --bs=1M --size=1G --numjobs=1 --time_based --runtime=30 --group_reporting
read_test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1651MiB/s][r=1651 IOPS][eta 00m:00s]
read_test: (groupid=0, jobs=1): err= 0: pid=20472: Thu Jan  2 17:23:14 2025
  read: IOPS=1563, BW=1563MiB/s (1639MB/s)(45.8GiB/30001msec)
    clat (usec): min=469, max=15179, avg=638.71, stdev=98.52
     lat (usec): min=469, max=15179, avg=638.81, stdev=98.52
    clat percentiles (usec):
     |  1.00th=[  545],  5.00th=[  578], 10.00th=[  586], 20.00th=[  603],
     | 30.00th=[  611], 40.00th=[  627], 50.00th=[  635], 60.00th=[  644],
     | 70.00th=[  652], 80.00th=[  668], 90.00th=[  693], 95.00th=[  709],
     | 99.00th=[  791], 99.50th=[  832], 99.90th=[  979], 99.95th=[ 1205],
     | 99.99th=[ 5080]
   bw (  MiB/s): min= 1438, max= 1684, per=99.94%, avg=1562.20, stdev=70.52, samples=59
   iops        : min= 1438, max= 1684, avg=1562.20, stdev=70.52, samples=59
  lat (usec)   : 500=0.08%, 750=98.01%, 1000=1.82%
  lat (msec)   : 2=0.07%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=0.35%, sys=6.67%, ctx=46902, majf=0, minf=268
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=46897,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1563MiB/s (1639MB/s), 1563MiB/s-1563MiB/s (1639MB/s-1639MB/s), io=45.8GiB (49.2GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=373790/0, merge=0/0, ticks=150851/0, in_queue=150851, util=94.78%
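
Next on my list is a less gentle run than these single-job, QD1 sequential tests; something along these lines (just a sketch, assuming libaio is available on the host, and keeping in mind that it overwrites the test LUN):

Code: Select all

fio --name=randwrite_qd32 --filename=/dev/nvme0n1 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 --time_based --runtime=60 --group_reporting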

There are a LOT of things I just rushed past to do that quick test... and I do have some questions :)

For example:
What do you recommend for getting faster interfaces than the VirtIO ones?
- Currently all interfaces show a link speed of 10G in the appliance, no matter what interface is behind it: 10G, 25G, 100G...
...probably using SR-IOV somehow, but I haven't figured out how yet...
yaroslav (staff)
Staff
Posts: 3577
Joined: Mon Nov 18, 2019 11:11 am

Fri Jan 03, 2025 4:26 pm

It can still work over TCP, I hope. Theoretically, it might even work over RDMA. The big question is how the hypervisor works with that NIC.
Are those FIO tests on Proxmox?
Returning to your questions: VirtIO shows a link speed of 10 GbE regardless of the hardware underneath (the actual throughput can be checked with concurrent iperf tests). What I am trying to say is that it is a display quirk rather than a performance issue.
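A rough example of what I mean (the IP is a placeholder for the CVM/peer address):

Code: Select all

# on the CVM or the other host
iperf3 -s
# on the Proxmox host: several parallel streams against the peer's storage/replication IP
iperf3 -c 172.16.20.11 -P 8 -t 30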
You could also try passing the NICs through, yet I am not sure if Ubuntu 20 LTS works well with them.
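If you go the passthrough route on Proxmox, the usual pattern is roughly the following (the VM ID and PCI address are placeholders; IOMMU must already be enabled on the host, and pcie=1 requires the q35 machine type):

Code: Select all

# find the NIC's PCI address on the host
lspci | grep -i mellanox
# hand that PCI function to the CVM (VM ID 101 is just an example)
qm set 101 -hostpci0 0000:3b:00.0,pcie=1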