Good morning,
This is probably related to the hardware or to the RoCEv2 network configuration, but I am not sure how to debug it further.
Setup:
-------
Linux machine:
Debian 10.8, Supermicro Xeon platform
Linux mythmaster 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux
Kernel drivers for ConnectX-5
NVMe-oF target from the kernel
ConnectX-5 CX556A-ECAT, latest firmware, in an x16 PCIe 3 slot
Windows machine:
Windows 10 Pro 20H2, AMD Ryzen 3600, X570 platform
Mellanox WinOF-2 2.60
Latest NVMe-oF initiator from StarWind
ConnectX-5 CX556A-ECAT, latest firmware, in an x4 PCIe 4 slot running at PCIe 3
The two cards are connected directly to each other, with no switch in between.
Testing:
---------
StarWind rping runs for minutes with -V (verify) without error.
I am unsure how to use StarWind rperf correctly to stress the link, but it seems to work with every setting I gave it.
RDMA counters on Windows seem to show correct behavior and no dropped RDMA frames; however, I am no expert, and information on how to diagnose this is thin.
Failing test software:
----------------------------
ATTO 4.01.0f1 with Direct I/O enabled (works without Direct I/O)
It runs for several tests and then fails with the following message in the Windows event log:
Example error: The IO operation at logical block address 0x7835ec28 for Disk 3 (PDO name: \Device\000000aa) failed due to a hardware error.
Linux shows no messages at all, so I assume the fault is somewhere on the Windows side. Thoughts?
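One way to check the Linux side beyond the kernel log would be to compare the adapter's RDMA counters before and after a failing ATTO run. The following is only a minimal sketch, assuming the usual sysfs layout /sys/class/infiniband/<device>/ports/<port>/counters and hw_counters. The function names and the state-file argument are made up for illustration, and which counters exist (for example out_of_sequence or local_ack_timeout_err) depends on the driver and firmware.

```python
import json
import sys
from pathlib import Path


def read_counters(root="/sys/class/infiniband"):
    """Return {"device/port/group/name": value} for every readable counter.

    Assumes the common sysfs layout; returns {} if the directory is missing.
    """
    snapshot = {}
    base = Path(root)
    if not base.is_dir():
        return snapshot
    for group in ("counters", "hw_counters"):
        for path in base.glob(f"*/ports/*/{group}/*"):
            try:
                value = int(path.read_text().strip())
            except (OSError, ValueError):
                continue  # skip unreadable or non-numeric entries
            dev, _, port = path.relative_to(base).parts[:3]
            snapshot[f"{dev}/{port}/{group}/{path.name}"] = value
    return snapshot


def changed(before, after):
    """Return only the counters whose value differs, as name -> delta."""
    return {
        name: after[name] - before.get(name, 0)
        for name in sorted(after)
        if after[name] != before.get(name, 0)
    }


# Hypothetical usage: the first run with a file name saves a snapshot,
# the second run with the same file name prints what changed since then.
if __name__ == "__main__" and len(sys.argv) == 2:
    state = Path(sys.argv[1])
    now = read_counters()
    if state.exists():
        for name, delta in changed(json.loads(state.read_text()), now).items():
            print(f"{name}: {delta:+d}")
    else:
        state.write_text(json.dumps(now))
```

Run it once with a file name before starting ATTO and again with the same file name after the failure; delete the file to start over. Error or retry counters that move only during the failing run would point at the link rather than at the Windows storage stack, while flat counters would leave the initiator side as the more likely suspect.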