New VSAN Install, Poor iSCSI Performance

jeffgeno
Posts: 6
Joined: Tue Oct 08, 2019 6:10 pm

Tue Oct 08, 2019 6:29 pm

I've just installed VSAN on a two-node Windows Server 2019 Hyper-V cluster. This is the trial version with the full GUI and feature set. Each server has two 1Gbps host network connections and two 10Gbps direct connections to the other node. VSAN is set to replicate over the two 10Gbps connections and to use the host network for management and heartbeat. All storage is SSD in RAID 5 on a Dell H730 controller.

Creating a storage device and an HA replica worked correctly. The 8 TB device completed a full synchronization in an hour and 45 minutes, which is just about what you'd expect at 10 Gbps (8 TB is roughly 64,000 Gb, so about 6,400 seconds at line rate). I was able to connect iSCSI sessions with MPIO to both the loopback interface and the two sync interfaces.
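For anyone comparing setups, the session layout can be double-checked from PowerShell with the built-in iSCSI module (just a quick sketch; property names as shipped with Windows):

Code:
# List each iSCSI session and the initiator portal it was created from
Get-IscsiSession |
    Select-Object TargetNodeAddress, InitiatorPortalAddress, IsConnected, IsPersistent

# Show the underlying TCP connections (initiator IP -> target IP and port)
Get-IscsiConnection |
    Select-Object InitiatorAddress, TargetAddress, TargetPortNumber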

When I copy a file to the new Cluster Shared Volume, speeds top out at 220 MB/s. When I copy from server to server directly over the 10Gbps connection, it runs at 900 MB/s. When I copy internally from folder to folder on the same disk, it's 1300 MB/s.

Is there anything I should have done beyond the defaults to get better performance?
jeffgeno
Posts: 6
Joined: Tue Oct 08, 2019 6:10 pm

Tue Oct 08, 2019 7:27 pm

I just did another test. It's actually poor CSV performance, and only when accessing the CSV through the mount point under C:\ on the host. It's very strange.

I connected the iSCSI volume directly to one of the hosts and copied a file to it. That came across at around 900 MB/s.

When I add the volume to Cluster Shared Volumes and copy a file to it at C:\ClusterStorage\Volume1, the speed is about 250 MB/s.

However, if I copy a file on that very same server to \\localhost\c$\ClusterStorage\Volume1, the speed picks right back up to 900 MB/s.

So I'm very confused about what this might be. It seems to be a Microsoft issue with Server 2019 Failover Clustering.
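One thing I still want to rule out is CSV redirected I/O, since that would explain a drop like this. A quick sketch (the CSV name is a placeholder):

Code:
Import-Module FailoverClusters

# Per-node I/O mode for the CSV: Direct, FileSystemRedirected or BlockRedirected
Get-ClusterSharedVolumeState -Name "Cluster Disk 1" |
    Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason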
batiati
Posts: 29
Joined: Mon Apr 08, 2019 1:16 pm
Location: Brazil
Contact:

Wed Oct 09, 2019 1:59 pm

You may want to check whether the host is the owner of this CSV. When a host writes to a CSV it does not own, a small performance penalty is expected because metadata has to travel over the cluster network in addition to the iSCSI traffic, especially when the CSV uses ReFS instead of NTFS.

If that turns out to be the case, your cluster network is likely the bottleneck.
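Something along these lines will show the current owner and let you move it for a test (a sketch; the CSV and node names are placeholders):

Code:
# Which node owns each CSV right now?
Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State

# Move ownership to the node you are copying from, then retest
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node "NODE1"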

Regards
jeffgeno
Posts: 6
Joined: Tue Oct 08, 2019 6:10 pm

Wed Oct 09, 2019 7:03 pm

It only happens when writing to the CSV (which is NTFS), and the owner node doesn't seem to make a difference. Reads happen at the expected speed, but writes are only 1/3 the speed of a non-CSV volume, no matter which node is the owner.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Thu Oct 10, 2019 7:57 am

Can you describe your configuration in more detail? Particularly interesting are the following settings:
- number of sessions connecting targets to the hosts (local + partner);
- MPIO policy used for the CSV in question;
- when transferring the file, is the cluster network (management link) saturated?
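
If it helps, all three can be pulled quickly from PowerShell (a rough sketch; run the last one while the copy is in progress):

Code:
# Sessions per target (loopback + partner)
Get-IscsiSession | Group-Object TargetNodeAddress | Select-Object Name, Count

# MPIO disks and their current load-balance policy
mpclaim.exe -s -d

# Per-NIC throughput while the file copy is running
Get-Counter '\Network Interface(*)\Bytes Total/sec' -SampleInterval 2 -MaxSamples 10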
jeffgeno
Posts: 6
Joined: Tue Oct 08, 2019 6:10 pm

Thu Oct 10, 2019 4:10 pm

There are three sessions to the targets, one local and two remote through each of the 10Gbps connections.

The MPIO policy is failover, with the local target the primary and the others as standby.

These servers are brand new and entirely unused, so there is no network bandwidth being used for anything except my testing.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Oct 11, 2019 7:39 am

Connect the cluster shared volume with 2 (better) or 3 (still possible) loopback sessions, keep the partner sessions as is and set the target's MPIO policy to Least Queue Depth. For the majority of systems, this gives better performance.
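
A rough sketch of what that looks like in PowerShell (the target IQN and the MPIO disk number are placeholders, adjust to your environment):

Code:
# Add an extra loopback session to the same target (repeat once more for a third session)
Connect-IscsiTarget -NodeAddress "iqn.2008-08.com.starwindsoftware:node1-csv1" `
    -TargetPortalAddress 127.0.0.1 -InitiatorPortalAddress 127.0.0.1 `
    -IsMultipathEnabled $true -IsPersistent $true

# Make Least Queue Depth the default MPIO policy for new iSCSI LUNs...
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy LQD

# ...or apply it (policy 4 = LQD) to an MPIO disk that is already claimed, e.g. disk 0
mpclaim.exe -l -d 0 4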
jeffgeno
Posts: 6
Joined: Tue Oct 08, 2019 6:10 pm

Fri Oct 11, 2019 2:39 pm

Creating multiple loopback sessions actually made it worse. Switching from failover to LQD didn't make a huge difference. Here are my test results.
Control Test: VM on Local Storage

[screenshot]

Failover, One Loopback Connection

[screenshot]

LQD, One Loopback

[screenshot]

LQD, Three Loopbacks

[screenshot]
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Wed Oct 16, 2019 2:43 pm

Can you try running the tests with a larger file size? Something like dozens of GB instead of 256 MB.
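
For example, something along these lines with DiskSpd gives a 50 GB sequential write run with caching out of the picture (the path is a placeholder):

Code:
# 50 GB file, 60 s of sequential 512 KB writes, 4 threads, 8 outstanding I/Os, caching disabled
.\diskspd.exe -c50G -d60 -w100 -b512K -t4 -o8 -Sh -L C:\ClusterStorage\Volume1\testfile.dat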
jeffgeno
Posts: 6
Joined: Tue Oct 08, 2019 6:10 pm

Thu Oct 17, 2019 3:12 pm

I'm working with StarWind Support directly on this one. They had me stop using the sync channels for iSCSI traffic. It didn't make a huge difference, but the Microsoft DiskSpd tool shows better performance when the file and block sizes are very large. There must be a lot of overhead in both iSCSI and CSVs for smaller transfers.
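
The comparisons have been roughly along these lines (exact parameters varied; the paths are placeholders):

Code:
# Small random writes vs large sequential writes against the same CSV
.\diskspd.exe -c20G -d60 -w100 -r -b4K -t4 -o8 -Sh -L C:\ClusterStorage\Volume1\test.dat
.\diskspd.exe -c20G -d60 -w100 -b1M -t4 -o8 -Sh -L C:\ClusterStorage\Volume1\test.dat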
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Mon Oct 21, 2019 10:59 am

Once you get to the root cause of the issue, feel free to update the community.