Read slower than write - help me find the bottleneck


dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Sun Apr 30, 2017 8:04 pm

Hi all

I'm trying to max out performance in a simple setup, but it seems I'm hitting some sort of bottleneck that I cannot find.
The setup consists of two servers - a VMware ESXi 6.5 host and a StarWind server - connected back-to-back using a single 10 GbE link. Specs are listed at the bottom of this post.
The iSCSI store is placed on a RAID5 (64KB stripe size) with 8 x Intel DC S3500 SSDs. Both the HW RAID controller cache and the StarWind cache are disabled; enabling the StarWind cache apparently worsens performance for some reason.
Jumbo frames are enabled.

When I run ATTO from within a VMware guest, it is able to saturate the 10 GbE iSCSI link when writing (VMware to StarWind), but reads are slower and unable to saturate the link (StarWind to VMware).
Only one VMware guest is located on the RAID array to ensure that no outside load is placed on the array during testing.

I believe the bottleneck is on the StarWind server, as I was able to dramatically increase both read and write performance by changing the NIC settings profile there from "Standard Server" to "Low Latency".

Setup for the Starwind server is as follows:
Software:
OS: Server 2016
SW: StarWind Virtual SAN v8.0.0 (Build 10833, [SwSAN], Win64) (free edition)
Hardware:
1 x Intel Xeon X5670
24GB RAM
Intel X520 10GbE NIC (1 link dedicated for Starwind)
LSI 9265-8i RAID controller

VMware ESXi:
The hardware specs for the VMware ESXi server are identical to the StarWind server's.

ATTO benchmark directly on the RAID array on the Starwind Server:
local storage.PNG
ATTO benchmark from within the VMware guest when using the Low Latency profile on the NIC (best performance I'm able to achieve) - 9K frames:
profile-lowlatency.PNG
ATTO benchmark from within the VMware guest when using the Standard Server profile on the NIC - 9K frames:
Note that both read and write are significantly lower here
profile-standardserver.PNG
I'm not complaining about performance - I would just like to be able to saturate the 10 GbE link in both directions as I think it should be possible. :)

Are there any tweaks within StarWind to increase performance in an all-flash setup, or any other ideas as to where I should look to get the last bit of juice out of this setup?
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Wed May 03, 2017 9:59 pm

Hello dfir,

We can see that write performance is limited by the 10 GbE NIC.
To understand where the bottleneck for read operations is, please give us:
- the RAID configuration details,
- how the StarWind storage is configured in the VMware guest.

I am looking forward to hearing back from you.
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Thu May 04, 2017 9:10 am

Hi Vitaliy

Thank you for your answer.

The RAID is made of:

An LSI 9265-8i controller (newest firmware) with 8 x Intel DC S3500 300GB SSDs attached directly to it.
One RAID5 array with a 64K stripe size and all caching disabled. Enabling caching seemingly makes performance worse.
raid config.JPG
The iSCSI connection between the StarWind server and the VMware host consists of one directly connected link dedicated to iSCSI traffic only.
On the VMware host the iSCSI connection is made using the software iSCSI adapter, where I added the StarWind IP address for dynamic discovery of the target.
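
For reference, the same discovery target can be added from the ESXi shell roughly like this (the vmhba name below is just an example):

Code:

# add the StarWind server as a dynamic discovery (send targets) address
esxcli iscsi adapter discovery sendtarget add -A vmhba64 -a 192.168.255.4:3260
# confirm the target portal is listed
esxcli iscsi adapter discovery sendtarget list
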
On the StarWind server I have created a thick-provisioned 1TB store with all caching disabled, plus a 128GB store with a 1GB read cache enabled for testing purposes.
starwind.JPG

I have created a VMFS6 datastore on the iSCSI target and moved only a single VM guest, which I use for testing, to the datastore.

Is there anything specific you are looking for in terms of the iSCSI setup on the VMware side?

Best regards
dfir
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Sun May 07, 2017 8:12 pm

I stand corrected :) I've managed to get additional performance out of the ESXi host, so I no longer believe the bottleneck is on the StarWind/Windows server.
My theory now is that the bottleneck is either in the VMware software iSCSI adapter, in the physical NIC (perhaps improper RSS / Receive Side Scaling), or a combination of the two.

By the way - my test guest runs Windows Server 2012 R2 with 2 vCPUs, 4GB memory, a VMXNET3 NIC and the VMware Paravirtual SCSI controller. VMware Tools is updated to the latest version.

I've tried the following:

Updated the X520 NIC driver (ixgbe) on the ESXi host from version 3.7.13.7.14iov-NAPI to version 4.5.1-iov
  • The update did not show any improvement in iSCSI throughput at all

    Code:

    [root@ESXi01:~] ethtool -i vmnic3
    driver: ixgbe
    version: 3.7.13.7.14iov-NAPI
    firmware-version: 0x18f10001
    bus-info: 0000:04:00.0
    
    [root@ESXi01:~] ethtool -i vmnic3
    driver: ixgbe
    version: 4.5.1-iov
    firmware-version: 0x18f10001
    bus-info: 0000:04:00.0
    
Disabled interrupt moderation on the vmnic dedicated to iSCSI (by setting InterruptThrottleRate to 0; see the command sketch after the screenshot below)
  • This is what gave the throughput improvement - roughly 100 MB/s (from 860 MB/s to 960 MB/s)

    Code:

    [root@ESXi01:~] esxcli system module parameters list -m ixgbe
    Name                   Type          Value  Description
    ---------------------  ------------  -----  -------------------------------------------------------------------------------------------------------------------
    AtrSampleRate          array of int         Software ATR Tx packet sample rate
    CNA                    array of int         Number of FCoE Queues: 0 = disable, 1-8 = enable this many Queues
    DRSS                   array of int         Number of Device Receive-Side Scaling Descriptor Queues, default 0=disable, 1- DRSS, IXGBE_MAX_RSS_INDICES = enable
    FdirPballoc            array of int         Flow Director packet buffer allocation level:
                            0 = disable perfect filters (default)
                            1 = 2k perfect filters
    
    IntMode                array of int         Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2
    InterruptThrottleRate  array of int  0      Maximum interrupts per second, per vector, (956-488281), default 16000
    InterruptType          array of int         Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default IntMode (deprecated)
    LLIEType               array of int         Low Latency Interrupt Ethernet Protocol Type
    LLIPort                array of int         Low Latency Interrupt TCP Port (0-65535)
    LLIPush                array of int         Low Latency Interrupt on TCP Push flag (0,1)
    LLISize                array of int         Low Latency Interrupt on Packet Size (0-1500)
    LLIVLANP               array of int         Low Latency Interrupt on VLAN priority threshold
    LRO                    array of int         Large Receive Offload (ESX5.1 or later): (0,1), default 0 = off
    MDD                    array of int         Malicious Driver Detection: (0,1), default 1 = on
    MQ                     array of int         Disable or enable Multiple Queues, default 1
    RSS                    array of int         Number of Receive-Side Scaling Descriptor Queues, default 1-16=enables 4 RSS queues
    VEPA                   array of int         VEPA Bridge Mode: 0 = VEB (default), 1 = VEPA
    VMDQ                   array of int         Number of Virtual Machine Device Queues: 0/1 = disable (1 queue) 2-16 enable (default=8)
    heap_initial           int                  Initial heap size allocated for the driver.
    heap_max               int                  Maximum attainable heap size for the driver.
    max_vfs                array of int         Number of Virtual Functions: 0 = disable (default), 1-63 = enable this many VFs
    skb_mpool_initial      int                  Driver's minimum private socket buffer memory pool size.
    skb_mpool_max          int                  Maximum attainable private socket buffer memory pool size for the driver.
    
esxi-intmod-disabled-run1.PNG
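
    For completeness, the interrupt moderation change itself is just the usual module parameter tweak, roughly along these lines (followed by a reboot of the host):

    Code:

    # set the ixgbe module parameter string; if several parameters are needed, combine them in one -p string
    esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"
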
Disable "Delayed ACK" on the VMware Software iSCSI Adapter
  • No improvement - same result
    What do Starwind recommend in regards to this setting? I can see that there are varying recommandations, but it seems generally recommended to disable delayed ACK
Tried to fiddle around with the RSS settings
  • I was unable to get any improvement here, but I'm uncertain whether I did it correctly, and I don't know how to verify if RSS is even active or being used properly. <-- Any tips here would be GREATLY appreciated! ;)
    Tried the following:

    Code:

    esxcli system module parameters set -m ixgbe -p "RSS=4,0"
    esxcli system module parameters set -m ixgbe -p "RSS=8,0"
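
    After a reboot, one thing I can at least check is whether the parameter value persisted (this does not tell me whether the extra RSS queues are actually in use, which is the part I'm unsure about):

    Code:

    # show the current ixgbe module parameters and filter for the RSS entries
    esxcli system module parameters list -m ixgbe | grep RSS
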
Do any of you have suggestions as to what I can try next?
Thank you for your help so far.

Best regards
dfir
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Fri May 12, 2017 4:47 pm

Hello dfir,

Your RAID configuration looks to be in line with our recommendations.

1. Could you confirm that the MTU is also set to 9000 in VMware?
2. Please watch this video with VMware tweaks if you have not seen it already:
https://www.youtube.com/watch?v=2_-TIpS ... .be&t=1806
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Wed May 17, 2017 9:15 pm

Vitaliy (Staff) wrote:Hello dfir,

Your RAID configuration looks to be in line with our recommendations.

1. Could you confirm that the MTU is also set to 9000 in VMware?
2. Please watch this video with VMware tweaks if you have not seen it already:
https://www.youtube.com/watch?v=2_-TIpS ... .be&t=1806
Hi Vitaliy

Thank you for the answer. Jumbo frames have been confirmed:

Code:

C:\Users\adm>ping -f -l 8972 192.168.255.34 -S 192.168.255.4

Pinging 192.168.255.34 from 192.168.255.4 with 8972 bytes of data:
Reply from 192.168.255.34: bytes=8972 time<1ms TTL=64
Reply from 192.168.255.34: bytes=8972 time<1ms TTL=64
Reply from 192.168.255.34: bytes=8972 time<1ms TTL=64
Reply from 192.168.255.34: bytes=8972 time<1ms TTL=64

Code:

[root@ESXi01:~] vmkping -d -s 8972 192.168.255.4
PING 192.168.255.4 (192.168.255.4): 8972 data bytes
8980 bytes from 192.168.255.4: icmp_seq=0 ttl=128 time=0.294 ms
8980 bytes from 192.168.255.4: icmp_seq=1 ttl=128 time=0.477 ms
8980 bytes from 192.168.255.4: icmp_seq=2 ttl=128 time=0.490 ms
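
On the ESXi side the MTU values can also be checked directly from the shell (just listing the commands here, not the output):

Code:

# MTU of the VMkernel interfaces
esxcli network ip interface list
# MTU of the standard vSwitches
esxcli network vswitch standard list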


I watched the video you linked to (good one btw) and tried applying the VMware "fix" that applied to my setup, which was setting Disk.DiskMaxIOSize to 512 instead of the default 32767.
After changing it I rebooted the ESXi server and tried testing again, but the results were the same - no change in throughput.
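
For reference, the advanced option can be set and verified from the ESXi shell along these lines (the value is in KB):

Code:

# lower the maximum I/O size ESXi passes down to the storage stack (in KB)
esxcli system settings advanced set -o /Disk/DiskMaxIOSize -i 512
# verify the current and default values
esxcli system settings advanced list -o /Disk/DiskMaxIOSize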

The two other tweaks - multipathing and disk IOPS - are not applicable to my setup, as ESXi has only a single connection to StarWind.
Michael (staff)
Staff
Posts: 317
Joined: Thu Jul 21, 2016 10:16 am

Fri May 26, 2017 11:51 am

Hello dfir,
I am sorry for the delay in our response.
Could you please clarify what type of disk you have created on the ESXi node? It is recommended to create a thick provisioned eager zeroed disk for the VM. Also, please make sure the VM has enough CPU and RAM to handle the storage requests.
Thank you!
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Mon Jul 03, 2017 2:26 pm

Hi Michael

Sorry for the delay.

All of my VMs use Thick Provisioned Lazy Zeroed disks.
I have previously tried allocating extra (lots of) CPU and memory, and it made no difference. Also note that when I test against a RAM disk device in StarWind, the performance is the same.
I suspect it is perhaps I/O related on the VMware host, but I have no idea where to look to see whether I'm hitting a performance ceiling of some kind...
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Mon Jul 03, 2017 7:16 pm

Hello dfir,
dfir wrote: All of my VMs use Thick Provisioned Lazy Zeroed disks.
We do recommend using Eager Zeroed. This option can boost your performance.
Is there any chance you can use it?
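If recreating the disk is not convenient, an existing lazy zeroed VMDK can usually be converted in place with vmkfstools while the VM is powered off, for example (the path below is just an illustration):

Code:

# convert a lazy zeroed thick disk to eager zeroed thick
vmkfstools -k /vmfs/volumes/<datastore>/<vm>/<vm>.vmdk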
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Tue Jul 04, 2017 7:03 pm

Yes, no problem. I will do a test and let you know the result. :)
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Tue Jul 11, 2017 5:49 pm

Hello dfir,

Did you test your environment with Eager Zeroed?
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Mon Sep 11, 2017 10:44 am

Vitaliy (Staff) wrote:Hello dfir,

Did you test your environment with Eager Zeroed?
Hi Vitaliy

I'm sorry for the late reply. I have tried switching my test guest to an eager zeroed thick provisioned VMDK, but the result is the same.
I also tried updating the StarWind software to the newest version available (V8 build 11456, 08 August 2017), which did not make any difference either.

As of now, the performance is still as shown below.
I would really like to track down why I cannot saturate the link in the read direction (from StarWind to the VM guest).
perf-v8-b11456.JPG
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Tue Sep 12, 2017 8:33 pm

Hello dfir,

Could you please open a case with support@starwind.com and title it "Performance issue, forum request"?
I think we need to do a remote session and look through your configuration.
Thank you in advance.
dfir
Posts: 10
Joined: Sat Apr 29, 2017 6:08 pm

Tue Sep 19, 2017 11:22 am

I've created a support case and scheduled a remote session for this Thursday. :)
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Tue Sep 19, 2017 2:18 pm

dfir,
Thanks for keeping us updated.