starwind, starport and performance

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

Mark

Wed Jun 30, 2004 3:54 pm

I've been testing starwind (under Windows 2003) and starport (under 2003 and XP) and have noticed a performance drop when accessing a starwind IMG compared with accessing the underlying disk directly. Of course, this is to be expected (driver overhead, network speed and conditions, etc).

Forgetting about network speed, etc... what would you say the performance hit is for the starwind/starport drivers on their own? How does this compare with SMB (CIFS) file sharing?

One other quick question, if I may: Is there a way to modify the starwind configuration without restarting the service? I don't think the machines running starport with disks mounted like it very much when I restart the service while they are using the disk.

Thanks,

Mark Basco

Vancouver, BC, Canada
Guest

Wed Jun 30, 2004 4:31 pm

Mark,

1) performance drops are a function of two input parameters:

- network latency. For good quality GbE it should not be an issue at all.

- the way IMG files are processed. AFAIK we disable write caching for IMG files now. In principle the whole RAM of the iSCSI target machine could be used as a huge disk cache, resulting in dramatically increased write speed (limited only by network wire speed) and a very high cache hit ratio for subsequent reads. However, this is very unsafe! If the PSU on the iSCSI target machine fails, a lot of data would be lost. The NTFS lazy writer has no idea it is talking to a hard disk with a gigabyte cache buffer :)

When we were experimenting with FC arrays (4*10K rpm disks in RAID zero -- stripe) the performance drop for "networking" was around 10-15% (quite reasonable numbers, aren't they?). I'll re-run the tests ASAP to check things once again -- maybe something got broken. Can you share your own test results with us? IOmeter, if you please?

2) With StarPort running against StarWind (different target and initiator software can show very different results) we were able to get as much as 155 MBps of sustained transfer rate (reads + writes). SMB over the same GbE connection (CSA gigabit Ethernet on both sides, point-to-point hubless configuration) managed only around 20-25 MBps of file transfers. Network utilization for SMB just sucks -- it will never reach raw SAN speed in any configuration. Just forget it!

BTW, when we move StarWind into the kernel (we're busy with this right now) I'd expect something like 175-180 MBps on the existing hardware (full-duplex, of course).

3) Yes, there is a way to control StarWind over telnet and with our own GUI application. Please write to support@starwind.com and I'll ask our engineers to send you the telnet configuration user manual (latest draft) and the configurator GUI (currently in beta, expected to be released in 4 weeks or so).

Thank you! :)
Codiak6335

Fri Aug 06, 2004 5:13 am

I've started to benchmark my config.
Currently, I'm using two machines:

Initiator: StarPort on a 3.0 GHz HT Pentium 4 with 1 GB RAM, Linksys 10/100/1000 NIC (full duplex, locked at 1000)
Target: StarWind on an Athlon 64 3200+ with 512 MB RAM, 2 SATA drives in a RAID 0 config, Linksys 10/100/1000 NIC (full duplex, locked at 1000)



Any input, suggestions, or documentation that could speed up the process would be most welcome. :P


Outstanding questions/todo:
Network utilization is not exceeding 25%, yet I'm only seeing 45.7 MB/sec on the remote system.

Local drive testing on the target showed 160 MB/sec, so the iSCSI result of 45.7 MB/sec seems disappointing.

IOzone and HDTach were used for testing... I should switch to Bonnie++... (I need to stay with a cross-platform utility, as a Linux target will be compared as well)

Linux 2.6 64bit
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Aug 06, 2004 12:50 pm

First of all, GbE cannot pump more than ~117 MB/sec on a completely loaded data pipe. Thus it does not matter how fast your local drive is: you'll never see more than 110-115 MB/sec going in the same direction. Second, how fast is your GbE card itself? Can you run NTttcp and IPerf? Because if the card sits on a 32-bit/33MHz PCI bus you'll never see more than 50-60 MB/sec. If you wish to stick with Athlon 64 I'd recommend an nVidia nForce3 250Gb based mobo with GbE embedded in the south (?) bridge (bypassing the PCI bus), close to what Intel does with the CSA stuff on i875/i865 machines.
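
Something like this should do for a quick raw-TCP check (just a sketch, assuming iperf 2.x on both boxes; substitute your target's real address for 192.168.0.10):

    rem on the StarWind (target) machine: run iperf as the server
    iperf -s

    rem on the StarPort (initiator) machine: push TCP traffic for 30 seconds
    rem -w sets the TCP window size, -i prints an interim report every 5 seconds
    iperf -c 192.168.0.10 -w 64K -t 30 -i 5

If the numbers here are already stuck at 50-60 MB/sec, the PCI bus (and not StarWind/StarPort) is your bottleneck.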

We've used IOmeter and HDTach, in case you care. IOmeter can be used to emulate heavy loads; HDTach is far from perfect from this point of view.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Codiak6335

Fri Aug 06, 2004 6:22 pm

Anton,
I must admit to missing the 32-bit nature of the NIC (smacks self). I've been in the 64-bit world so long I take it for granted.

As for the expected throughput, I don't expect 160 MB/s -- just trying to get as close as possible. I can always add more NICs later.

I am confused though... earlier in the thread it was stated "
2) With StarPort running against StarWind (different target and initiator software can show very different results) we were able to get as much as 155 MBps of sustained transfer rate (reads + writes). SMB over the same GbE connection (CSA gigabit Ethernet on both sides, point-to-point hubless configuration) managed only around 20-25 MBps of file transfers. Network utilization for SMB just sucks -- it will never reach raw SAN speed in any configuration. Just forget it!

BTW, when we move StarWind into the kernel (we're busy with this right now) I'd expect something like 175-180 MBps on the existing hardware (full-duplex, of course).


How were the stated results achieved? ;)
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Aug 06, 2004 8:08 pm

Bus width has nothing to do with CPU addressing bits. My Sun UltraSPARC, for example, is a true 64-bit machine yet has PCI 32-bit/33MHz expansion slots. The same is true in your case: most (if not all) Athlon 64 based mobos have only a single 32-bit/33MHz PCI bus. So the 132 MBps PCI wire speed (theoretical; not more than 100 MBps in practice) is shared between all of the PCI devices. That's why nVidia and Intel moved GbE NICs off the PCI bus in the last generation of pre-PCI-E boards. Our test 32-bit/33MHz 3Com cards cannot push more than 50-60 MBps over the wire (Alex could correct me, as these are his test beds).

The 155 MBps of STR I mentioned was achieved with CSA-based GbE controllers (i875P "Bonanza" boards) on both sides, a native RAM disk configured on the iSCSI target, and IOmeter in a full-duplex configuration (mixed reads and writes in a 50/50 ratio) with a very deep I/O queue.

I would really recommend you run the native TCP tests I pointed to (I can even send you the binaries and pre-configured execution scripts if you give me your personal e-mail address). That way you'll be able to determine with 100% certainty how fast your storage-over-IP link can work.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Guest

Fri Aug 06, 2004 8:57 pm

I intend to run the network tests tonight. Taking your suggestion and doing some more research I'm convinced the bus is the limiting factor.

My 64bit comment wasn't related to the commodity world of Intel and Amd ;)
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Sat Aug 07, 2004 9:35 am

Just go ahead! It would be really nice if you'd find some time to share your results with us :)
Anonymous wrote:I intend to run the network tests tonight. Taking your suggestion and doing some more research I'm convinced the bus is the limiting factor.

My 64bit comment wasn't related to the commodity world of Intel and Amd ;)
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Codiak6335

Sun Aug 08, 2004 7:34 am

Couldn't find a local vendor with an nForce3 motherboard, and the only Intel-manufactured MB was an i865 with 10/100.

I checked with Asus on their P4P800 SE MB with integrated GbE, and set up another MB with embedded GbE. All things are NOT the same... :x

Using the RAM drive I'm getting 56.8 MB/sec; with the hard drives, 45.6 MB/sec.

iperf is showing ~340 Mb/s at 8k windows.... ~600 Mb/s at 512k windows.
And I'm capped...

I found one forum thread that leads me to believe that the Asus MB didn't implement CSA, but runs the traffic across the PCI bus. The Marvell chipset that Asus used also doesn't allow me to lock at 1Gb Full Duplex.

What my other machine is doing is a mystery as well.

I may just order a true Intel MB and a Gigabyte nForce3 on Monday.
Definitely a pain in the arse trying to maximize throughput.

--------------------------------------------------------------------------------
On the flip side... I mapped a network share and benched it at 25 MB/s, so I'm convinced there's gold at the end of this rainbow.
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Sun Aug 08, 2004 4:41 pm

Well... First of all, not every nForce3 mobo would work for you. The older 150 models do not have GbE integrated and also have a slow 600 MHz HyperTransport link, making them a "worst buy" for Athlon 64. The current nForce3 250 without the "Gb" suffix has only a 10/100 Mb MAC, so you need to look for the exact match: "nForce3 250Gb".

Yes, the ASUS boards have a PCI 3Com/Marvell controller rather than CSA (I have no idea why ASUS settled on the rather expensive 3Com controller instead of using a PHY chip paired with Intel's CSA -- some marketing thing, I guess...).

On my machines both IPerf and NTttcp show around 980-985 Mbits. Also, we'll soon publish our recommendations on how to "tune" TCP stack settings for SAN environments -- the default configuration is far from perfect.

Yes, file-level access is limited to 20-25 MBps with the MS SMB implementation. Block-level access together with NTFS can go up to 60 MBps for the same set of files (it depends on file sizes -- larger ones seem to show better results, an excellent choice for media streaming). Just by switching NAS -> SAN.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Codiak

Wed Nov 24, 2004 8:38 am

Anton,
I finally got around to building a system that is showing ~950Mb a second via iperf. It takes playing with the tcp_window_size but you know this ;)

You mentioned that you were going to publish a tuning guide and I was wondering if it is available. I'd rather not re-invent the wheel.


Thanks
Chuck

PS: Machine A is a P4-2.6ht(865) Machine B is an AMD64 3200 (nForce3)
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Thu Nov 25, 2004 8:29 am

Congratulations! :-) You've now nearly duplicated the test machines I'm running right now. I also have identical IPerf and NTttcp results.

Unfortunately we were not able to catch up with the documentation. The guys are busy working on releasing the completely redesigned major docs closer to the end of this year. I hope this time the "performance optimizations guide" will not be rescheduled. However, it looks like you do not need it any more :-)
Codiak wrote:Anton,
I finally got around to building a system that is showing ~950Mb a second via iperf. It takes playing with the tcp_window_size but you know this ;)

You mentioned that you were going to publish a tuning guide and I was wondering if it is available. I'd rather not re-invent the wheel.


Thanks
Chuck

PS: Machine A is a P4-2.6ht(865) Machine B is an AMD64 3200 (nForce3)
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Codiak

Sat Nov 27, 2004 11:22 am

I thought I'd share some of my performance test results.
Though the changes seem minor, they do yield some nice results.

Test systems

Client (StarPort)
- Pentium 4 2.6
- Intel D865BF motherboard
- Intel Gigabit NIC
- Windows XP Pro
- Maxtor Raptor SATA drive as root
- HDS 200GB ATA-100

Server (StarWind)
- AMD Athlon 64 3200+
- MSI K8N Neo Platinum
- nVidia nForce3 NIC
- Windows XP Pro
- RAID 0, 2 Maxtor 80GB SATA drives as root
- RAID 0, 2 Seagate Barracuda 160GB SATA drives (Image0)

The network was a simple crossover cable, eliminating the overhead of a switch.

IPerf and HD Tach were used for tuning.

I started with fresh installs for both machines, and left the default network settings.

Using IPerf to begin testing the network, I found I was only transmitting at 250 Mb per second. At a quarter of the expected throughput, I didn't bother with HD Tach.

On both machines I enabled jumbo packets. Intel's setting was 9014 and also offered a setting near 16k... I used the 9014 setting, as nVidia for some reason only offered a 9000-byte configuration.

This resulted in a throughput of 320Mb/second.

Next I changed a setting on the NVidia card to maximize for throughput instead of the default (CPU).

The result was a 33% increase in CPU usage but a throughput of 508 Mb/sec.

Not ready to quit, I played with the TCP window size (the -w command-line option for iperf); the results are below, and a sweep script sketch follows the list.
- at 8k: throughput 508 Mb, CPU 30% (default)
- at 16k: throughput 626 Mb, CPU 35%
- at 32k: throughput 870 Mb, CPU 48%
- at 64k: throughput 939 Mb, CPU 60%
- at 128k: throughput 940 Mb, CPU 68%
- at 256k: throughput 942 Mb, CPU 72%
- at 512k: throughput 942 Mb, CPU 74%
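
(For anyone repeating this, the sweep is easy to script. A rough batch-file sketch, with 192.168.0.10 standing in for the target's address:)

    rem sweep.bat -- 30-second iperf run at each TCP window size
    rem the iperf server must already be running on the target: iperf -s
    for %%W in (8K 16K 32K 64K 128K 256K 512K) do (
        echo === window size %%W ===
        iperf -c 192.168.0.10 -w %%W -t 30
    )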


Satisfied with these results, I set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize as a DWORD with a value of 65536 (the most efficient CPU/throughput trade-off, as shown above).
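
(Roughly the equivalent from the command line; a sketch only -- back up the key first, and a reboot is needed before TCP picks up the change:)

    rem set the global TCP receive window to 64 KB (the sweet spot from the sweep above)
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpWindowSize /t REG_DWORD /d 65536 /f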

At 3 times my original throughput numbers, it was now time to test iSCSI throughput.

I tested my RAID 0 2x Seagate array locally to determine my theoretical maximum throughput... which was 95.3 MB/sec average and 246 MB/sec burst.

I then ran the same tests remotely... resulting in 52 MB/sec average and 78.2 MB/sec burst.

A remote RAM disk has an average throughput of 77 MB/sec with a burst of 91.5 MB/sec.


Since local access to the RAID array is faster than the burst rate of a remote RAM disk -- a difference of 39.5 MB/sec -- I conclude that this represents the overhead of the physical disk I/O, and that network bandwidth is still available for exploitation.

At this point I broke the RAID 0 configuration and created two JBODs of 160GB each... for some reason my nVidia motherboard wouldn't show the drives separately without the JBOD config (I need to investigate this further)... however, for the purposes of this test we can ignore this behavior.

What I was interested in was the performance of EACH drive separately... after reconfiguration:
- Drive A
  - local: 48.1 MB/sec average, 133.4 MB/sec burst
  - remote: 42.9 MB/sec average, 67.9 MB/sec burst
- Drive B
  - local: 48.1 MB/sec average, 133 MB/sec burst
  - remote: 42.6 MB/sec average, 68.2 MB/sec burst

Strange that the percentage overhead doesn't remain constant between the RAID 0 and JBOD configurations. At this point I can't explain the discrepancy, but it makes me question the previous conclusion with regard to disk I/O overhead... something else appears to be happening.

I figured that since Windows can do software striping, I could mount Drive A and Drive B via iSCSI and stripe them as software RAID 0. My hope was that the extra network bandwidth could be used to gain better disk throughput. Unfortunately this didn't work... not sure why... perhaps it's because the disks were image files, but Windows couldn't convert the disks to dynamic (roughly the steps I attempted are sketched below).
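
(For the record, roughly what I was attempting, as a diskpart sketch -- the disk numbers are just placeholders for whatever the two iSCSI disks show up as:)

    rem inside diskpart, with both iSCSI drives attached via StarPort
    rem this is where it fell over for me: convert dynamic was refused
    select disk 1
    convert dynamic
    select disk 2
    convert dynamic
    rem then stripe a RAID 0 volume across the two dynamic disks
    create volume stripe disk=1,2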

I'll save that test for another day.


Bottom line... 50 MB/s average throughput... Though nowhere near the performance of my EMC SAN, it is respectable, and I have a feeling I'm just scratching the surface.


I look forward to hearing other ideas and results!




Chuck

Guest

Sun Nov 28, 2004 10:30 pm

As I've said, I now have the same configuration, having swapped the Intel Bonanza i875P P4 2.4GHz based machine for an nVidia nForce3 250Gb with an AMD Athlon 64 3400+. Things did not change a lot: I now get approx. 950 Mbps in a single direction instead of the 980 Mbps I had before. You may apply the following changes to the TCP stack parameters (a command-line sketch follows the list):

1) Increase the TCP window size (TcpWindowSize) from 64K to 20M (0x01400000)
2) Add a Tcp1323Opts value of 3 (DWORD) in the same key
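
Something like this from the command line (just a sketch; adjust to taste and reboot afterwards):

    rem 1) TcpWindowSize = 20M (0x01400000); window scaling below must be enabled for windows over 64K
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpWindowSize /t REG_DWORD /d 20971520 /f
    rem 2) Tcp1323Opts = 3 enables RFC 1323 window scaling and timestamps
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v Tcp1323Opts /t REG_DWORD /d 3 /f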

In my configuration, 9K jumbo frames also behave best. I tried 16K with the Intel card but performance went down :-(

For now I'm still able to get around 150 MB/sec in full-duplex mode over the network (IOmeter on the client and a RAM disk emulation configured on the StarWind server). You may try to repeat the results :-)
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Sun Nov 28, 2004 10:38 pm

Whoops.. Last post was from me :-)
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
