BUG: Incredibly high latency on HA LUNs in version 7509


epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Thu Jan 22, 2015 4:03 pm

Recently upgraded to version 7509 to solve an issue with intermittent high latency in the previous v8 build, and the newest version increased latency immensely.

Storage latency went from up to 40ms to the 1100ms range; the only change was moving from v6 to v8.

I already have a case open on this, but I want to share it with anyone thinking of moving to v8 and using HA: I would NOT do it until this gets resolved.
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Fri Jan 23, 2015 4:23 am

When trying to restart the StarWind service on this HA member to bypass caching, as directed by support, the service would not stop in a timely fashion. I waited until I saw no more IO on the system before having to kill the service; it took at least 15 minutes to stop it.

I assume this is related to the latency issue. Anyone else seeing this as well?
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Fri Jan 23, 2015 4:34 am

I've attached a .jpg of what latency looks like before and after applying the workaround, which is disabling L1 cache on the HA LUNs. Note the fix being applied around the 11:20pm mark.
Attachments
Latency.jpg
Simmo
Posts: 1
Joined: Fri Jan 23, 2015 11:33 am

Fri Jan 23, 2015 11:37 am

We are also seeing this.

Since L1 caching is a major feature of the product, I hope this will be fixed soon.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Fri Jan 30, 2015 12:31 pm

Thanks for sharing that! R&D is aware of this, and we are expecting the fix in two to three weeks.

Thank you for your patience.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Dillon
Posts: 2
Joined: Fri Aug 29, 2014 3:36 pm

Mon Feb 02, 2015 4:52 pm

I've had the same issue since November. Support was very helpful in identifying the problem; I just haven't pushed the issue because it's so fast even without caching. Would be nice to use my RAM and SSDs though :)
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Mon Feb 02, 2015 5:32 pm

How do I disable L1?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Feb 02, 2015 5:40 pm

Gotcha! You will all be notified once we have an update.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Mon Feb 02, 2015 6:50 pm

lohelle wrote:How do I disable L1?
In the Starwind.cfg file, look for instances of "wb" or "wt", corresponding to each cache type, and replace them with "none".

This only needs to be done for LUNs that are in an HA configuration.

Enjoy
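
For anyone who wants to script the edit, here is a minimal sketch (assuming the cache mode appears in Starwind.cfg as a quoted "wb"/"wt" value; the exact file layout and install path vary by version, so keep a backup and apply the change only to the HA LUN entries):

    # Sketch only: swaps quoted "wb"/"wt" cache-mode values for "none" in Starwind.cfg
    # after taking a backup. The quoted-value assumption and the install path below
    # are guesses - review the patched file and keep only the HA LUN changes.
    import re
    import shutil

    CFG = r"C:\Program Files\StarWind Software\StarWind\Starwind.cfg"  # adjust to your install

    shutil.copyfile(CFG, CFG + ".bak")  # keep the original around

    with open(CFG, "r", encoding="utf-8") as f:
        text = f.read()

    # Replace write-back ("wb") and write-through ("wt") cache modes with "none".
    patched = re.sub(r'"(wb|wt)"', '"none"', text)

    with open(CFG, "w", encoding="utf-8") as f:
        f.write(patched)

    print("Cache modes replaced; restart the StarWind service to apply the change.")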
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Tue Feb 03, 2015 8:22 pm

I can confirm that this is correct.
Elvis, thank you for answering this.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 03, 2015 8:26 pm

Thanks!

I hope the fix will be available soon. Most of my datastores are on SSDs, but I expect that my single large SATA/HDD-LUN will be very slow without cache.
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 03, 2015 10:36 pm

Actually, after setting the cache mode to "none" on all the LUNs (on one of the nodes), some LUNs are still reporting WB cache in the GUI. I also see that StarWind is using 2-3 GB of RAM.
That is very strange. I have double-checked the cfg file.
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 10, 2015 7:37 am

Well, I was burned by this error again.

I changed the cache setting on only one node, because the service will not stop properly on its own and a restart would therefore require a full sync.
Last night my production environment went down again. I had to restart all vSphere hosts, as I could not rescan the software iSCSI adapter (another scan was already in progress, or something like that).

It's working now, with cache, but only on one host. It's better right now to run a single node with cache than two without, as I have a few LUNs on regular spinning HDDs.

I hope the fix will be out soon!

Is this error present in all v8 editions? Is a downgrade (to a prior v8 version) possible?
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Tue Feb 10, 2015 2:37 pm

I'll let SW answer the downgrade question, but the reason I upgraded was a WRITE latency message I was continually receiving in the previous version. I installed the new version to fix that issue, but now the latency is actually happening, as opposed to just being reported.

This may or may not be related to SW, but in my vSphere environment I've opted for static discovery, as opposed to dynamic discovery, with the software iSCSI initiator. I find that it cuts the boot time of my vSphere hosts from 20 minutes down to 5 minutes, and it also drastically improved my rescan times in vSphere. Might be worthwhile for you.

+1 on getting this fix out soon. I'm paying for the HA product, but am only able to use the free one..
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 10, 2015 4:25 pm

Yup, static will speed up rescan/boot a lot, but I have too many LUNs and paths to use static, I think. Might write a script to handle it. :)
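
For reference, a minimal pyVmomi sketch of what such a script could look like (the host names, HBA device, portal address and IQN below are placeholders, and the connection details are assumed):

    # Sketch only: registers a static iSCSI target on an ESXi host's software iSCSI
    # adapter via the vSphere API (pyVmomi). Host names, the HBA device and the
    # portal address/IQN are placeholders - substitute your own values.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ctx)
    try:
        # Look up the ESXi host and its storage system.
        host = si.content.searchIndex.FindByDnsName(dnsName="esxi01.local", vmSearch=False)
        storage = host.configManager.storageSystem

        target = vim.host.InternetScsiHba.StaticTarget(
            address="192.168.10.11",  # StarWind node portal IP
            port=3260,
            iScsiName="iqn.2008-08.com.starwindsoftware:node1-lun1",
        )
        # Add the static target to the software iSCSI adapter and rescan it.
        storage.AddInternetScsiStaticTargets(iScsiHbaDevice="vmhba33", targets=[target])
        storage.RescanHba("vmhba33")
    finally:
        Disconnect(si)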