BUG: Incredibly high latency on HA LUNs in version 7509


epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Thu Jan 22, 2015 4:03 pm

Recently upgraded to version 7509 to solve an issue with intermittent high latency in the previous v8 build, and the newest version increased latency immensely.

Storage latency went from up to 40ms to the 1100ms range; the only change was moving from v6 to v8.

I already have a case open on this, but I want to share it with anyone thinking of moving to v8 and using HA: I would NOT do it until this gets resolved.
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Fri Jan 23, 2015 4:23 am

When trying to restart the StarWind service on this HA member to bypass caching, as directed by support, the service would not stop in a timely fashion. I waited until I saw no more IO on the system before having to kill the service; it took at least 15 minutes to stop it.

I assume this is related to the latency issue. Anyone else seeing this as well?
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Fri Jan 23, 2015 4:34 am

I've attached a .jpg of what latency looks like before and after applying the workaround, which is disabling L1 cache on the HA LUNs. Note the fix being applied around the 11:20pm mark.
Attachments
Latency.jpg
Simmo
Posts: 1
Joined: Fri Jan 23, 2015 11:33 am

Fri Jan 23, 2015 11:37 am

We are also seeing this.

Since L1 caching is a major feature of the product, I hope this will be fixed soon.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Fri Jan 30, 2015 12:31 pm

Thanks for sharing that! R&D is aware of this, and we are expecting the fix in two to three weeks.

Thank you for your patience.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Dillon
Posts: 2
Joined: Fri Aug 29, 2014 3:36 pm

Mon Feb 02, 2015 4:52 pm

I've had the same issue since November. Support was very helpful in identifying the problem; I just haven't pushed the issue because it's so fast even without caching. Would be nice to use my RAM and SSDs though :)
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Mon Feb 02, 2015 5:32 pm

How do I disable L1?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Feb 02, 2015 5:40 pm

Gotcha! You will all be notified once we have an update.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Mon Feb 02, 2015 6:50 pm

lohelle wrote:How do I disable L1?
In the Starwind.cfg file, look for instances of "wb" or "wt", corresponding to each cache type, and replace them with "none".

This only needs to be done for LUNs that are in an HA configuration.

Enjoy
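
For anyone who wants to script the edit, here is a minimal sketch (assuming the cache mode appears in Starwind.cfg as a quoted "wb"/"wt" value; the exact file layout and install path vary by version, so keep a backup and apply the change only to the HA LUN entries):

    # Sketch only: swaps quoted "wb"/"wt" cache-mode values for "none" in Starwind.cfg
    # after taking a backup. The quoted-value assumption and the install path below
    # are guesses - review the patched file and keep only the HA LUN changes.
    import re
    import shutil

    CFG = r"C:\Program Files\StarWind Software\StarWind\Starwind.cfg"  # adjust to your install

    shutil.copyfile(CFG, CFG + ".bak")  # keep the original around

    with open(CFG, "r", encoding="utf-8") as f:
        text = f.read()

    # Replace write-back ("wb") and write-through ("wt") cache modes with "none".
    patched = re.sub(r'"(wb|wt)"', '"none"', text)

    with open(CFG, "w", encoding="utf-8") as f:
        f.write(patched)

    print("Cache modes replaced; restart the StarWind service to apply the change.")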
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Tue Feb 03, 2015 8:22 pm

I can confirm that this is correct.
Elvis, thank you for answering this.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 03, 2015 8:26 pm

Thanks!

I hope the fix will be available soon. Most of my datastores are on SSDs, but I expect that my single large SATA/HDD-LUN will be very slow without cache.
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 03, 2015 10:36 pm

Actually, after setting the cache mode to "none" on all the LUNs (on one of the nodes), some LUNs are still reporting WB cache in the GUI. I also see that StarWind is using 2-3 GB of RAM.
That is very strange. I have double-checked the cfg file.
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 10, 2015 7:37 am

Well, I was burned by this error again.

I changed the cache setting on only one node, because the service will not stop properly on its own and a restart would therefore require a full sync.
Last night my production environment went down again. I had to restart all vSphere hosts, as I could not rescan the software iSCSI adapter (another scan was already in progress, or something like that).

It's working now, with cache, but only on one host. It's better right now to run a single node with cache than two without, as I have a few LUNs on regular spinning HDDs.

I hope the fix will be out soon!

Is this error present in all v8 editions? Is a downgrade (to a prior v8 version) possible?
epalombizio
Posts: 67
Joined: Wed Oct 27, 2010 3:40 pm

Tue Feb 10, 2015 2:37 pm

I'll let SW answer the downgrade question, but the reason I upgraded was a WRITE latency message I was continually receiving in the previous version. I installed the new version to fix that issue, but now the latency is actually happening, as opposed to just being reported.

This may or may not be related to SW, but in my vSphere environment I've opted for static discovery, as opposed to dynamic discovery, with the software iSCSI initiator. I find that it cuts the boot time of my vSphere hosts from 20 minutes down to 5 minutes, and it also drastically improved my rescan times in vSphere. Might be worthwhile for you.

+1 on getting this fix out soon. I'm paying for the HA product, but am only able to use the free one..
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Tue Feb 10, 2015 4:25 pm

Yup, static will speed up rescan/boot a lot, but I have too many LUNs and paths to use static, I think. Might write a script to handle it. :)
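
For reference, a minimal pyVmomi sketch of what such a script could look like (the host names, HBA device, portal address and IQN below are placeholders, and the connection details are assumed):

    # Sketch only: registers a static iSCSI target on an ESXi host's software iSCSI
    # adapter via the vSphere API (pyVmomi). Host names, the HBA device and the
    # portal address/IQN are placeholders - substitute your own values.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
    si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ctx)
    try:
        # Look up the ESXi host and its storage system.
        host = si.content.searchIndex.FindByDnsName(dnsName="esxi01.local", vmSearch=False)
        storage = host.configManager.storageSystem

        target = vim.host.InternetScsiHba.StaticTarget(
            address="192.168.10.11",  # StarWind node portal IP
            port=3260,
            iScsiName="iqn.2008-08.com.starwindsoftware:node1-lun1",
        )
        # Add the static target to the software iSCSI adapter and rescan it.
        storage.AddInternetScsiStaticTargets(iScsiHbaDevice="vmhba33", targets=[target])
        storage.RescanHba("vmhba33")
    finally:
        Disconnect(si)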