5.7 Beta Upgraded to 5.7 - Losing HA Sync Now

Public beta (bugs, reports, suggestions, features and requests)

rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Tue Jul 26, 2011 6:12 pm

I upgraded our 5.7 Beta HA SAN to the released version of 5.7. Now, when one of our HP BL465c G7 servers (NC551i iSCSI HBAs, MPIO round robin, running W2K8) puts a heavy write load on a target, the target loses sync. Sometimes it even causes all of the targets on the SAN to lose sync and the management console to disconnect. It stopped happening for a few hours yesterday after I applied the newest license file that came with our service renewal, so I thought that, strangely enough, the license might have fixed it. It is happening again, though. I didn't see this in 5.7 Beta, but testing from the new SQL blade servers was extremely limited before this week, so we may have had the issue before without my knowing it. Has anyone else run into a similar problem? The new SQL servers are still in testing, but I need to fix this before we go into production.
Alex (staff)
Staff
Posts: 177
Joined: Sat Jun 26, 2004 8:49 am

Wed Jul 27, 2011 9:53 am

Thank you for the message! We found this issue a couple of days ago. The fix is ready, and we are testing it now so that we can update the release version.
Best regards,
Alexey.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Thu Jul 28, 2011 2:13 pm

Any chance I'll have the fix before the weekend? I have a very busy schedule of testing next week and it would be quite helpful if I can upgrade the SANs over the weekend.
Alex (staff)
Staff
Posts: 177
Joined: Sat Jun 26, 2004 8:49 am

Thu Jul 28, 2011 2:37 pm

I am not sure. Testing can take several days, because this issue is not easy to reproduce.
Best regards,
Alexey.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Fri Jul 29, 2011 3:39 pm

I turned on flow control on the switches between the SANs and the 10 GbE flex connects in the blade systems a few days ago. We haven't had it drop sync since then. Is it possible that is the fix for us? I'm going to beat the heck out of it over the weekend and see if I can get it to fail again.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Mon Aug 01, 2011 1:05 pm

Enabling flow control on the switches seemed to fix the out-of-sync problems, but it also seemed to cause pauses of a few seconds at a time and killed performance. I have now disabled flow control on the NICs and the switches. Performance is back up to where it should be, and so far I haven't knocked the iSCSI targets out of sync. I'm going to continue thrashing it to see what happens.
Alex (staff)
Staff
Posts: 177
Joined: Sat Jun 26, 2004 8:49 am

Mon Aug 01, 2011 1:18 pm

Thank you for the update!

The issue that caused the sync loss is related to the time it takes to get a response from the HA partner. It looks like changing the flow control settings affects the response time, and so affects whether the issue appears.
Best regards,
Alexey.
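
[Editor's illustration] The reply above ties sync loss to how long the HA partner takes to answer. The sketch below is only a toy model of that idea, not StarWind's implementation: the class `PartnerLink`, the timeout value, and the delay figures are all hypothetical, chosen to show how anything that changes response time can move a node across a response-time limit.

```python
from dataclasses import dataclass


@dataclass
class PartnerLink:
    """Hypothetical model of the link to the HA partner (all values are assumptions)."""
    base_latency_ms: float   # assumed normal round-trip time for a mirrored write
    ack_timeout_ms: float    # assumed limit after which the node gives up on the partner


def response_time_ms(link: PartnerLink, extra_delay_ms: float) -> float:
    """Total time to get an answer from the partner, including any added network delay."""
    return link.base_latency_ms + extra_delay_ms


def loses_sync(link: PartnerLink, extra_delay_ms: float) -> bool:
    """True when the partner's answer arrives later than the assumed timeout."""
    return response_time_ms(link, extra_delay_ms) > link.ack_timeout_ms


# Illustrative numbers only; none of them come from the product or the thread.
link = PartnerLink(base_latency_ms=2.0, ack_timeout_ms=1000.0)
for extra_delay in (0.0, 500.0, 2000.0):
    print(f"extra delay {extra_delay:7.1f} ms -> loses sync: {loses_sync(link, extra_delay)}")
```

In this model the outcome depends only on whether the response time crosses the limit, which is why a network setting that shifts latency in either direction could make the problem appear or disappear.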
nbarsotti
Posts: 38
Joined: Mon Nov 23, 2009 6:22 pm

Mon Aug 01, 2011 3:14 pm

So if there is a known bug in 5.7, when will an updated build of 5.7 be released? I don't want to decrease my stability when I upgrade from 5.6. How long should I wait?
Alex (staff)
Staff
Posts: 177
Joined: Sat Jun 26, 2004 8:49 am

Mon Aug 01, 2011 3:21 pm

No later than Thursday, August 4th.
Best regards,
Alexey.
nbarsotti
Posts: 38
Joined: Mon Nov 23, 2009 6:22 pm

Mon Aug 01, 2011 3:33 pm

Great, I will plan my upgrade to 5.7 on Friday afternoon, PDT.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Mon Aug 01, 2011 3:47 pm

I don't know about other people's setups, but so far I'm seeing a nice performance increase, and it appears to be stable with flow control turned off across the board. Of course, this iSCSI network is all HP 10GbE, so your mileage definitely may vary. I've been thrashing it from multiple servers with combinations of reads and writes, and so far it's working great.
kmax
Posts: 47
Joined: Thu Nov 04, 2010 3:37 pm

Mon Aug 01, 2011 8:54 pm

With flow control on, did it affect max speed, or did you see more of a zig zag pattern? Meaning did it hit the max, then back down, then back up, etc. with it enabled?
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Mon Aug 01, 2011 9:06 pm

Zig-zag pattern, with occasional pauses of a couple of seconds with no traffic. The HP NC522SFP+ NICs in the servers really don't seem to play nicely with flow control. Before I upgraded the firmware on them in our DL380 G7s running vSphere 4.1, the ones used for iSCSI on vSphere would actually shut down from the pause packet flooding, and I would have to power the server off and back on to fix it.
Bohdan (staff)
Staff
Posts: 435
Joined: Wed May 23, 2007 12:58 pm

Mon Aug 08, 2011 9:32 am

StarWind 5.7 was updated. It should solve the problem. Please let us know about the results.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Sun Aug 14, 2011 9:27 pm

Upgraded to the latest version this weekend. Upgrading the 1st node went perfectly. Everything fast synced in a matter of minutes. The 2nd node hung for a while during the startup of the service, which seems to have caused about half of the targets to lose sync and need a full sync. I'm seeing over 12 Gb/s on the sync, though, with 2 10GbE NICs teamed. :shock:

We'll bang on it really hard this week and see what happens. I like where the performance is going. Can't wait until the SSD cache integration. I should have 48 Intel X25-E 32GB drives available for it.