I/O errors after synchronization failure


Andre
Posts: 4
Joined: Sun May 29, 2016 5:12 pm

Sun May 29, 2016 5:25 pm

This morning, an as yet unknown problem caused all the synchronization connections between two StarWind SANs to fail. These synchronization connections run on separate NICs from the iSCSI traffic.

In a situation like that, storage traffic should not be affected. However, from the moment the synchronization failure occurred, all VMs running on that storage started getting I/O errors.

Initially, to deal with this, we manually disconnected the iSCSI sessions of the secondary SAN, to make sure that all I/O would go to the primary SAN. But this didn't make any difference.

After this, we manually marked the target on the primary SAN as "synchronized". Immediately, all VMs were able to connect to the target again without a problem. The secondary SAN also started synchronizing again, and that completed successfully.

So now my question is: why would the primary SAN not provide I/O to the VMs, when the SANs are not in sync? I mean, it makes sense to me that the secondary SAN (which is not in sync) wouldn't work in that kind of situation.. but the primary SAN should just continue to work as normal. Am I missing anything, or could this be a bug?
Dmitry (staff)
Staff
Posts: 82
Joined: Fri Mar 18, 2016 11:46 am

Mon May 30, 2016 8:53 am

Hello Andre,

Could you clarify which StarWind version you are using?

You can find this information in the StarWind Management Console, in the top drop-down menu under Help, About StarWind Management Console.

Thank you.
Andre
Posts: 4
Joined: Sun May 29, 2016 5:12 pm

Mon May 30, 2016 9:04 am

Version 5.8.2013

I'm aware that this version is outdated (and isn't officially supported anymore), and we will upgrade it to a newer version in the future, but at the moment this is not an option yet. As such, I hope you can address my question anyway.
Andre
Posts: 4
Joined: Sun May 29, 2016 5:12 pm

Mon May 30, 2016 3:40 pm

While working on something else, I just found this: https://knowledgebase.starwindsoftware. ... -blackout/

This behavior sounds similar to what we were experiencing: it seemed like all incoming connections were getting blocked, until we manually marked one of the two SANs as synchronized.

This does make sense in case of a total blackout, but not in our situation. In our situation, shouldn't the primary SAN just continue operating? I mean.. there's a reason there's a primary and a secondary..
Dmitry (staff)
Staff
Posts: 82
Joined: Fri Mar 18, 2016 11:46 am

Tue May 31, 2016 9:04 am

In the case of a power outage, it makes no difference which node is primary and which is secondary.
If one node were powered off a few minutes later, StarWind would know that it has the latest data, and after recovery it would regain all connections, because its status would be synchronized.
Once the synchronization process had finished, you would have connections from both nodes.

Anyway, I highly recommend that you upgrade StarWind to the latest version.
Upgrade instructions are here: https://knowledgebase.starwindsoftware. ... d-version/
The new build is available here: http://www.starwindsoftware.com/customer-page

Thank you.
Andre
Posts: 4
Joined: Sun May 29, 2016 5:12 pm

Tue May 31, 2016 9:14 am

Dmitry (staff) wrote: In the case of a power outage, it makes no difference which node is primary and which is secondary.
If one node were powered off a few minutes later, StarWind would know that it has the latest data, and after recovery it would regain all connections, because its status would be synchronized.
Once the synchronization process had finished, you would have connections from both nodes.

I understand, but none of that is relevant here. Both nodes remained online, but for some reason all four synchronization connections between one pair of targets went offline. What's also a bit odd is that the other three target pairs remained online without any problem.

So what happened is:

- Both nodes remained online
- The synchronization between one pair of targets (storage01 and storage01p) went offline (we don't know why, but that's also not the point here)
- At this point, all I/O to this pair of targets stopped completely, probably because both targets were marked as "not synchronized"
- When we marked storage01p as synchronized, I/O became available again, and storage01 started a full resynchronization
- All the other target pairs (storage02/storage02p, storage03/storage03p, storage04/storage04p) remained unaffected the whole time
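
The sequence above can be sketched as a toy state model. This only mirrors what was observed in this incident; the class, its methods, and the rule that I/O needs at least one synchronized side are assumptions for illustration, not StarWind's actual logic.

```python
# Toy model of the observed behaviour; names and rules are hypothetical.
class TargetPair:
    def __init__(self, target, partner):
        # Both sides start out in sync.
        self.synchronized = {target: True, partner: True}

    def sync_link_failed(self):
        # Observed in this incident: both sides ended up "not synchronized".
        for name in self.synchronized:
            self.synchronized[name] = False

    def mark_as_synchronized(self, name):
        # Manual operator action in the management console.
        self.synchronized[name] = True

    def serves_io(self):
        # Assumption: I/O is served only while at least one side is synchronized.
        return any(self.synchronized.values())


pair = TargetPair("storage01", "storage01p")
io_before = pair.serves_io()            # both sides in sync
pair.sync_link_failed()
io_during = pair.serves_io()            # neither side trusted, I/O blocked
pair.mark_as_synchronized("storage01p")
io_after = pair.serves_io()             # one side trusted again
```

Under that assumption, the I/O stop follows directly from both targets losing their synchronized status, which is the part that still needs explaining.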

So essentially the question is: why did both storage01 and storage01p get marked as "not synchronized" (causing the I/O to get blocked)? It doesn't make any sense to me that this caused all of the I/O to get interrupted in this scenario, because that just seems totally unnecessary.

Additional note: only the synchronization connections were unavailable, the iSCSI connections were fine. It really looks like the problem was that both storage01 and storage01p were marked as "not synchronized", while that just seems totally unnecessary.
Dmitry (staff)
Staff
Posts: 82
Joined: Fri Mar 18, 2016 11:46 am

Tue Jun 07, 2016 8:34 am

This is expected behaviour. Whenever you shut down both StarWind storage controller nodes, they will come up with all StarWind devices in the "Non-Synchronized" state.
This state applies only when you shut down both nodes. The mechanism exists because the StarWind nodes do not know which of them has the most recent data.
To make StarWind storage available to clients, you need to "mark as synchronized" the devices that have the most recent data on them.

If this mechanism did not exist, there could be a situation where you shut down node 01 and then, one week later, shut down the second node. After bringing them both online, the devices on node 01 would automatically become the synchronization source, and all your data would become one week old.
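
The one-week scenario can be illustrated with a minimal sketch. Everything in it is hypothetical (the `Node` class, the `last_write_day` field, and both policies); it only shows why automatically picking a source after a full shutdown can roll data back, and it is not how StarWind is implemented.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_write_day: int        # hypothetical: day of the last write this node accepted
    synchronized: bool = False


def naive_auto_source(nodes):
    """Hypothetical unsafe policy: always pick the first node as sync source."""
    return nodes[0]


def start_after_full_shutdown(nodes):
    """Safer policy: trust no node until an operator picks one."""
    for node in nodes:
        node.synchronized = False


def mark_as_synchronized(node):
    """Manual operator decision: this node holds the most recent data."""
    node.synchronized = True
    return node


node01 = Node("node01", last_write_day=0)   # shut down first
node02 = Node("node02", last_write_day=7)   # kept serving writes for a week

# Unsafe: node01 becomes the source and a week of writes is discarded.
naive = naive_auto_source([node01, node02])
days_lost = node02.last_write_day - naive.last_write_day

# Safe: both come up non-synchronized, the operator picks the newer node.
start_after_full_shutdown([node01, node02])
chosen = mark_as_synchronized(node02)
```

Here `days_lost` is 7 under the naive policy, while the manual step lets the operator pick `node02` as the source.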

Please let us know if you have any questions left.

Thank you.