Correct procedure after planned downtime

Software-based VM-centric and flash-friendly VM storage + free version


robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 12:21 pm

So what *exactly* would you like to see in terms of automatic recovery for an absolutely broken shutdown?
I think we should concentrate on getting a clean shutdown and power-up working first.

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 1:41 pm

LSFS storage re-created and the same test carried out. StarWind v8 does not recover from a planned and controlled power-down and power-up. You have to manually mark one of the nodes as synchronised, and then it jumps back into action.

Cheers, Rob.
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Sep 22, 2014 2:45 pm

Engineers will follow up with you on that. I'm lost at this point.
robnicholson wrote:LSFS storage re-created and the same test carried out. StarWind v8 does not recover from a planned and controlled power-down and power-up. You have to manually mark one of the nodes as synchronised, and then it jumps back into action.

Cheers, Rob.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 2:53 pm

Interesting: I noticed that I wasn't on the latest version in the lab, so I just upgraded node #1 from build 6884 to build 7185. In this instance, after the service restarted, fast synchronisation actually ran, i.e. I didn't have to do anything.

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Tue Sep 23, 2014 8:56 am

Thanks for the heads-up. Please keep us updated.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Sep 24, 2014 1:55 pm

OK, so this is how it works now. After all cluster nodes were down and are now back up, StarWind is ...

... trying to see whether any of the nodes have recent data (a log of transactions is kept on every node) *AND* whether that data is in an integral state (caches flushed properly, so there are no partial transactions). If YES, then StarWind does an automatic sync and powers up the virtual LUNs so they can serve customers. If NO, then StarWind waits for the operator, so a human can manually point to the node to assign "synchronized" status to (incomplete writes are purged) and start the synchronization and LUN power-up process from it.

Problem: if the caches were not flushed and had to be purged, the automatic process never starts and StarWind waits for operator intervention. Customers don't understand this and walk away with the major misconception that StarWind has NO automatic recovery (which is not true).

Solution: a) tell the customer (System Event Log message? StarWind notification?) that the scheduled and selected "automatic" recovery is not going to start because of ..., and b) allow the customer a "last resort": the ability to nominate a "default" node to start synchronization from if StarWind cannot detect who has the most recent version of the data. Obviously b) should be OPTIONAL, as there's a risk of having quite a lot of data purged.

Is this OK for everybody? We'll include the two actions listed as a) and b) in one of the upcoming builds.
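To make this concrete, here is a rough sketch of the decision logic above in Python-style pseudocode. Every name in it (recover_cluster, notify_operator, the node fields) is illustrative only, NOT our actual code or API:

Code: Select all

# Illustrative sketch only - not StarWind source code or APIs.

def notify_operator(message):
    # Stand-in for a System Event Log message / StarWind notification.
    print("NOTIFY:", message)

def recover_cluster(nodes, default_node=None):
    """Decide whether the HA LUNs can be powered up automatically.

    Each node is assumed to be a dict with:
      'name'            - node identifier
      'has_recent_data' - True if its transaction log is current
      'last_tx_time'    - timestamp of the newest transaction log entry
      'integral'        - True if caches were flushed (no partial transactions)
    """
    # Automatic path: some node has recent data *AND* that data is integral.
    candidates = [n for n in nodes if n["has_recent_data"] and n["integral"]]
    if candidates:
        source = max(candidates, key=lambda n: n["last_tx_time"])
        return "auto-sync from %s, then power up LUNs" % source["name"]

    # b) Optional last resort: a pre-nominated default node to sync from,
    #    accepting that its incomplete writes get purged.
    if default_node is not None:
        return "purge partial writes, sync from %s" % default_node["name"]

    # a) Otherwise tell the operator why automatic recovery will not start
    #    and wait for a manual "mark as synchronized".
    notify_operator("Automatic recovery cannot start: no node has integral, "
                    "recent data. Waiting for operator intervention.")
    return "waiting for operator"
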
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 4:44 pm

OK, so this is how it works now. After all cluster nodes were down and are now back up, StarWind is ...
Hmm, sort of, but the devil is in the detail here. It depends what you mean by "all cluster nodes were down". I'm not talking about stopping & starting StarWindService alone here; I'm talking about taking the cluster nodes down by shutting down the Windows servers via the normal mechanism.
... trying to see whether any of the nodes have recent data (a log of transactions is kept on every node) *AND* whether that data is in an integral state (caches flushed properly, so there are no partial transactions).
In my lab, yes, this is the case, because there are *no* iSCSI initiators connected to the targets and the system was left for many minutes before shutting down. Therefore nothing was written, or could have been written, for 5+ minutes. I therefore assume any residual sync & caches have been flushed?
If YES, then StarWind does an automatic sync and powers up the virtual LUNs so they can serve customers.
Sorry, that does not happen. I will try to generate a video from the lab.

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Thu Sep 25, 2014 5:48 pm

I've produced a video to demonstrate this issue, as I'm not sure you believe me! :D

https://dl.dropboxusercontent.com/u/366 ... Wind01.wmv

Timeline:
  • 00:00 Two node StarWind cluster already exists with no storage
  • 00:04 Create Storage1 as 5GB of thick HA storage on both nodes
  • 00:36 Wait for 2nd node to synchronise - after about 3 minutes, sync finished (separate question about the sync time below). I'm watching it reading storage1.img
  • 02:28 Synchronised according to the display - watch resource meter until read of storage1.img has finished
  • 03:05 Showing that there are no targets connected except those used by StarWind itself for sync and heartbeat
  • 03:20 Switch to 2nd node and check everything looks okay
  • 03:23 Clean shutdown of UKMAC-SAN91 (2nd node)
  • 03:48 Node #1 reports it's lost connection to node #2
  • 04:00 Clean shutdown of UKMAC-SAN90 (1st node)
  • 04:27 Restart both nodes (recording paused here to wait for power-up)
  • 04:55 Logon to node #1 - errors about StarWind connections but expected as service is on delayed start-up
  • 06:40 Wait for the services on both nodes to start
  • 07:40 Connect to both nodes
  • 07:50 Both nodes offline, not accepting connections (not that anyone has ever connected in this demo)
  • 08:20 One has to manually mark node #1 as synchronised (the manual step!)
  • 09:00 Waiting for full synchronisation to occur - WHY full? You saw that nothing was ever written to this storage and it was in sync at 02:28. It took roughly another 3 minutes to sync (I paused the recording), and that was for just 5GB. Consider how long a 5TB device would have taken... I'm guessing it's because I had to do a manual mark as synchronised.
Not riveting viewing, but hopefully this shows the problem. Having to carry out the manual step at 08:20 is the crux of it. The above sequence could easily have been triggered by a power outage with controlled shutdown and power-up via the UPS system.

Also, it demonstrates pretty slow initial replication, and re-replication after the manual marking as synchronised. My lab isn't very fast, but the disks & network can transfer 5GB in well under 3 minutes.
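For the record, the back-of-envelope maths, assuming a gigabit sync link (an assumption on my part, the exact lab network isn't stated above):

Code: Select all

# Rough check of the sync throughput seen in the video.
size_mb = 5 * 1024          # 5GB device
seconds = 3 * 60            # roughly 3 minutes to sync
observed = size_mb / seconds            # ~28 MB/s
gigabit = 1000 / 8 * 0.9                # ~112 MB/s usable on 1 GbE
print("observed ~%.0f MB/s vs ~%.0f MB/s on gigabit" % (observed, gigabit))
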

So I hope this shows that the cluster was in a clean state with empty caches before shutdown (nothing was ever written to the storage, to be accurate) and that automatic resync DID NOT OCCUR.

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Thu Sep 25, 2014 5:49 pm

BTW - this video was recorded using Microsoft's ScreenRecorder, which doesn't generate key frames very often (to keep the file size small), so fast-forwarding through the video isn't that easy.

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Fri Sep 26, 2014 9:50 am

arinci wrote:
anton (staff) wrote:I've discussed the situation with the developers and it looks like there was some major confusion, as my information was a bit outdated:
I'd like to propose my idea, just my 2 cents, on how to manage this kind of situation. You wrote that each node sharing the storage is able to log the time of the last update. In this way each node knows, after having received the relevant info from the other participants, which node holds the last update of the shared storage... this is good... but what happens if all the nodes are switched off at different moments and then restarted? I'm thinking of two different scenarios:
1 - all the nodes are switched ON at the same time: in this way they are able to talk to each other and rebuild the shared storage properly
2 - the nodes are switched ON with different timing: in this case, when the StarWind service starts, it doesn't know whether the local image holds the last update or not... so the only safe thing to do is to wait for the other node(s) before making the HA storage available via the iSCSI interface. What to do if a server refuses to start? Some manual intervention is needed to mark one of the available participants of the shared storage as synchronized. :? :? :?

My suggestion is the following, for the case where a fully automatic restore is desired:
1) When the StarWind service starts, it waits for all the participants to be ready, for a defined amount of time (user-configurable, let's say 5 minutes).
2) If all the nodes become available in time, then the shared storage can be properly synced and started. If not all the nodes become available in time, then the available nodes select the node that holds the last update...
I'm aware that this kind of automatic logic may be dangerous for certain uses, so I suggest having an option to enable/disable this feature.

What do you think? Is it just a dream, or do you think there is a chance this dream comes true? :)

Simon
Apologies for the block quotes, but I'd missed Simon's response and it's worthy of further consideration. My suspicion is that the difference between one node going down & up (in which case auto-sync appears to work) and both nodes going down and up (as in a clean shutdown/power-up by UPS when power fails) is that when the SANs come back up, the node that starts first is unable to see the second node and so assumes an unknown state.

Simon suggests in #1 a mechanism to overcome this, and I think he's barking up the right tree. During a controlled power-down sequence, the nodes should be able to communicate with each other about who has the synchronised copy. Node #1 goes first: it stops processing iSCSI, flushes all caches, marks itself as "out of the loop" and powers down. Node #2 now knows that it is the SOLE kid on the block and may process a few more iSCSI requests until it receives its own shutdown. In a single-node shutdown, it may carry on forever.

Power-up sequence can occur in any order: one node will come up before the other.

However, node #1 knows that it's "offline/out of the loop". It knows there is a very high chance that it doesn't have the latest data. It does diddly squat until node #2 comes back online. Under no circumstances does it automatically start processing iSCSI I/O. It stays in this state forever until manual intervention. At some point, node #2 comes online. It knows (because of the handshaking above) that it's the sole kid on the block, so it immediately starts processing iSCSI requests. Node #1 notices node #2 is now there, synchronises, and then starts accepting iSCSI I/O.

This also works in the reverse order whereby node #2 comes up first and then node #1.

If, due to unknown circumstances, node #2 has failed, there should be a clear message on node #1 saying "Unable to reach node #2 and I do NOT have the latest copy", with dire warnings about marking it as the master copy. Manual intervention is unavoidable here: one has to make the call whether to bring node #1 online manually, and thus risk data loss, or wait until node #2 can be fixed. StarWind can't help with that call.

The other scenario is planned downtime. This is where I asked for a sort of maintenance mode where the HA cluster is put into a special mode before manually shutting down both nodes. On reflection, that wasn't the right wish.

What may be better here is an option to do a controlled manual shutdown of the cluster whereby both nodes are fully synchronised, i.e. no targets connected and everything flushed. One node is still flagged as being the sole operator and the other as offline - the same process as above happens. But if the sole operator node doesn't come back online for whatever reason, the message displayed on the offline node can be along the lines of "Node #2 is not responding and has the synchronised data. However, the last shutdown was controlled, and at that time both nodes were in complete sync". There are zillions of caveats here, but what we're trying to do is give an indication of how up to date the offline copy is.

Thinking further, in the case of a UPS shutdown, both servers will most likely receive the shutdown message at roughly the same time. The handshake could work a bit like this:

BEGIN scenario #1
Node #1 - I'm being asked to shut down but I'm going to hang fire for a minute to see if you are asked to shut down too
---normal operation for a minute and node #2 isn't asked to shut down---
Node #1 - okay, I'm flushing and powering down. Node #2 - you are the sole operator and I'm marking myself as offline
Node #2 - see you later!
END

BEGIN scenario #2
Node #1 - I'm being asked to shut down but I'm going to hang fire for a minute to see if you are asked to shut down too
---normal operation for a minute and node #2 IS asked to shut down---
Node #2 - yikes! I'm being asked to shut down too
Node #1 - okay, I've flushed and you are the sole operator and I'm marking myself as offline
Node #2 - gotcha, I've flushed too and I've noticed that neither of us has any iSCSI targets connected so you can also flag yourself as "pretty synchronised" ;-)
Node #1 - thanks, hope to see you later
END
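The same handshake as a rough Python sketch. Every class, state and message name here is invented to illustrate the idea; none of it is StarWind code:

Code: Select all

import time

GRACE_SECONDS = 60  # "hang fire for a minute" before committing to shutdown

class Node:
    """Toy in-memory node, purely to illustrate the handshake."""
    def __init__(self, name):
        self.name = name
        self.state = "online"
        self.inbox = set()

    def send(self, other, msg):
        other.inbox.add(msg)

def controlled_shutdown(me, partner, targets_connected, grace=GRACE_SECONDS):
    me.send(partner, "shutting-down-soon")
    deadline = time.time() + grace
    while time.time() < deadline and "shutting-down-soon" not in me.inbox:
        time.sleep(0.1)  # scenario #1 if this wait times out

    # (flush caches here before recording any state)
    if "shutting-down-soon" in me.inbox:
        # Scenario #2: the UPS is taking both nodes down. With no iSCSI
        # targets connected, record "in sync at controlled shutdown".
        me.state = ("synchronised-at-shutdown" if not targets_connected
                    else "offline")
    else:
        # Scenario #1: only this node goes down; partner is sole operator.
        me.state = "offline"
        me.send(partner, "you-are-sole-operator")

# Both nodes told to shut down, nothing connected (my lab case):
a, b = Node("node1"), Node("node2")
a.send(b, "shutting-down-soon")       # node #2 hears node #1 going down
controlled_shutdown(b, a, targets_connected=False, grace=0.2)
print(b.state)                        # synchronised-at-shutdown
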

The power-up process for all of these is identical. The sole operator node starts accepting iSCSI I/O immediately. The offline node does nothing until it is able to sync with the sole operator node. If the sole operator node doesn't come up, the IT engineer knows whether the offline node is "definitely out of sync" or "was in sync at the time of the controlled shutdown of both StarWind nodes". That will make their job a lot easier!
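And the power-up side of the sketch, folding in Simon's user-configurable wait. Again, all names are illustrative, not StarWind behaviour:

Code: Select all

import time

def power_up(my_state, peer_reachable, wait_seconds=300):
    """Power-up decision for one node. `my_state` is the flag recorded at
    shutdown; the wait is Simon's configurable timeout. Illustrative only."""
    if my_state == "sole-operator":
        return "serve iSCSI immediately; partner will sync from me"

    # Offline node: wait for the sole operator before serving anything.
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        if peer_reachable():
            return "sync from sole operator, then serve iSCSI"
        time.sleep(1)

    # Partner never appeared: say how stale this copy might be.
    if my_state == "synchronised-at-shutdown":
        return ("WARN: partner unreachable, but both nodes were in full "
                "sync at the last controlled shutdown")
    return ("WARN: partner unreachable and this node does NOT have the "
            "latest copy - manual promotion risks data loss")

print(power_up("synchronised-at-shutdown", lambda: False, wait_seconds=1))
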

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Fri Sep 26, 2014 9:51 am

PS. Apologies for teaching granny to suck eggs, but I just want this to work ;-) Also, I appreciate split-brain is a risk, but I think the above circumvents it.
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Fri Sep 26, 2014 9:34 pm

1) Unexpected shutdown included.

2) Flushing gigabytes of cache does not happen immediately.

3) You've missed the crucial "IF" statement.
robnicholson wrote:
OK, so this is how it works now. After all cluster nodes were down and are now back up, StarWind is ...
Hmm, sort of, but the devil is in the detail here. It depends what you mean by "all cluster nodes were down". I'm not talking about stopping & starting StarWindService alone here; I'm talking about taking the cluster nodes down by shutting down the Windows servers via the normal mechanism.
... trying to see whether any of the nodes have recent data (a log of transactions is kept on every node) *AND* whether that data is in an integral state (caches flushed properly, so there are no partial transactions).
In my lab, yes, this is the case, because there are *no* iSCSI initiators connected to the targets and the system was left for many minutes before shutting down. Therefore nothing was written, or could have been written, for 5+ minutes. I therefore assume any residual sync & caches have been flushed?
If YES, then StarWind does an automatic sync and powers up the virtual LUNs so they can serve customers.
Sorry, that does not happen. I will try to generate a video from the lab.

Cheers, Rob.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Fri Sep 26, 2014 9:42 pm

Now at least you'll have a clear statement of why we've refused to do automatic power-up.
robnicholson wrote:I've produced a video to demonstrate this issue, as I'm not sure you believe me! :D

https://dl.dropboxusercontent.com/u/366 ... Wind01.wmv

Timeline:
  • 00:00 Two node StarWind cluster already exists with no storage
  • 00:04 Create Storage1 as 5GB of thick HA storage on both nodes
  • 00:36 Wait for 2nd node to synchronise - after about 3 minutes, sync finished (separate question about the sync time below). I'm watching it reading storage1.img
  • 02:28 Synchronised according to the display - watch resource meter until read of storage1.img has finished
  • 03:05 Showing that there are no targets connected except those used by StarWind itself for sync and heartbeat
  • 03:20 Switch to 2nd node and check everything looks okay
  • 03:23 Clean shutdown of UKMAC-SAN91 (2nd node)
  • 03:48 Node #1 reports it's lost connection to node #2
  • 04:00 Clean shutdown of UKMAC-SAN90 (1st node)
  • 04:27 Restart both nodes (recording paused here to wait for power-up)
  • 04:55 Logon to node #1 - errors about StarWind connections but expected as service is on delayed start-up
  • 06:40 Wait for the services on both nodes to start
  • 07:40 Connect to both nodes
  • 07:50 Both nodes offline, not accepting connections (not that anyone has ever connected in this demo)
  • 08:20 One has to manually mark node #1 as synchronised (the manual step!)
  • 09:00 Waiting for full synchronisation to occur - WHY full? You saw that nothing was ever written to this storage and it was in sync at 02:28. It took roughly another 3 minutes to sync (I paused the recording), and that was for just 5GB. Consider how long a 5TB device would have taken... I'm guessing it's because I had to do a manual mark as synchronised.
Not riveting viewing, but hopefully this shows the problem. Having to carry out the manual step at 08:20 is the crux of it. The above sequence could easily have been triggered by a power outage with controlled shutdown and power-up via the UPS system.

Also, it demonstrates pretty slow initial replication, and re-replication after the manual marking as synchronised. My lab isn't very fast, but the disks & network can transfer 5GB in well under 3 minutes.

So I hope this shows that the cluster was in a clean state with empty caches before shutdown (nothing was ever written to the storage, to be accurate) and that automatic resync DID NOT OCCUR.

Cheers, Rob.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Fri Sep 26, 2014 9:44 pm

:)

That's fine. Let us do some homework here and I hope we'll keep everybody happy.
robnicholson wrote:PS. Apologies for teaching granny to suck eggs, but I just want this to work ;-) Also, I appreciate split-brain is a risk, but I think the above circumvents it.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

arinci
Posts: 6
Joined: Fri Aug 22, 2014 9:11 pm

Thu Oct 30, 2014 10:45 am

anton (staff) wrote::)

That's fine. Let us do some homework here and I hope we'll keep everybody happy.
robnicholson wrote:PS. Apologies for teaching granny to suck eggs, but I just want this to work ;-) Also, I appreciate split-brain is a risk, but I think the above circumvents it.
Hello everyone, I'm reopening this discussion to find out whether there is anything new on this topic: I'd like to know a bit more from StarWind regarding the next steps and, possibly, a release date for the new feature we talked about.
My goal is to build a system with shared storage that works in every condition without the need for manual intervention to synchronize the storage.

More specifically, I need to build an MS cluster with a shared PostgreSQL database that works:
- after a failure of one of the 2 servers (this is not a problem, it already works fine)
- after an unplanned shutdown/reboot.
For the second case, the system must also be able to work when only one of the 2 servers is restarted, including the worst case, where the only server that restarted successfully is the one that doesn't hold the last "image" of the shared storage.
:!: All of this without manual intervention :!:

I can develop some external logic that takes care of all the conditions, e.g. using the StarWind API, but I need specific functions to read/write the "sync state" on the VSAN. Unluckily, it seems to me that these functions are not present in the current build... any hope for the next release(s)?
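Just to show the shape of the external logic I mean. Every call below is hypothetical, since, as I said, these functions don't exist in the current build:

Code: Select all

# Entirely hypothetical API names; nothing below exists in StarWind today.

def ensure_storage_available(nodes, api):
    """Bring the HA storage up without manual intervention."""
    states = {n: api.get_sync_state(n) for n in nodes}       # hypothetical
    if any(s == "synchronized" for s in states.values()):
        return  # StarWind can finish the recovery on its own
    # Nobody is marked synchronized: promote the node with the newest data.
    newest = max(nodes, key=lambda n: api.get_last_update_time(n))  # hypothetical
    api.set_sync_state(newest, "synchronized")               # hypothetical
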

Thanks for your attention and patience! :wink:
Simon