Slow recovery of big LSFS storage after downtime

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 5:01 pm

Following on from all the tests about automatic resync after node downtime, I stopped and restarted the StarWind service on one node. Automatic resync did indeed kick in, but it was taking a lot longer than the last time I tried this. However, this time my LSFS storage is much bigger (275GB):
[Screenshot: sshot-60.png]
Upon checking Resource Monitor, StarWind appears to be reading every single SPSPX file one by one:
[Screenshot: sshot-61.png]
I assume it's carrying out some kind of integrity check? This will take a reasonable amount of time in my lab, but in production with large LSFS disks (say 20TB), I assume this process could take a very long time - certainly hours. During this time HA won't be working, and one assumes read performance on that node is impacted for any other storage that isn't in this state.
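To put some rough numbers on that (my assumptions below - an illustrative read rate, not StarWind's figures):

# Back-of-the-envelope estimate (my assumptions: purely sequential reads
# at an illustrative 200MB/s - not StarWind's figures)
device_size_tb = 20
read_speed_mb_s = 200

total_mb = device_size_tb * 1024 * 1024
hours = total_mb / read_speed_mb_s / 3600
print(f"Full read of {device_size_tb}TB at {read_speed_mb_s}MB/s: {hours:.1f} hours")
# -> roughly 29 hours just to read every SPSPX file once

So even at a healthy sequential read rate, a 20TB device could sit in this state for more than a day.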

I think this (a) needs documenting so people don't panic, and (b) the status message "Device status: Creating" could be phrased better, e.g. "Checking integrity".

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 5:17 pm

More information on the unusual activity during automatic synchronisation. I was watching Resource Monitor carefully as the "Creating" state went through all of the SPSPX files and re-read them. After it had done this, it went into "fast synchronisation" and has been there for about the last five minutes.

Why is it doing this? The nodes were in perfect sync before and all I did was this:
  1. No iSCSI initiators have been connected to the targets for the last hour, so *no* new data has been written to the nodes
  2. Stopped StarWind service on node #1
  3. A minute or so later, restarted the service
  4. Auto-resync started and node #1 re-read all the SPSPX files
  5. After that finished, fast synchronisation kicked in and is busy creating lots of new 5GB SPSPX files
It shouldn't be doing this. NOTHING accessed node #2 whilst node #1 was down - no targets were connected. It was totally idle, and everything will have been flushed ages ago. It's not the write-back cache, as that's only 128MB in my lab.
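My mental model of how fast sync ought to work is the standard dirty-bitmap technique - this is purely my own sketch, not StarWind's actual implementation:

# A sketch of a generic dirty-bitmap fast sync (my own illustration, not
# StarWind's actual code). Each bit marks a region that was written while
# the partner node was down.
BLOCK = 64 * 1024 * 1024                        # assumed 64MB granularity

def fast_sync(dirty_bitmap, read_block, send_block):
    """Ship only the regions flagged dirty while the partner was offline."""
    for block_no, dirty in enumerate(dirty_bitmap):
        if dirty:
            send_block(block_no, read_block(block_no))

dirty = [0] * (275 * 1024**3 // BLOCK)          # 275GB device, nothing written
fast_sync(dirty, read_block=lambda n: b"", send_block=lambda n, d: None)
# All zeros -> nothing crosses the wire

If the bitmap is all zeros, nothing should cross the wire - which is exactly why the new 5GB files being created make no sense to me.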

Here's something else that doesn't make sense. During fast sync, node #1 (the node that was down) is merrily writing new SPSPX files numbered in the high 50s and above. That's kind of expected: it's busy writing something it really shouldn't be, so new blocks get appended to the end.

However, node #2 (the node that was always up) is reading from the low-numbered SPSPX files, e.g. 4, 5, 6, etc. Why is it doing this? Those files were created ages ago. All I did was stop and start the service on an HA device that was idle.
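As I understand log-structured designs in general (my own sketch below, not the actual LSFS internals), writes only ever land in a fresh segment at the head of the log - which would explain the high-numbered files on node #1, but makes the reads of the old, low-numbered files on node #2 even odder:

# Generic log-structured append (my illustration, not the real LSFS
# internals). Writes fill the newest segment and then allocate the next
# number; old segments are never rewritten.
SEGMENT_SIZE = 5 * 1024**3                      # 5GB, matching the SPSPX files

class SegmentLog:
    def __init__(self):
        self.segments = []                      # list of (name, bytes_used)

    def append(self, nbytes):
        if not self.segments or self.segments[-1][1] + nbytes > SEGMENT_SIZE:
            self.segments.append((f"{len(self.segments) + 1:04d}.spspx", 0))
        name, used = self.segments[-1]
        self.segments[-1] = (name, used + nbytes)
        return name                             # 0004, 0005, ... stay untouched

log = SegmentLog()
print(log.append(1024))                         # -> 0001.spspx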

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 5:22 pm

PS. I typed that entire last post whilst waiting for "fast" synchronisation to complete. It has only got to 13% after 15+ minutes. Something is definitely amiss here - all caches will have been flushed and replication was in sync. On restarting the service, zero bytes should have been out of sync, yet so far an extra 40GB has been written to node #1. All I did was restart the StarWind service! :?
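Extrapolating from those figures (rough arithmetic on my own numbers, nothing official):

# Rough extrapolation from the figures above (my numbers, nothing official)
percent_done, minutes_elapsed = 13, 15
total_minutes = minutes_elapsed / (percent_done / 100)
print(f"Estimated full 'fast' sync: {total_minutes:.0f} minutes")       # ~115

written_gb, device_gb = 40, 275
print(f"Already rewritten: {written_gb / device_gb:.0%} of the device") # ~15%

So at this rate the whole "fast" sync will take nearly two hours, and roughly 15% of the device has already been rewritten for no apparent reason.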

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Sep 29, 2014 9:23 pm

LSFS performance is one of the things that will be improved with the next minor update (ETA 1-2 weeks).
Thanks for the feedback, though.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com