Slow recovery of big LSFS storage after downtime

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 5:01 pm

Following on from all the tests about automatic resync after node downtime, I stopped and restarted the StarWind service on one node. Automatic resync did indeed kick in, but it was taking a lot longer than the last time I tried this. However, this time my LSFS storage is much bigger (275GB):
[Screenshot: sshot-60.png]
Upon checking Resource Monitor, StarWind appears to be reading every single SPSPX file one by one:
[Screenshot: sshot-61.png]
I assume it's carrying out some kind of integrity check? This will take a reasonable amount of time in my lab, but in production with large LSFS disks (say 20TB), I assume this process could take a very long time - certainly hours. During this time HA won't be working, and one assumes read performance on that node is impacted for any other storage that isn't in this state.
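To put some rough numbers on that (my assumptions below - an illustrative read rate, not StarWind's figures):

# Back-of-the-envelope estimate (my assumptions: purely sequential reads
# at an illustrative 200MB/s - not StarWind's figures)
device_size_tb = 20
read_speed_mb_s = 200

total_mb = device_size_tb * 1024 * 1024
hours = total_mb / read_speed_mb_s / 3600
print(f"Full read of {device_size_tb}TB at {read_speed_mb_s}MB/s: {hours:.1f} hours")
# -> roughly 29 hours just to read every SPSPX file once

So even at a healthy sequential read rate, a 20TB device could sit in this state for more than a day.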

I think this (a) needs documenting so people don't panic, and (b) the status message "Device status: Creating" could be phrased better, e.g. "Checking integrity".

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 5:17 pm

More information on the unusual activity during automatic synchronisation. I was watching Resource Monitor carefully as the "Creating" state went through all of the SPSPX files and re-read them. After it had done this, it went into "fast synchronisation" and has been there for about the last five minutes.

Why is it doing this? The nodes were in perfect sync before and all I did was this:
  1. No iSCSI initiators have been connected to the targets for the last hour, so *no* new data has been written to the nodes
  2. Stopped StarWind service on node #1
  3. A minute or so later, restarted the service
  4. Auto-resync started and node #1 re-read all the SPSPX files
  5. After that finished, fast synchronisation kicked in and is busy creating lots of new 5GB SPSPX files
It shouldn't be doing this. NOTHING accessed node #2 whilst node #1 was down - no targets were connected. It was totally idle, and everything will have been flushed ages ago. It's not the write-back cache, as that's only 128MB in my lab.
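My mental model of how fast sync ought to work is the standard dirty-bitmap technique - this is purely my own sketch, not StarWind's actual implementation:

# A sketch of a generic dirty-bitmap fast sync (my own illustration, not
# StarWind's actual code). Each bit marks a region that was written while
# the partner node was down.
BLOCK = 64 * 1024 * 1024                        # assumed 64MB granularity

def fast_sync(dirty_bitmap, read_block, send_block):
    """Ship only the regions flagged dirty while the partner was offline."""
    for block_no, dirty in enumerate(dirty_bitmap):
        if dirty:
            send_block(block_no, read_block(block_no))

dirty = [0] * (275 * 1024**3 // BLOCK)          # 275GB device, nothing written
fast_sync(dirty, read_block=lambda n: b"", send_block=lambda n, d: None)
# All zeros -> nothing crosses the wire

If the bitmap is all zeros, nothing should cross the wire - which is exactly why the new 5GB files being created make no sense to me.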

Here's something else that doesn't make sense. During fast sync, node #1 (the node that was down) is merrily writing new SPSPX files numbered in the high 50s and above. That's kind of expected: it's busy writing something it really shouldn't be, so new blocks get appended to the end.

However, node #2 (the node that was always up) is reading from the low-numbered SPSPX files, e.g. 4, 5, 6, etc. Why is it doing this? Those files were created ages ago. All I did was stop and start the service on an HA device that was idle.
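As I understand log-structured designs in general (my own sketch below, not the actual LSFS internals), writes only ever land in a fresh segment at the head of the log - which would explain the high-numbered files on node #1, but makes the reads of the old, low-numbered files on node #2 even odder:

# Generic log-structured append (my illustration, not the real LSFS
# internals). Writes fill the newest segment and then allocate the next
# number; old segments are never rewritten.
SEGMENT_SIZE = 5 * 1024**3                      # 5GB, matching the SPSPX files

class SegmentLog:
    def __init__(self):
        self.segments = []                      # list of (name, bytes_used)

    def append(self, nbytes):
        if not self.segments or self.segments[-1][1] + nbytes > SEGMENT_SIZE:
            self.segments.append((f"{len(self.segments) + 1:04d}.spspx", 0))
        name, used = self.segments[-1]
        self.segments[-1] = (name, used + nbytes)
        return name                             # 0004, 0005, ... stay untouched

log = SegmentLog()
print(log.append(1024))                         # -> 0001.spspx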

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 5:22 pm

PS. I typed that entire last post whilst waiting for "fast" synchronisation to complete. It has only got to 13% after 15+ minutes. Something is definitely amiss here - all caches will have been flushed and replication was in sync. On restarting the service, zero bytes should have been out of sync, yet so far an extra 40GB has been written to node #1. All I did was restart the StarWind service! :?
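Extrapolating from those figures (rough arithmetic on my own numbers, nothing official):

# Rough extrapolation from the figures above (my numbers, nothing official)
percent_done, minutes_elapsed = 13, 15
total_minutes = minutes_elapsed / (percent_done / 100)
print(f"Estimated full 'fast' sync: {total_minutes:.0f} minutes")       # ~115

written_gb, device_gb = 40, 275
print(f"Already rewritten: {written_gb / device_gb:.0%} of the device") # ~15%

So at this rate the whole "fast" sync will take nearly two hours, and roughly 15% of the device has already been rewritten for no apparent reason.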

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Sep 29, 2014 9:23 pm

LSFS performance is one of the things that will be improved with the next minor update (ETA 1-2 weeks).
Thanks for the feedback, though.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com