Different number of LSFS files on the two nodes

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Fri Sep 19, 2014 12:36 pm

See the attached screenshot from the lab. That doesn't look right to me: different numbers of storage1*.spspx files on the two nodes, even though these two nodes are in sync.

Cheers, Rob.
Attachments
sshot-50.png (81.87 KiB)
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Fri Sep 19, 2014 10:52 pm

The actual content on the nodes should be the same; the logical mapping can differ because the space optimization process may kick in and run at a different priority on each node. We'll double-check, but as long as you don't see any data integrity issues, please ignore it. We'll put it under "known 'issues', please ignore". Thank you!
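
To illustrate the idea, here is a toy model (my own sketch, not actual LSFS code): both stores receive exactly the same writes, the optimization pass happens to run on only one of them, and the logical content still reads back identically while the segment layout diverges.

# Toy append-only segment store - an illustration only, not StarWind's LSFS.
class SegmentStore:
    def __init__(self, blocks_per_segment=4):
        self.blocks_per_segment = blocks_per_segment
        self.segments = [[]]     # list of segment "files", each a list of (lba, data)
        self.index = {}          # lba -> (segment_no, slot) of the live copy

    def write(self, lba, data):
        if len(self.segments[-1]) == self.blocks_per_segment:
            self.segments.append([])
        self.segments[-1].append((lba, data))
        self.index[lba] = (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def read(self, lba):
        seg_no, slot = self.index[lba]
        return self.segments[seg_no][slot][1]

    def compact(self):
        # Space optimization: rewrite only the live blocks into fresh segments.
        live = [(lba, self.read(lba)) for lba in sorted(self.index)]
        self.segments, self.index = [[]], {}
        for lba, data in live:
            self.write(lba, data)

node1, node2 = SegmentStore(), SegmentStore()
for lba in range(8):                       # initial data, mirrored to both nodes
    node1.write(lba, "v1-%d" % lba)
    node2.write(lba, "v1-%d" % lba)
for lba in range(4):                       # overwrites land as new log entries
    node1.write(lba, "v2-%d" % lba)
    node2.write(lba, "v2-%d" % lba)

node2.compact()                            # optimization happens to run on node 2 only

print(all(node1.read(lba) == node2.read(lba) for lba in range(8)))   # True: same logical content
print(len(node1.segments), len(node2.segments))                      # 3 vs 2 segment "files"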
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 1:50 pm

I won't be able to follow this up, I'm afraid, as I had to trash that storage as part of the other post about recovery after a clean shut-down.

Cheers, Rob.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Sep 22, 2014 2:44 pm

NP

If it's a real issue, it will come back soon.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 4:21 pm

I've managed to recreate this, but I'm not sure whether I can repeat it, as I've been doing a lot of messing around this afternoon upgrading StarWind versions in the lab. But it was something along the lines of this:
  • Two HA nodes and one iSCSI initiator connected via two MPIO iSCSI channels (so four iSCSI initiator connections in total)
  • Started a large copy on the server to simulate load on the HA set-up during an upgrade (which is maybe dangerous, but this is all lab work) - this was copying lots of files from our primary file server
  • Upgraded node #1 - during this time, everything was going to node #2
  • Node #1 re-synchronised itself after upgrade
  • Stopped the server, edited iScsiDiscoveryListInterfaces (see other post), restarted service
  • Node #1 re-synchronised again
  • Repeated this with node #2
So basically, I ended up with different numbers of spspx files on each node, probably because one of the nodes was down while there were a lot of disk writes going on to the other node?
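
A quick way to put numbers on it is to inventory the image files on both nodes. Something along these lines works from any machine that can reach the admin shares (the UNC paths below are just placeholders for wherever the device images live in my lab):

# Compare the LSFS image-file inventory on two nodes (paths are placeholders).
import glob, os

NODES = {
    "node1": r"\\sw-node1\d$\Storage\storage1*.spspx",
    "node2": r"\\sw-node2\d$\Storage\storage1*.spspx",
}

for node, pattern in NODES.items():
    files = sorted(glob.glob(pattern))
    total = sum(os.path.getsize(f) for f in files)
    print("%s: %d files, %.1f GiB total" % (node, len(files), total / 2**30))
    for f in files:
        print("    %-30s %5.1f GiB" % (os.path.basename(f), os.path.getsize(f) / 2**30))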

I assume having different numbers of spspx files is bad news?

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 4:24 pm

Oooh - chkdsk in Windows Server 2012 R2 has changed and now shows an ETA - neat!
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 11:12 am

I'll check this again, but I'm pretty sure this is occurring in "normal" operation. I re-created the storage yesterday and copied a few GB of files via my test server, and there are different numbers of spspx files on each server.

All I think I did this time was shut down the test server and then the two nodes so I could snapshot the set-up in VMware Workstation.

Later...
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 3:10 pm

Okay, this is repeatable...
  1. Two HA nodes running
  2. Windows server connected to LSFS storage via iSCSI
  3. Clean shutdown Windows server and check iSCSI disconnected from both nodes
  4. Shut down node #2
  5. Shut down node #1
  6. Power-up both nodes
  7. Storage is offline on both nodes and doesn't automatically resync (linked with other post)
  8. Before doing anything, both nodes have 4 x 5GB SPSPX files
  9. Note: Windows server not brought up so nothing is connecting to the storage
  10. Mark node #1 as synchronised
  11. Node #2 starts a fast synchronisation from node #1
  12. After finishing, node #2 has 6 x 5GB SPSPX files but node #1 still only has 4
Some comments:
  1. Nothing has been written (or read) to either node via iSCSI in-between powering down and powering up
  2. Therefore there should be nothing to synchronise
  3. However, the sync in step 11 above takes about 5 minutes in the lab suggesting *something* is synchronising
So I suspect something *is* getting written to node #2 during the re-sync, which, because of the nature of LSFS (never overwrite), possibly explains the extra 2 x 5GB added to node #2. However, two observations: firstly, the nodes were in perfect sync before the power down/power up, so technically nothing actually needs synchronising. Secondly, even if there was something to re-synchronise, 2 x 5GB seems way too big considering there was only 4 x 5GB to begin with.
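
If that suspicion is right, the numbers roughly line up. A bit of back-of-the-envelope arithmetic (my assumption about how an append-only store would absorb a full sync, not a statement of StarWind's actual algorithm):

# Toy arithmetic: a full sync lands on the partner as *new* log segments
# instead of overwriting in place (assumed behaviour, for illustration only).
SEGMENT_GB = 5

node1_segments = 4                    # source keeps its existing layout: 4 x 5GB
synced_live_gb = 10                   # suppose the sync re-sends ~10GB of live data

node2_existing = 4                                      # what node #2 already had
node2_new = -(-synced_live_gb // SEGMENT_GB)            # ceil(10 / 5) = 2 new segments
print("node1:", node1_segments, "x", SEGMENT_GB, "GB files")              # 4 x 5GB
print("node2:", node2_existing + node2_new, "x", SEGMENT_GB, "GB files")  # 6 x 5GB until defrag reclaims the stale ones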

So three potential problems here:
  1. The nodes were in perfect sync before and therefore nothing needed to sync but something has synchronised
  2. Even if there was a little bit to sync, the amount synchronised/added to LSFS is far more than it should be (10GB extra on an original size of 20GB - that's 50%)
  3. We've ended up with an uneven number of SPSPX files on each node, which doesn't feel right
Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 3:13 pm

NOTE: the above is with build 6884. I'm going to upgrade to 7185, re-create storage and try repeating.

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 6:06 pm

Another observation: upgraded node #1 to the latest build, and obviously node #1 was offline during the upgrade. Post upgrade, node #1 is resynchronising. Why? There is nothing to resynchronise - the two nodes were in sync before the upgrade and there are no targets currently connected.

Cheers, Rob.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Sep 24, 2014 8:19 am

robnicholson wrote: Another observation: upgraded node #1 to the latest build, and obviously node #1 was offline during the upgrade. Post upgrade, node #1 is resynchronising. Why? There is nothing to resynchronise - the two nodes were in sync before the upgrade and there are no targets currently connected.

Cheers, Rob.

They need to check content integrity before allowing you to export it.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 8:43 am

anton (staff) wrote: They need to check content integrity before allowing you to export it.
Possibly, as the sync is faster. After a controlled power cycle, the nodes do not come back up and you have to manually mark one of the nodes as synchronised (see other thread), and when you do this, you end up with a different number of LSFS chunks on each node (I can demonstrate this remotely if needed). However, if you just shut down the service and then restart the service, it synchronises automatically (correct), doesn't take long and doesn't end up with a different number of LSFS chunks.

So I'm perplexed why:

Power-down server, power-up server: manual re-sync is needed
Shut-down service, start service: automatic sync kicks in

There is practically no difference between the two, as powering down stops the service first and powering up starts it again.
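
As an aside, if it would help, the "actual content is the same" claim can be checked from the initiator by hashing the raw iSCSI disk while logged in to one node at a time and comparing the two digests. Rough sketch (needs an elevated prompt; \\.\PhysicalDrive2 is just a placeholder for whatever disk number the StarWind LUN gets, and SIZE_GB should match the LUN size):

# Hash the raw iSCSI disk as seen by the initiator. Run once while connected to
# node #1 only, once while connected to node #2 only, and compare the digests.
import hashlib

DEVICE = r"\\.\PhysicalDrive2"        # placeholder: check Disk Management for the real number
SIZE_GB = 20                          # stop at the device end instead of reading past it
CHUNK = 1024 * 1024                   # 1 MiB reads keep things sector-aligned

h = hashlib.sha256()
remaining = SIZE_GB * 2**30
with open(DEVICE, "rb", buffering=0) as disk:
    while remaining > 0:
        block = disk.read(min(CHUNK, remaining))
        if not block:
            break
        h.update(block)
        remaining -= len(block)
print(DEVICE, h.hexdigest())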

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Sep 24, 2014 1:44 pm

The root question (different number of files on different nodes) is expected behaviour. Example: if only one node has been active for some time, it got some reads and writes, but nothing was deleted. Sync only replicates the actual data, not every byte on the storage. So do not pay attention to it. Everything works just as planned.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com