Different number of LSFS files on the two nodes

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Fri Sep 19, 2014 12:36 pm

See the attached screenshot from the lab. That doesn't look right to me: different numbers of storage1*.spspx files on the two nodes, even though these two nodes are in sync.

Cheers, Rob.
Attachments
sshot-50.png (81.87 KiB)
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Fri Sep 19, 2014 10:52 pm

The actual content on the nodes should be the same; the logical mapping can differ because the space optimization process may kick in and run at a different priority on each node. We'll double-check, but as long as you don't see any data integrity issues, please ignore it. We'll put it under "known 'issues', please ignore". Thank you!
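
To illustrate the idea, here is a toy model (my own sketch, not actual LSFS code): both stores receive exactly the same writes, the optimization pass happens to run on only one of them, and the logical content still reads back identically while the segment layout diverges.

# Toy append-only segment store - an illustration only, not StarWind's LSFS.
class SegmentStore:
    def __init__(self, blocks_per_segment=4):
        self.blocks_per_segment = blocks_per_segment
        self.segments = [[]]     # list of segment "files", each a list of (lba, data)
        self.index = {}          # lba -> (segment_no, slot) of the live copy

    def write(self, lba, data):
        if len(self.segments[-1]) == self.blocks_per_segment:
            self.segments.append([])
        self.segments[-1].append((lba, data))
        self.index[lba] = (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def read(self, lba):
        seg_no, slot = self.index[lba]
        return self.segments[seg_no][slot][1]

    def compact(self):
        # Space optimization: rewrite only the live blocks into fresh segments.
        live = [(lba, self.read(lba)) for lba in sorted(self.index)]
        self.segments, self.index = [[]], {}
        for lba, data in live:
            self.write(lba, data)

node1, node2 = SegmentStore(), SegmentStore()
for lba in range(8):                       # initial data, mirrored to both nodes
    node1.write(lba, "v1-%d" % lba)
    node2.write(lba, "v1-%d" % lba)
for lba in range(4):                       # overwrites land as new log entries
    node1.write(lba, "v2-%d" % lba)
    node2.write(lba, "v2-%d" % lba)

node2.compact()                            # optimization happens to run on node 2 only

print(all(node1.read(lba) == node2.read(lba) for lba in range(8)))   # True: same logical content
print(len(node1.segments), len(node2.segments))                      # 3 vs 2 segment "files"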
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 1:50 pm

I won't be able to follow this up, I'm afraid, as I had to trash that storage as part of the other post about recovery after a clean shut-down.

Cheers, Rob.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Sep 22, 2014 2:44 pm

NP

If it's a real issue, it will come back soon.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 4:21 pm

I've managed to recreate this, but I'm not sure whether I can repeat it, as I've been doing a lot of messing around this afternoon upgrading StarWind versions in the lab. But it was something along the lines of this:
  • Two HA nodes and one iSCSI initiator connected via two MPIO iSCSI channels (so four iSCSI initiator connections in total)
  • Started a large copy on the server to simulate load on the HA set-up during an upgrade (which is maybe dangerous, but this is all lab work) - this was copying lots of files from our primary file server
  • Upgraded node #1 - during this time, everything was going to node #2
  • Node #1 re-synchronised itself after upgrade
  • Stopped the server, edited iScsiDiscoveryListInterfaces (see other post), restarted service
  • Node #1 re-synchronised again
  • Repeated this with node #2
So basically, I ended up with different numbers of spspx files on each node, probably because one of the nodes was down while there were a lot of disk writes going on to the other node?
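
A quick way to put numbers on it is to inventory the image files on both nodes. Something along these lines works from any machine that can reach the admin shares (the UNC paths below are just placeholders for wherever the device images live in my lab):

# Compare the LSFS image-file inventory on two nodes (paths are placeholders).
import glob, os

NODES = {
    "node1": r"\\sw-node1\d$\Storage\storage1*.spspx",
    "node2": r"\\sw-node2\d$\Storage\storage1*.spspx",
}

for node, pattern in NODES.items():
    files = sorted(glob.glob(pattern))
    total = sum(os.path.getsize(f) for f in files)
    print("%s: %d files, %.1f GiB total" % (node, len(files), total / 2**30))
    for f in files:
        print("    %-30s %5.1f GiB" % (os.path.basename(f), os.path.getsize(f) / 2**30))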

I assume having different numbers of spspx files is bad news?

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Sep 22, 2014 4:24 pm

Oooh - chkdsk in Windows Server 2012 R2 has changed and now shows an ETA - neat!
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 11:12 am

I'll check this again, but I'm pretty sure this is occurring in "normal" operation. I re-created the storage yesterday and copied a few GB of files via my test server, and there are different numbers of spspx files on each server.

All I think I did this time was shut down the test server and then the two nodes so I could snapshot the set-up in VMware Workstation.

Later...
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 3:10 pm

Okay, this is repeatable...
  1. Two HA nodes running
  2. Windows server connected to LSFS storage via iSCSI
  3. Clean shutdown Windows server and check iSCSI disconnected from both nodes
  4. Shut down node #2
  5. Shut down node #1
  6. Power-up both nodes
  7. Storage is offline on both nodes and doesn't automatically resync (linked with other post)
  8. Before doing anything, both nodes have 4 x 5GB SPSPX files
  9. Note: Windows server not brought up so nothing is connecting to the storage
  10. Mark node #1 as synchronised
  11. Node #2 starts a fast synchronisation from node #1
  12. After finishing, node #2 has 6 x 5GB SPSPX files but node #1 still only has 4
Some comments:
  1. Nothing has been written (or read) to either node via iSCSI in-between powering down and powering up
  2. Therefore there should be nothing to synchronise
  3. However, the sync in step 11 above takes about 5 minutes in the lab suggesting *something* is synchronising
So I suspect something *is* getting written to node #2 during the re-sync, which, because of the nature of LSFS (never overwrite), possibly explains the extra 2 x 5GB added to node #2. However, two observations: firstly, the nodes were in perfect sync before the power down/power up, so technically nothing actually needs synchronising. Secondly, even if there was something to re-synchronise, 2 x 5GB seems way too big considering there was only 4 x 5GB to begin with.
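
If that suspicion is right, the numbers roughly line up. A bit of back-of-the-envelope arithmetic (my assumption about how an append-only store would absorb a full sync, not a statement of StarWind's actual algorithm):

# Toy arithmetic: a full sync lands on the partner as *new* log segments
# instead of overwriting in place (assumed behaviour, for illustration only).
SEGMENT_GB = 5

node1_segments = 4                    # source keeps its existing layout: 4 x 5GB
synced_live_gb = 10                   # suppose the sync re-sends ~10GB of live data

node2_existing = 4                                      # what node #2 already had
node2_new = -(-synced_live_gb // SEGMENT_GB)            # ceil(10 / 5) = 2 new segments
print("node1:", node1_segments, "x", SEGMENT_GB, "GB files")              # 4 x 5GB
print("node2:", node2_existing + node2_new, "x", SEGMENT_GB, "GB files")  # 6 x 5GB until defrag reclaims the stale ones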

So three potential problems here:
  1. The nodes were in perfect sync before and therefore nothing needed to sync but something has synchronised
  2. Even if there was a little bit to sync, the amount synchronised/added to LSFS is far more than it should be (10GB extra on an original size of 20GB - that's 50%)
  3. We've ended up with an uneven number of SPSPX files on each node, which doesn't feel right
Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 3:13 pm

NOTE: the above is with build 6884. I'm going to upgrade to 7185, re-create storage and try repeating.

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Sep 23, 2014 6:06 pm

Another observation: upgraded node #1 to the latest build, and obviously node #1 was offline during the upgrade. Post upgrade, node #1 is resynchronising. Why? There is nothing to resynchronise - the two nodes were in sync before the upgrade and there are no targets currently connected.

Cheers, Rob.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Sep 24, 2014 8:19 am

robnicholson wrote: Another observation: upgraded node #1 to the latest build, and obviously node #1 was offline during the upgrade. Post upgrade, node #1 is resynchronising. Why? There is nothing to resynchronise - the two nodes were in sync before the upgrade and there are no targets currently connected.

Cheers, Rob.

They need to check content integrity before allowing you to export it.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Sep 24, 2014 8:43 am

anton (staff) wrote: They need to check content integrity before allowing you to export it.
Possibly, as the sync is faster. After a controlled power cycle, the nodes do not come back up and you have to manually mark one of the nodes as synchronised (see other thread), and when you do this, you end up with a different number of LSFS chunks on each node (I can demonstrate this remotely if needed). However, if you just shut down the service and then restart the service, it synchronises automatically (correct), doesn't take long and doesn't end up with a different number of LSFS chunks.

So I'm perplexed why:

Power-down server, power-up server: manual re-sync is needed
Shut-down service, start service: automatic sync kicks in

There is practically no difference between the two, as powering down stops the service first and powering up starts it again.
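
As an aside, if it would help, the "actual content is the same" claim can be checked from the initiator by hashing the raw iSCSI disk while logged in to one node at a time and comparing the two digests. Rough sketch (needs an elevated prompt; \\.\PhysicalDrive2 is just a placeholder for whatever disk number the StarWind LUN gets, and SIZE_GB should match the LUN size):

# Hash the raw iSCSI disk as seen by the initiator. Run once while connected to
# node #1 only, once while connected to node #2 only, and compare the digests.
import hashlib

DEVICE = r"\\.\PhysicalDrive2"        # placeholder: check Disk Management for the real number
SIZE_GB = 20                          # stop at the device end instead of reading past it
CHUNK = 1024 * 1024                   # 1 MiB reads keep things sector-aligned

h = hashlib.sha256()
remaining = SIZE_GB * 2**30
with open(DEVICE, "rb", buffering=0) as disk:
    while remaining > 0:
        block = disk.read(min(CHUNK, remaining))
        if not block:
            break
        h.update(block)
        remaining -= len(block)
print(DEVICE, h.hexdigest())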

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Sep 24, 2014 1:44 pm

The root question (different number of files on different nodes) is expected behaviour. Example: if only one node has been active for some time, it got some reads and writes, but nothing was deleted. Sync only replicates the actual data, not every byte on the storage. So do not pay attention to it. Everything works just as planned.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com