Big DeSync - Full Cluster down

troy.j.parker · Sat Mar 28, 2020 1:40 pm

After a catastrophic failure, there was a massive desync.

This is a 2-node HA cluster on Windows Server 2016, with a 6TB StarWind vSAN.
It has been working relatively flawlessly up until this point.
The shared volume broke once the sync failed, as expected when the iSCSI connector goes offline.
I marked the node with the most recent data as synchronized, and then had everything resync.
The iSCSI connector came back up, but my Cluster Volume resource still reported that it was down and would not go online.

I verified that there was a sync status of 1 on both nodes. I tried to remove the volume resource in Cluster Manager and re-add it, but it reports that there are no available disks on either machine. Other threads mention removing a problematic node and re-adding it. I tried that, and now the node I removed will not rejoin the cluster, but that is not really important at this time.

I would be fine with it mounting to a single node so I can check the data consistency at this point.

Our entire HA infrastructure has been down for 2 days at this point, and I am at a loss for what to do next.

EDIT:

A little more info, as I have been trying to correct the issue:

Some developments.

Once the iSCSI connectors are in place, I can verify that the shared disk shows in Disk Management on both nodes. However, it mounts as 'RAW' instead of NTFS. Any interaction with the disk prompts a 'resource is busy' alert.

Event Viewer logs the following when the disk is mounted or interacted with:

Code:

 The system failed to flush data to the transaction log. Corruption may occur in VolumeId: E:, DeviceName: \Device\HarddiskVolume13.
 ({Device Busy}
 The device is currently busy.)

And

Code:

 A corruption was discovered in the file system structure on volume E:.
 The exact nature of the corruption is unknown.  The file system structures need to be scanned online.

A chkdsk on the volume yields the following results:

Code:

chkdsk /r E:
The type of the file system is NTFS.
Volume label is CSVOL1.

Stage 1: Examining basic file system structure ...
Deleting corrupt attribute record (0x80, "")
from file record segment 0x7F.
Deleting corrupt attribute record (0x80, "")
from file record segment 0xCD.
Deleting corrupt attribute record (0x80, "")
from file record segment 0xD4.
Deleting corrupt attribute record (0x80, "")
from file record segment 0xE1.
Deleting corrupt attribute record (0x80, "")
from file record segment 0x114.
Deleting corrupt attribute record (0x80, "")
from file record segment 0x18C.
Deleting corrupt attribute record (0x80, "")
from file record segment 0x18D.
Deleting corrupt attribute record (0x80, "")
from file record segment 0x193.
Deleting corrupt attribute record (0x80, "")
from file record segment 0x1A3.
Deleting corrupt attribute record (0x80, "")
from file record segment 0x1B4.
  512 file records processed.
File verification completed.
An unspecified error occurred (6e74667363686b2e 109f).
An unspecified error occurred (6e74667363686b2e 1583).
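For anyone reading a long chkdsk log like the one above, here is a minimal Python sketch that pulls out the affected file record segments and decodes the "unspecified error" fields. The first hex field is plain hex-encoded ASCII and decodes to "ntfschk."; what the second field means is not stated anywhere in this thread, so the sketch only converts it to an integer. The function name `summarize_chkdsk` and the shortened `sample` text are my own, for illustration only.

```python
import re

# Matches lines such as: from file record segment 0x7F.
SEGMENT_RE = re.compile(r"from file record segment (0x[0-9A-Fa-f]+)")
# Matches lines such as: An unspecified error occurred (6e74667363686b2e 109f).
ERROR_RE = re.compile(
    r"An unspecified error occurred \(((?:[0-9a-f]{2})+) ([0-9a-f]+)\)"
)


def summarize_chkdsk(output: str) -> dict:
    """Collect affected file record segments and decode the error fields."""
    segments = [int(value, 16) for value in SEGMENT_RE.findall(output)]
    errors = []
    for tag_hex, code_hex in ERROR_RE.findall(output):
        # First field: hex-encoded ASCII text.
        # Second field: meaning unknown here, kept as a plain integer.
        tag = bytes.fromhex(tag_hex).decode("ascii", errors="replace")
        errors.append((tag, int(code_hex, 16)))
    return {"segments": segments, "errors": errors}


# Shortened excerpt of the chkdsk output posted above.
sample = """\
Deleting corrupt attribute record (0x80, "")
from file record segment 0x7F.
Deleting corrupt attribute record (0x80, "")
from file record segment 0xCD.
An unspecified error occurred (6e74667363686b2e 109f).
An unspecified error occurred (6e74667363686b2e 1583).
"""

summary = summarize_chkdsk(sample)
print(summary["segments"])
print(summary["errors"])
```

This only summarizes the log text; it says nothing about whether the data on the volume is actually recoverable.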

troy.j.parker · Sat Mar 28, 2020 2:12 pm

A little update: I see the disk in Disk Management now, but it shows as RAW. I am pretty sure this is not how it should be?

Sat Mar 28, 2020 11:15 pm

Update for the community:
Important news first - the data is available now!
We found out that the problem is not related to StarWind. The disk became RAW because the cluster service had stopped and was unable to start while no DNS server was available. When the cluster service is not running, the operating system is not able to recognize the cluster file system and marks it as RAW. After the cluster reservation was removed from the disk, the data became available and consistent.
Some reconfiguration of the environment is needed to follow all the best practices. The highest priority is keeping the DC and DNS out of the cluster - https://knowledgebase.starwindsoftware. ... san-usage/
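Since the root cause here was name resolution, a quick pre-flight check along these lines may help before digging into storage. This is only a sketch under stated assumptions: the host names are hypothetical placeholders, `name_resolves` and `fake_resolver` are illustrative helpers, and the stub resolver just simulates an outage so the example runs without touching the network. It is not part of any StarWind or Microsoft tooling.

```python
import socket


def name_resolves(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved (for example, no DNS server).
        return False


def fake_resolver(hostname, port):
    """Stand-in resolver that simulates a partial DNS outage (hypothetical names)."""
    if hostname == "dc01.example.local":
        return [("stub-address",)]
    raise socket.gaierror("name not known")


# Hypothetical names; substitute the real cluster and domain controller names.
for name in ["cluster01.example.local", "dc01.example.local"]:
    status = "ok" if name_resolves(name, fake_resolver) else "NOT RESOLVING"
    print(f"{name}: {status}")
```

On a real node, call `name_resolves("your-cluster-name")` without the second argument so that it goes through the system resolver.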
StarWind support team is always ready to assist!