Cluster Storage offline, can't bring online

Mareo · Fri Jun 28, 2024 1:02 pm

Greetings,

This is my first time posting here.

We've encountered an issue twice now where the Starwind services could not be brought back online. The service falls out of sync, preventing me from bringing the VM, CSV, and Witness online.

I've seen similar issues discussed here multiple times, but despite following the steps closely, I couldn't resolve the problem. I think I need specific guidance for my case. Could you help me interpret the logs? If needed, I can upload them (which I believe will be necessary).

The problem starts when we connect to the servers and find that they were shut down. Upon launching Failover Cluster Manager, both servers appear disconnected from the cluster. Consequently, the VM goes offline along with the storage resources and said nodes. Attempting to bring the VM online fails due to a storage error. Both the CSV and the Witness are offline, and efforts to bring them online result in errors indicating they are not connected to the node.

However, both nodes are online.

Furthermore, when enabling Starwind and connecting to the servers, both show as "not synchronized." Running PowerShell scripts also fails, with errors such as "200 Failed: can't find partner for sync..."

Additionally, the iSCSI initiators are stuck in a reconnecting state.

Note: We are using the free version of Starwind. The operating system is Windows Server 2022 Standard, and we are using HyperV.

I'm available to provide any additional information you might need.

Thank you in advance, and I look forward to your assistance.

Kind regards,
Mario

Fri Jun 28, 2024 11:18 pm

Thanks for your email. Did you have a power outage? Do your devices have a write-back cache on them?

Additionally, the iSCSI initiators are stuck in a reconnecting state.

Are devices from both sides stuck in reconnecting?
Check this post out viewtopic.php?f=5&t=6779&p=36805&hilit= ... %3E#p36805.

There might be another problem: The domain controller is in the cluster. As a result, the cluster cannot start because DCs are down and DCs do not start as the cluster is down. See more on DC placement https://knowledgebase.starwindsoftware. ... san-usage/.
Force-start the cluster as described here https://learn.microsoft.com/en-us/sql/s ... rver-ver16.

Mareo · Mon Jul 01, 2024 8:29 am

Hello,

thank you for the reply.

Kindly read the EDIT first.

I will try these as soon as possible.

However, I do have an additional question. If we remain on only one node, can I recover the VM from HyperV? Currently we are on only one node and we DO NOT have access to the Cluster to retrieve the VM. The VM is in an offline state and we can't access the Cluster folder, since there is currently no cluster with only one node. Is that affected by Starwind, or is there no correlation between the two and I just need to find a way to get the VM back?

Thanks in advance and kind regards,
Mario

EDIT: Please, disregard this post. I mixed up the nodes.

Mon Jul 01, 2024 8:57 am

Yes, you can continue running with only one node marked as synchronized. Be careful and check your data before resuming the other node.
Also, consider removing the cache once things get back to normal. And, last but not least, move DCs out of the cluster and cluster shared volumes.
Please note that the failover cluster is a non-StarWind component. I cannot predict whether it can be force-started.

Mareo · Mon Jul 01, 2024 10:23 am

I just saw your reply. I edited the post you replied to. Regardless, thank you for your assistance.

Furthermore, I forgot to mention that we do not have the DCs in the Cluster. They are standalone VMs. Does this change the situation?

Kind regards,
Mario

Mon Jul 01, 2024 11:08 am

Thanks for your update.
StarWind VSAN does not need a DC for its operation. Mark one StarWind node as synchronized manually and try bringing up the cluster. Once the cluster starts and StarWind devices are marked as synchronized on one node, you should be able to access the CSVs.

Mareo · Mon Jul 01, 2024 12:26 pm

Referring to a post you quoted regarding the sync status:

yaroslav (staff) wrote: ↑
Sun May 26, 2024 1:52 pm
Try marking sync_status as 1 only for the first occurrence. Set the later ones to 0.
Make sure to edit the files while StarWindService is stopped.
Last but not least, remove write back cache: should make such incidents less frequent.

I'm kinda lost: in which file do I need to change the sync status? The HA file on the first node is for the second node (target) and vice versa. Does the sync status refer to the target node above it? If I want to set sync status 1 for the second node, in which HA file do I edit the sync status?

Mon Jul 01, 2024 10:12 pm

Hi,

The first one is for the local device. Every other occurrence is for the respective partner.
You need to first mark one node as synchronized and check the data consistency. If the data is OK, make sure to mark the device with the respective target name in the config files with <sync_status>1</sync_status>.
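
To illustrate the "first occurrence is local, later ones are partners" rule, here is a minimal Python sketch. It works on a text string only; the sample fragment is invented for illustration and the real layout of an HA config file may well differ, so treat the surrounding tags as assumptions. It follows the advice quoted earlier in the thread (first sync_status to 1, later ones to 0), and any real edit should still be made with StarWindService stopped.

```python
import re

# Matches a <sync_status>N</sync_status> element; only this tag name comes
# from the thread, everything around it in a real file is an assumption.
SYNC_TAG = re.compile(r"<sync_status>\s*\d+\s*</sync_status>")


def mark_local_synchronized(config_text: str) -> str:
    """Set the first <sync_status> (local device) to 1 and later ones (partners) to 0."""
    seen = 0

    def replace(match: re.Match) -> str:
        nonlocal seen
        seen += 1
        value = 1 if seen == 1 else 0
        return f"<sync_status>{value}</sync_status>"

    return SYNC_TAG.sub(replace, config_text)


# Hypothetical fragment, NOT a real config file.
sample = (
    "<device>\n"
    "  <sync_status>0</sync_status>\n"
    "  <partner>\n"
    "    <sync_status>1</sync_status>\n"
    "  </partner>\n"
    "</device>\n"
)

result = mark_local_synchronized(sample)
print(result)
```

The function only rewrites the tag values and leaves the rest of the text untouched, so it can be checked against a copy of the file before anything is changed on a node.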

Mareo · Thu Jul 04, 2024 6:10 am

Hi,

firstly, I would like to thank you for your support.

Secondly, I got it working! I followed the links and instructions you posted, edited the HA images with steps in between like stopping/starting services, then I used the GetHASyncStatus file. It took me a few tries, but it finally synced.

The resources are back online and the VM is up and running as well.

However, can you kindly confirm whether the same would work for a similar issue where everything is online? The resources and the VM are online and everything is working. Only the HA images in Starwind are showing an unsynchronized status. I take it I won't need to edit the HA images? Maybe just run GetHASyncStatus?

Thanks in advance and kind regards,
Mario

EDIT: Sorry, I mixed it up. In the second issue, one node shows the CSV and witness images in an error state ("Header not found"), while the first node shows the second node as down/unsynchronized.

Thu Jul 04, 2024 8:32 am

Secondly, I got it working! I followed the links and instructions you posted, edited the HA images with steps in between like stopping/starting services, then I used the GetHASyncStatus file. It took me a few tries, but it finally synced.

Sorry, I don't quite follow. Is the problem still there or is it a different system?
Please check which StarWind VSAN version you are running first. In StarWind Management Console, select a server -> Configuration -> Register and see the version there.

Mareo · Mon Jul 08, 2024 6:09 am

yaroslav (staff) wrote: ↑
Thu Jul 04, 2024 8:32 am

Secondly, I got it working! I followed the links and instructions you posted, edited the HA images with steps in between like stopping/starting services, then I used the GetHASyncStatus file. It took me a few tries, but it finally synced.
Sorry, I don't quite follow. Is the problem still there or is it a different system?
Please check which StarWind VSAN version you are running first. In StarWind Management Console, select a server -> Configuration -> Register and see the version there.

This is the same system.

But the second part of the email is about a different system. I just remembered that I copied the witness and CSV files to the second node. I did find a thread about a similar issue, unless I mixed it up and got it wrong. So, naturally, my question is: can I copy the images from one node to the other, edit the images to fit the second node (change target name, sync status, etc.) and get it working again?

Mon Jul 08, 2024 10:00 am

Hi,

Theoretically, yes. But that leaves room for mistakes, and the data is likely to be inconsistent, as new data arrives and gets altered during the copy. Furthermore, I would still prefer to let a full synchronization run.
If there's volume recreation, please run replication partner removal on the only active node. Or alter the _HA.swdsk to remove the replication partner (requires replication partner removal). Then, run the addreplicationpartner script.
Let me know if you would like to learn more details.

Mareo · Mon Sep 16, 2024 12:50 pm

Hi yaroslav,

hope all is well.

I am sorry for writing again in the same thread, but it's kinda easier for me to keep track of things. I also apologize for not confirming anything after our last correspondence.

Everything was working fine since I last confirmed and I once again thank you for your assistance. Nothing much has changed, but the same issue has arisen again. We discovered the root of the problem. It seems that the fuse within the structure periodically blows, leaving the servers without power. The UPS comes online, but before anyone notices, the UPS is also drained of power. So there is a certain amount of time that the servers are without power at all.

But what we don't understand is why VSAN does not run "as intended" when we turn the whole thing on. The VM disappears, clustering fails and we have to reconnect the whole VSAN over again, sync the targets, revive the storage, etc. Similar to the issue from the start of the thread.

Regardless, my question is as follows: after trying to revive or restore the connection on VSAN for the two nodes, I've lost one of the targets. I tried to replicate the steps from last time. I managed to sync the "witness", but when I tried to do the same with the "CSV", its target just disappeared. The image is there and -HA.swdsk also, but it is not showing in VSAN. If I try anything via PowerShell, all I get is "no device found".

Thanks in advance and I am at your full disposal for any further questions.

P.S.
Would we be able to mitigate the problem with disappearing targets and VSAN dropping out of sync if we just rebuild the Cluster from scratch?

Kind regards,
Mario

Mon Sep 16, 2024 5:56 pm

Mario,

No worries at all. Did you try replicating an *.img?
If so, the target does go dark as the HA header is introduced. That header has a device ID on both nodes. Also, the change is made to the StarWind.cfg.
That is why the target goes dark for a moment.

p.s. Here, the initiator is a very big deal as each initiator has its own threshold for the target going dark. For instance, you can fairly easily recreate the HA from the IMG in ESXi, but the trick does not work as well in Windows. The reason is that the target disconnection thresholds are different for these two platforms.

Mareo · Tue Sep 17, 2024 6:06 am

Hi yaroslav,

thank you for your quick reply.

How would I go about replicating the *.img file? I've seen the PowerShell file, if we are talking about the VTLReplicationSettings.ps1, but I'm not too confident with PowerShell.

What changes would I need to make to the starwind.cfg file? As for the device ID, are you referring to the serial_id?

I feel like it would be a lot simpler for me to build a cluster from scratch with new images than bother you with this.