Cluster Storage offline, can't bring online

Mareo
Posts: 6
Joined: Tue May 21, 2024 11:35 am

Fri Jun 28, 2024 1:02 pm

Greetings,

This is my first time posting here.

We've encountered an issue twice now where the Starwind services could not be brought back online. The service falls out of sync, preventing me from bringing the VM, CSV, and Witness online.

I've seen similar issues discussed here multiple times, but despite following the steps closely, I couldn't resolve the problem. I think I need specific guidance for my case. Could you help me interpret the logs? If needed, I can upload them (which I believe will be necessary).

The problem starts when we connect to the servers and find that they were shut down. Upon launching Failover Cluster Manager, both servers appear disconnected from the cluster. Consequently, the VM is offline, along with the storage and the nodes themselves. Attempting to bring the VM online fails with a storage error. Both the CSV and the Witness are offline, and attempts to bring them online result in errors indicating they are not connected to the node.

However, both nodes are online.

Furthermore, when enabling Starwind and connecting to the servers, both show as "not synchronized." Running PowerShell scripts also fails, with errors such as "200 Failed: can't find partner for sync..."

Additionally, the iSCSI initiators are stuck in a reconnecting state.

Note: We are using the free version of StarWind. The operating system is Windows Server 2022 Standard, and we are using Hyper-V.

I'm available to provide any additional information you might need.

Thank you in advance, and I look forward to your assistance.

Kind regards,
Mario
yaroslav (staff)
Staff
Posts: 2678
Joined: Mon Nov 18, 2019 11:11 am

Fri Jun 28, 2024 11:18 pm

Thanks for your email. Did you have a power outage? Do your devices have a write-back cache on them?
Mareo wrote: "Additionally, the iSCSI initiators are stuck in a reconnecting state."
Are devices from both sides stuck in reconnecting?
Check this post out: viewtopic.php?f=5&t=6779&p=36805&hilit= ... %3E#p36805.

There might be another problem: the domain controller (DC) is in the cluster. In that case, the cluster cannot start because the DCs are down, and the DCs cannot start because the cluster is down. See more on DC placement: https://knowledgebase.starwindsoftware. ... san-usage/.
Force-start the cluster as described here: https://learn.microsoft.com/en-us/sql/s ... rver-ver16.
Mareo

Mon Jul 01, 2024 8:29 am

Hello,

thank you for the reply.

Kindly read the EDIT first.

I will try these things as soon as possible.

However, I do have an additional question. If we remain on only one node, can I recover the VM from Hyper-V? Currently we are on only one node and we DO NOT have access to the cluster to retrieve the VM. The VM is offline and we can't access the cluster folder, since there is currently no cluster with only one node. Is that affected by StarWind, or is there no correlation between the two and I just need to find a way to get the VM back?

Thanks in advance and kind regards,
Mario

EDIT: Please, disregard this post. I mixed up the nodes.
Last edited by Mareo on Mon Jul 01, 2024 10:22 am, edited 1 time in total.
yaroslav (staff)

Mon Jul 01, 2024 8:57 am

Yes, you can continue running with only one node marked as synchronized. Be careful and check your data before resuming the other node.
Also, consider removing the cache once things return to normal. And, last but not least, move the DCs out of the cluster and off the cluster shared volumes.
Please note that the failover cluster is a non-StarWind component. I cannot predict whether it can be force-started.
Mareo

Mon Jul 01, 2024 10:23 am

I just saw your reply. I edited the post you replied to. Regardless, thank you for your assistance.

Furthermore, I forgot to mention that we do not have the DCs in the Cluster. They are standalone VMs. Does this change the situation?

Kind regards,
Mario
yaroslav (staff)

Mon Jul 01, 2024 11:08 am

Thanks for your update.
StarWind VSAN does not need a DC for its operation. Mark one StarWind node as synchronized manually and try bringing up the cluster. Once the cluster starts and StarWind devices are marked as synchronized on one node, you should be able to access the CSVs.
Mareo

Mon Jul 01, 2024 12:26 pm

Referring to a post you quoted regarding the sync status:
yaroslav (staff) wrote:
Sun May 26, 2024 1:52 pm
Try setting sync_status to 1 only for the first occurrence. Set the later ones to 0.
Make sure to edit the files while StarWindService is stopped.
Last but not least, remove the write-back cache: that should make such incidents less frequent.


I'm kind of lost as to which file I need to change the sync status in. The HA file on the first node refers to the second node (target) and vice versa. Does the sync status refer to the target node listed above it? If I want to set sync status 1 for the second node, in which HA file do I edit the sync status?
yaroslav (staff)

Mon Jul 01, 2024 10:12 pm

Hi,

The first one is for the local device. Every other occurrence is for the respective partner.
You need to first mark one node as synchronized and check the data consistency. If the data is OK, make sure to mark the device with the respective target name in the config files with <sync_status>1</sync_status>.
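
To illustrate the rule above (the first sync_status occurrence is the local device, every later one is a partner), here is a minimal Python sketch of the edit. It is only an illustration under assumptions: it treats the config as plain text containing <sync_status>N</sync_status> elements, the function name mark_local_synchronized and the sample fragment are hypothetical, and the real file layout may differ. As noted earlier in the thread, edit the files only while StarWindService is stopped, and work on a backup copy.

```python
import re

# Matches one <sync_status>N</sync_status> element; the digits are replaced.
SYNC_TAG = re.compile(r"(<sync_status>)\s*\d+\s*(</sync_status>)")


def mark_local_synchronized(config_text: str) -> str:
    """Set the first sync_status to 1 (local device) and all later ones to 0 (partners)."""
    seen = 0

    def replace(match: re.Match) -> str:
        nonlocal seen
        value = "1" if seen == 0 else "0"
        seen += 1
        return f"{match.group(1)}{value}{match.group(2)}"

    return SYNC_TAG.sub(replace, config_text)


if __name__ == "__main__":
    # Hypothetical fragment, NOT the real config layout: only the
    # sync_status element name comes from this thread.
    sample = (
        "<local><sync_status>0</sync_status></local>\n"
        "<partner><sync_status>0</sync_status></partner>\n"
    )
    print(mark_local_synchronized(sample))
```

The sketch only rewrites the text in memory; reading and writing the actual file, and restarting the service afterwards, are left out on purpose.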
Mareo

Thu Jul 04, 2024 6:10 am

Hi,

firstly, I would like to thank you for your support.

Secondly, I got it working! I followed the links and instructions you posted, edited the HA images with steps in between like stopping/starting services, then I used the GetHASyncStatus file. It took me a few tries, but it finally synced.

The resources are back online and the VM is up and running as well.

However, could you kindly confirm whether the same would work for a similar issue where everything is online? The resources and the VM are online and everything is working. Only the HA images in StarWind show an unsynchronized status. I take it I won't need to edit the HA images? Maybe just run GetHASyncStatus?

Thanks in advance and kind regards,
Mario


EDIT: Sorry, I mixed it up. In the second issue, one node shows the CSV and witness images in an error state ("Header not found"), while the first node shows the second node as down/unsynchronized.
yaroslav (staff)

Thu Jul 04, 2024 8:32 am

Mareo wrote: "Secondly, I got it working! I followed the links and instructions you posted, edited the HA images with steps in between like stopping/starting services, then I used the GetHASyncStatus file. It took me a few tries, but it finally synced."
Sorry, I don't quite follow. Is the problem still there or is it a different system?
Please check which StarWind VSAN version you are running first. In StarWind Management Console, select a server -> Configuration -> Register and see the version there.
Mareo

Mon Jul 08, 2024 6:09 am

yaroslav (staff) wrote:
Thu Jul 04, 2024 8:32 am
Sorry, I don't quite follow. Is the problem still there or is it a different system?
Please check which StarWind VSAN version you are running first. In StarWind Management Console, select a server -> Configuration -> Register and see the version there.

This is the same system.

However, the second part of the email is about a different system. I just remembered that I copied the witness and CSV files to the second node. I did find a thread about a similar issue, unless I mixed it up and got it wrong. So, naturally, my question is: can I copy the images from one node to the other, edit the images to fit the second node (change target name, sync status, etc.) and get it working again?
yaroslav (staff)

Mon Jul 08, 2024 10:00 am

Hi,

Theoretically, yes. But that introduces room for mistakes, and the data is likely to be inconsistent because new data arrives and gets altered during the copy. Furthermore, I would still prefer to let a full synchronization run.
If the volume is recreated, please run replication partner removal on the only active node, or alter the _HA.swdsk to remove the replication partner (requires replication partner removal). Then run the addreplicationpartner script.
Let me know if you would like to learn more details.