In doubt after problematic upgrade

Software-based VM-centric and flash-friendly VM storage + free version
Heracles31
Posts: 6
Joined: Mon Jan 06, 2025 10:22 pm

Tue Jan 27, 2026 8:08 pm

Hi,

Until yesterday, both of my VSAN Free appliances (A and B) were running version 1.6.580.7361 on two Proxmox hosts running version 9.1.1. I then started to do some upgrades and now I am not sure where I stand.

When I tried to upgrade VSAN Free appliance B, it first reported that the upgrade had failed. Even so, it kept showing the upgrade as running, with progress stuck at about 5%. After a long while, and despite the message saying not to reboot during the upgrade, the process was obviously blocked somewhere. I opened the shell and looked through the running processes for any sign of an upgrade in progress, but found nothing. I then upgraded Proxmox to 9.1.4 (successfully) and rebooted afterwards, despite the doubtful state of the VSAN upgrade.

When the system came back online, VSAN Appliance B reported that it is now running v1.7.794.7576 but that its partner appliance (A) is offline. In the Manage LUN menu, under LUN Availability, it shows itself as Synchronized and A as Offline / Unknown.

Looking at the same view from VSAN A, it says that both appliances are Online and both are Synchronized. In the list of Appliances, it shows its partner, Appliance B, on version 1.7 and itself still on 1.6.

So my questions are:
--How can I ensure that the upgrade of appliance B completed successfully and that it is trustworthy?
--How can I ensure that, as of now, the replicated storage is indeed synced between the two and does not contain errors / corruption / other problems?
--How should I proceed to upgrade Appliance A and bring both to the same level, without risking the loss of my data?

Thanks for your help and support,
yaroslav (staff)
Staff
Posts: 4309
Joined: Mon Nov 18, 2019 11:11 am

Wed Jan 28, 2026 8:56 am

Could you please share the logs from the VM that is stuck on the update? Collecting the support bundle should do it.
The inconsistency in the Appliances view is expected when the VSAN appliances are running different versions.
--How can I ensure that the upgrade of appliance B completed successfully and that it is trustworthy?
--How can I ensure that, as of now, the replicated storage is indeed synced between the two and does not contain errors / corruption / other problems?
Share the logs. Download the management console from the downloads tab and connect to the StarWind CVMs to make sure that the HA devices are synchronized.
As for corruption, I doubt it happened: from the point of view of starwind-virtual-san.service inside the VM, nothing special should have occurred, just another stop and restart.
--How should I proceed to upgrade Appliance A and bring both to the same level, without risking the loss of my data?
Just hit the update button and make sure to follow the update prerequisites.
I would suggest manually stopping StarWind Virtual SAN (starwind-virtual-san.service) before the update.
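
For readers who want to confirm the service state from the appliance shell before starting the update, here is a minimal Python sketch. It assumes the appliance is a systemd-based Linux system with `systemctl` on the PATH. The unit name is the one given above; the helper names `service_state` and `safe_to_update`, and the rule that only `inactive` counts as ready, are assumptions of this sketch rather than StarWind guidance.

```python
import subprocess

UNIT = "starwind-virtual-san.service"  # unit name as given in the thread


def service_state(unit=UNIT, run=subprocess.run):
    """Return the state word printed by `systemctl is-active` (e.g. 'active', 'inactive')."""
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is normal when the unit is not active
    )
    return result.stdout.strip()


def safe_to_update(state):
    """Treat only a cleanly stopped unit as ready (an assumption of this sketch)."""
    return state == "inactive"
```

On the appliance, `safe_to_update(service_state())` should return `True` once `systemctl stop starwind-virtual-san.service` has completed. The `run` parameter exists only so the function can be exercised without systemd.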
Heracles31
Posts: 6
Joined: Mon Jan 06, 2025 10:22 pm

Wed Jan 28, 2026 2:34 pm

Thanks for the reply.

Support bundle is too big to be sent as an attachment, so here is a link to it :

https://cumulus.jblan.ca/s/7B5L9aP5CG4CxRC

The fact that the difference is expected when the appliances are on different versions is reassuring. I will let you review the logs and, should you not see anything wrong, I will proceed with the second upgrade. I will be sure to stop VSAN as recommended before doing it.
yaroslav (staff)
Staff
Posts: 4309
Joined: Mon Nov 18, 2019 11:11 am

Wed Jan 28, 2026 6:16 pm

Hi,

Thanks for your update.
According to the logs, the replication completed, and both hosts seem to be reconnected. The VSAN build is up-to-date.
If you can see that the devices are synchronized in Windows Management Console and they have active iSCSI sessions in Proxmox, you are good to go with the update.
Heracles31
Posts: 6
Joined: Mon Jan 06, 2025 10:22 pm

Wed Jan 28, 2026 8:52 pm

Thanks a lot for the assessment and support with this. I will do it tonight during a maintenance window and will confirm the result once done.
yaroslav (staff)
Staff
Posts: 4309
Joined: Mon Nov 18, 2019 11:11 am

Wed Jan 28, 2026 10:39 pm

Fingers crossed! Thanks for sharing the logs. I have forwarded them to the dev team to see why the update stalled.
Good luck with the update.
Heracles31
Posts: 6
Joined: Mon Jan 06, 2025 10:22 pm

Thu Jan 29, 2026 3:46 am

Hi again,

So I tried the second upgrade and unfortunately, I am now in even more doubt...

1-The same error during the upgrade

At about 5%, the appliance said that the upgrade had failed. I logged into the console right away to monitor the activity instead of waiting blindly as I did the first time. I also confirmed that the StarWind VSAN service was indeed stopped.

2-Upgrade was still running up to its own reboot

There were a lot of local processes running, all related to the upgrade: make, gcc, shell scripts, apt package management, ... Once they were done, the appliance rebooted on its own.

3-No connection between the two appliances after that reboot

Appliance B, upgraded first, remained in its last state: no contact with Appliance A, and presuming A was still running 1.6 (wrong: it is on 1.7 now).
Appliance A came back aware of its own new version 1.7, but without contact with Appliance B, which it still presumes is running 1.7 (correct, of course).

4-Testing network connection between the 2 appliances

From the shell, the appliances can see and ping each other over each of their 3 network interfaces. So clearly no network / IP level problem here.
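
A successful ping only shows that the appliances can reach each other at the IP level; it does not show that a service port is answering. A minimal Python sketch of a TCP-level check follows. Port 3260 is the standard iSCSI port and is used here only as an example; the thread does not say which ports the appliances use to talk to each other, and the helper names and any addresses passed in are hypothetical.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and unreachable hosts
        return False


def check_partner(addresses, port=3260, timeout=3.0):
    """Map each partner address to its TCP reachability on the given port."""
    return {addr: tcp_reachable(addr, port, timeout) for addr in addresses}
```

Calling `check_partner` with the partner's address on each of the three interfaces gives one True/False result per link.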

5-2 new support bundles

Here is another support bundle from Appliance B, the first to be updated
https://cumulus.jblan.ca/s/g35Xw6G9anji7Xm

And here is the support bundle from Appliance A that I just updated.
https://cumulus.jblan.ca/s/joXSdCEBNzBN3Rd

So as of now, I fear a split-brain problem: each appliance thinking it runs by itself and starting to make changes to its own copy of the data. Once the connection is re-established, the two data sets will be incompatible, which may lead to corruption / data loss. For that reason, I tried to access the LUN only from Appliance B, and I am ready to consider that one as valid and give up on whatever changes may have been sent to Appliance A.

Any idea what I should do from here ?

Thanks again for your help,
yaroslav (staff)
Staff
Posts: 4309
Joined: Mon Nov 18, 2019 11:11 am

Thu Jan 29, 2026 8:06 am

First things first: did you stop starwind-virtual-san.service prior to the update? The restart is a sign that the update ran through.
If you are not certain about the HA devices' status, check the Windows-based Management Console. Split-brain is where each node says, "Partner is not synchronized, but I am." See more at https://www.starwindsoftware.com/blog/w ... -avoid-it/
Your logs, in turn, say that "replication was completed, both nodes are synchronized". I see no anomaly in the starwind-virtual-san.service behavior in the logs. This means that the storage side of the VM is functioning properly and the issue is somewhere on the UI side.
Another piece of good news: From the logs I can see that the VSAN service is up-to-date.

Please try:
1. See if the time on each node is synced. If there is a time difference, they will be grumpy.
2. Remove ANY node from the Appliances menu (yes, ignore the message that pops up, if you remove the appliance, literally nothing bad happens in 20086).
3. Join it back.
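
As an illustration of the time check in step 1, here is a minimal Python sketch that reads the `KEY=VALUE` output of `timedatectl show` and reports whether systemd considers the clock NTP-synchronized. It assumes the appliances run systemd; the helper names and the sample text are written for this example and are not taken from the appliances.

```python
def parse_properties(text):
    """Parse KEY=VALUE lines, as printed by `timedatectl show`, into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key.strip()] = value.strip()
    return props


def clock_is_synced(props):
    """True when systemd reports the system clock as NTP-synchronized."""
    return props.get("NTPSynchronized") == "yes"


# Hand-written sample for illustration (not captured from the appliances).
SAMPLE = """Timezone=Etc/UTC
LocalRTC=no
CanNTP=yes
NTP=yes
NTPSynchronized=no
"""

print(clock_is_synced(parse_properties(SAMPLE)))  # prints False
```

On a node, the text to parse would be the output of `timedatectl show`. Running the check on both nodes and comparing `date` output side by side covers both "is each clock synced" and "do the two clocks agree".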
Heracles31
Posts: 6
Joined: Mon Jan 06, 2025 10:22 pm

Thu Jan 29, 2026 2:45 pm

Hi again,

Thanks for the reassuring information and your support.

You have found a problem: NTP. The NTP server is configured under an old name on each appliance. Because of that, neither can find it, so their clocks are probably not synchronized properly. The thing is, the value is greyed out in the UI and I cannot fix it there.

Settings / General / NTP servers

Even if I click on the Edit button at the bottom, it allows me to edit only the appliance's name and not the NTP.

I went under the hood in the shell to fix it there. I changed it manually in /etc/systemd/timesyncd.conf. The service was not running despite being enabled, so I started it manually. I did this on both appliances. After that, they reported that their time is synced properly, but the old name is still the one shown in the General Settings page. The appliances were also still unable to communicate with each other.
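
To double-check which server name a timesyncd.conf file actually contains after a manual edit like this, a minimal Python sketch is shown below. It relies only on the documented layout of timesyncd.conf (a `[Time]` section with a space-separated `NTP=` list); the function name and the sample host name `ntp.example.internal` are hypothetical, and the sketch does not look at drop-in directories or at wherever the appliance UI stores its own copy of the setting.

```python
import configparser


def configured_ntp_servers(conf_text):
    """Return the servers listed under NTP= in the [Time] section of timesyncd.conf."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive (NTP, FallbackNTP)
    parser.read_string(conf_text)
    value = parser.get("Time", "NTP", fallback="")
    return value.split()  # server names are separated by spaces


# Hand-written sample for illustration (host name is hypothetical).
SAMPLE = """[Time]
NTP=ntp.example.internal
#FallbackNTP=0.debian.pool.ntp.org
"""

print(configured_ntp_servers(SAMPLE))  # prints ['ntp.example.internal']
```

On an appliance, the text would come from reading /etc/systemd/timesyncd.conf. An empty list means no `NTP=` line is active in that file.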

You also provided the solution.

Because I considered Appliance B the most up-to-date (if only one was...), I chose to remove Appliance A from B's settings and re-add it. I ignored the warning message and indeed, as soon as Appliance A was re-added, it joined properly and was instantly declared synchronized. Likewise, going back to Appliance A's interface, it now sees its partner and both agree on everything: both are online, both are connected to their partner, both are synchronized, both are running version 1.7, ...

Thanks a lot for your support and helping me through this safely.
yaroslav (staff)
Staff
Posts: 4309
Joined: Mon Nov 18, 2019 11:11 am

Thu Jan 29, 2026 8:42 pm

Great news! Thanks for your update.