Recover from Cluster Failure

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

OceanCompanies
Posts: 4
Joined: Wed Sep 30, 2020 10:09 pm

Wed Sep 30, 2020 10:25 pm

Hello, I have an interesting situation: we recently recovered from a two-node cluster failure. The two identical servers hosting a VSAN each had a fault on one DIMM that ended up destroying the motherboards. One server was completely inoperable, while the other was still operational on one CPU. That server also had a PCIe SSD for cache. The PCIe riser did not work, so we didn't have access to that cache disk.

As a result, I was not able to bring the VSAN online because it could not find one of the devices. I found the issue and modified a copy of the swdsk file to remove that SSD. Recovering to the HA device did not work, but attaching it as a FLAT device did work. So I was back in business.

Now the hardware has been replaced in both nodes and I have access to the SSD again. The issue is that the one server that stayed up has been running some VMs while the other was down, so it has new data.

1. I'm afraid to connect the other server back to the network and lose data during a sync.
2. I am not sure how to fix the High Availability, since when I tried to reuse my saved swdsk files the device would not become active.
3. I'm still running production VMs at the moment and I need to minimize downtime.

Does anyone have suggestions, or is it just a matter of creating some full backups and hoping for the best?
yaroslav (staff)
Staff
Posts: 2346
Joined: Mon Nov 18, 2019 11:11 am

Thu Oct 01, 2020 11:14 am

Hello,

Is only the sync channel disconnected, or is the heartbeat channel disconnected too?
OceanCompanies
Posts: 4
Joined: Wed Sep 30, 2020 10:09 pm

Thu Oct 01, 2020 3:39 pm

Both the sync and heartbeat channels are currently disconnected.
yaroslav (staff)
Staff
Posts: 2346
Joined: Mon Nov 18, 2019 11:11 am

Fri Oct 02, 2020 7:49 am

And are the VMs running on that server?
If so, the server is in a split-brain state right now. There might be data corruption if there are VMs running on that server.

There is a workaround, though, and we can fix that.
OceanCompanies
Posts: 4
Joined: Wed Sep 30, 2020 10:09 pm

Fri Oct 02, 2020 4:00 pm

VMs are only running on one of the servers: the one that has the FLAT device. The other server's data should be from before the hardware failure, and no VMs are running on it.
yaroslav (staff)
Staff
Posts: 2346
Joined: Mon Nov 18, 2019 11:11 am

Sat Oct 03, 2020 9:44 am

It's great that no VMs are running on the faulty server.
Are the HA devices "not synchronized" on the faulty server?
OceanCompanies
Posts: 4
Joined: Wed Sep 30, 2020 10:09 pm

Tue Oct 06, 2020 9:07 pm

Been working with support on this.

So far, the first steps were to force-remove the device and image from the second server and then create a new replica of the good device on that server. That part went well, but the replication took a few hours to finish.

I'm currently working through some MPIO errors that appeared after the sync completed. I'll post a full update once everything is completely resolved.
yaroslav (staff)
Staff
Posts: 2346
Joined: Mon Nov 18, 2019 11:11 am

Wed Oct 07, 2020 4:04 am

Thanks for reaching out to us.
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Mon Nov 02, 2020 2:27 pm

Hello,
Quick update for the community.
We were able to attach the existing SSD cache devices by modifying the .swdsk headers.
The MPIO issue was resolved with a server reboot.
Thank you!