Recover from Cluster Failure

OceanCompanies · Wed Sep 30, 2020 10:25 pm

Hello, I have an interesting situation: we recently recovered from a two-node cluster failure. The two identical servers hosting a VSAN each had a fault on one DIMM that ended up destroying the motherboards. One server was completely non-operational, while the other was still running on one CPU. That server also had a PCIe SSD for cache, but the PCIe riser did not work, so we didn't have access to that cache disk.

As a result, I was not able to bring the VSAN online because it could not find one of the devices. I found the issue and modified a copy of the swdsk file to remove that SSD. Recovering to the HA device did not work, but attaching it as a FLAT device did work. So I was back in business.

The hardware has now been replaced in both nodes and I have access to the SSD again. The issue is that the one server that stayed up has been running some VMs while the other was down, so it holds new data.

1. I'm afraid to connect the other server back to the network and lose data during a sync.
2. I am not sure how to fix the High Availability, since when I tried to reuse my saved swdsk files the device would not become active.
3. I'm still running production VMs at the moment and I need to minimize downtime.

Does anyone have suggestions, or is it just a matter of creating some full backups and hoping for the best?

Thu Oct 01, 2020 11:14 am

Hello,

Is only the sync channel disconnected, or is the heartbeat channel disconnected too?

OceanCompanies · Thu Oct 01, 2020 3:39 pm

Sync and Heartbeat channels are disconnected currently.

Fri Oct 02, 2020 7:49 am

And, are the VMs running on that server?
If so, the server is in a split-brain state right now. There might be data corruption if there are VMs running on that server.

There is a workaround, though, and we can fix that.

OceanCompanies · Fri Oct 02, 2020 4:00 pm

VMs are only running on one of the servers: the one that has the FLAT device. The other server's data should be from before the hardware failure, and no VMs are running on it.

Sat Oct 03, 2020 9:44 am

It's great that no VMs are running on the faulty server.
Are the HA devices shown as "not synchronized" on the faulty server?

OceanCompanies · Tue Oct 06, 2020 9:07 pm

Been working with support on this.

So far the first steps were to force remove the device and image from the second server and then create a new replica of the good device on that server. That part went well, but took a few hours to finish replicating.

Currently working through some MPIO errors after the sync was completed. I'll do a full update once everything is completely resolved.

Wed Oct 07, 2020 4:04 am

Thanks for reaching out to us.

Mon Nov 02, 2020 2:27 pm

Hello,
Quick update for the community.
We were able to attach the existing SSD cache devices by modifying the .swdsk headers.
The MPIO issue was resolved with a server reboot.
Thank you!
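
For anyone repeating the .swdsk edits described in this thread, it may help to keep a verified, timestamped copy of the file before touching it, as the original poster did by working on a copy. Below is a minimal Python sketch of that precaution only. The function names and the example path are hypothetical, and the code does not read or change the .swdsk contents in any way; it just copies the file and confirms that the copy matches.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_before_edit(original: Path) -> Path:
    """Copy a file next to itself with a timestamped .bak name and verify it.

    Hypothetical helper: it only copies the file, it never edits it.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = original.with_name(f"{original.name}.{stamp}.bak")
    shutil.copy2(original, backup)  # copy2 also preserves timestamps
    if sha256_of(original) != sha256_of(backup):
        raise OSError(f"backup of {original} does not match the original")
    return backup


# Example (hypothetical path, adjust to wherever your device files live):
# backup_before_edit(Path(r"D:\vsan\device1.swdsk"))
```

With a verified copy in place, a failed header edit can be rolled back by restoring the .bak file, ideally with the storage service stopped while files are swapped.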