HA Devices Won't Resync After Reboot

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Sat Feb 09, 2019 7:09 pm

Earlier this morning one of my nodes stopped syncing with the other. Everything on the affected node effectively shut off, while operations continued as normal on the other node.
I now have both systems back online, and for the past hour and a half I watched the first of nine HA devices synchronize. However, at this point no further synchronization is happening, and I can't manually kick off the sync by calling the "SyncHADeviceAdvanced.ps1" script as I typically could in the past.
I am seeing a bunch of this in the log on the node that is currently online and showing good.
2/9 16:51:36.087 1af0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: Try to get partner node version through heartbeat channel.
2/9 16:51:36.087 1af0 HA: *** CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!
2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendPartnerNodeVersionRequestCommand: EXITing with failure, SendCustomControlScsiCommand(HA_CHANNEL_TYPE_HEARTBEAT) failed, error code 1168, scsi status = 0!
2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendForwardClientDataOutCommand: EXITing with failure, partner node version 0x0 is not supported or invalid. Nothing will be sent!
That basically repeats over and over.
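For anyone hitting the same loop, here is a rough Python sketch for tallying which calls are failing in a log like the one above. The line layout in the regex is an assumption inferred only from the four lines quoted in this post, and `summarize` and `error_codes` are hypothetical helper names, not part of any StarWind tool. If the "error code" value is a pass-through Windows system error (also an assumption), 1168 corresponds to ERROR_NOT_FOUND ("Element not found"), which would be consistent with "Valid channel not found!".

```python
import re
from collections import Counter

# Assumed layout, inferred from the quoted lines only:
# "<M/D> <HH:MM:SS.mmm> <thread id> <component>: [*** ]<Class::Method>: <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<thread>[0-9a-fA-F]+) (?P<component>\w+): "
    r"(?P<flag>\*\*\* )?(?P<source>[\w:]+): (?P<message>.*)$"
)

# The four lines quoted in the post, used here as sample input.
SAMPLE = [
    "2/9 16:51:36.087 1af0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: Try to get partner node version through heartbeat channel.",
    "2/9 16:51:36.087 1af0 HA: *** CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!",
    "2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendPartnerNodeVersionRequestCommand: EXITing with failure, SendCustomControlScsiCommand(HA_CHANNEL_TYPE_HEARTBEAT) failed, error code 1168, scsi status = 0!",
    "2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendForwardClientDataOutCommand: EXITing with failure, partner node version 0x0 is not supported or invalid. Nothing will be sent!",
]


def summarize(lines):
    """Count lines marked with '***', grouped by the reporting Class::Method."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("flag"):
            counts[match.group("source")] += 1
    return counts


def error_codes(lines):
    """Collect the distinct numeric values that follow the text 'error code'."""
    codes = set()
    for line in lines:
        codes.update(int(code) for code in re.findall(r"error code (\d+)", line))
    return codes


if __name__ == "__main__":
    for source, count in summarize(SAMPLE).most_common():
        print(count, source)
    print("error codes:", sorted(error_codes(SAMPLE)))
```

Pointing `summarize` at the lines of a full log file instead of `SAMPLE` should show whether the heartbeat channel failure is the only thing repeating or whether other errors are mixed in.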
I will admit that patches were applied to the system before the reboot (a stupid mistake on my part). I also toyed with the idea of running a StarWind update, but I read that the software should be fully synchronized before running the patch, so I dropped out of the installer. Perhaps that is causing the sync block?
Any assistance would be great.
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Mon Feb 11, 2019 3:27 pm

The more I look at this the more I'm thinking that something happened to one of my nodes and caused all of the HA devices on it to become corrupt.
Is there a way to remove HA from the existing volumes and then recreate it by adding a replica partner? Or do I need to create all new HA_LSFS devices and migrate to them?
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Mon Feb 11, 2019 7:15 pm

So having done absolutely nothing other than letting both systems sit... I'm now seeing a status of Synchronizing on one drive and the others are waiting. Not sure what happened for 24h but I guess it was courting the HA replica? Maybe they took a night out to get re-acquainted? I'm not sure...
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Wed Feb 13, 2019 10:47 am

Could you please collect the logs from the nodes and log a support case using this form?
Please use this tool for log collection.
Please refer to this thread.
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Wed Feb 13, 2019 5:31 pm

I grabbed logs when the problem first happened, so I will upload those now. One drive was not able to come back, but the rest of the drives are online and HA now.
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Thu Feb 14, 2019 1:29 pm

The logs are uploaded. Hoping that this was some kind of hardware issue.
I will comment that I updated to the latest build of VSAN while we were out (because I like to play fast and loose like that) and it seems to have resolved a memory leak issue I was experiencing.
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Thu Feb 14, 2019 4:12 pm

Could you please clarify the location of the logs?
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Thu Feb 14, 2019 8:24 pm

Hi Oleg - I uploaded the logs to the support site as requested. I actually did this on Wednesday when you asked for the logs but I didn't post a reply here until this morning. I named the support case the same as this forum post in my subject.
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Fri Feb 15, 2019 10:06 am

Unfortunately, we did not get any emails from you. That is why I am asking.
Could you please send us the logs one more time?
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Fri Feb 15, 2019 1:01 pm

I attempted to post the case this morning and after pushing submit I got a 500 error. Can you tell me if you received the logs? If not I'll post them up again.
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Fri Feb 15, 2019 5:06 pm

Could you please check the size of the zip archive with the logs?
If it is larger than 20 MB, please upload it to a file-sharing service and send us the link.
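A quick way to check the archive before submitting, as a minimal Python sketch. It assumes "20 MB" means 20 * 1024 * 1024 bytes (the form may count in decimal megabytes instead), and `fits_upload_limit` is a hypothetical helper name.

```python
import os

# Assumption: the 20 MB limit is interpreted as 20 MiB.
LIMIT_BYTES = 20 * 1024 * 1024


def fits_upload_limit(path, limit_bytes=LIMIT_BYTES):
    """Return True if the file at `path` is no larger than the limit."""
    return os.path.getsize(path) <= limit_bytes
```

If this returns False for the log archive, the file-sharing route is the one to use.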
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Fri Feb 15, 2019 5:09 pm

Yep, that's it. Sending it shortly.
Sorry if I missed the fact that it should be less than 20MB.
xpystchrisx
Posts: 26
Joined: Tue Jun 05, 2018 6:20 pm

Fri Feb 15, 2019 5:16 pm

Sent in. Please let me know if the ZIP comes down properly.
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Sun Feb 17, 2019 9:15 am

Thank you!
Yes, we got the logs. We will keep you updated with the progress.