-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Sat Feb 09, 2019 7:09 pm
Earlier this morning one of my nodes stopped syncing with the other. Everything on the node that experienced the issue effectively shut off, while operations continued as normal on the surviving node.
I now have both systems back online, and for the past hour and a half I watched the first of nine HA devices synchronize. However, at this point no further synchronization is happening, and I can't manually kick off the sync by calling the "SyncHADeviceAdvanced.ps1" script like I typically could in the past.
I am seeing a bunch of this in the log on the node that is currently online and showing a good status:
2/9 16:51:36.087 1af0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: Try to get partner node version through heartbeat channel.
2/9 16:51:36.087 1af0 HA: *** CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!
2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendPartnerNodeVersionRequestCommand: EXITing with failure, SendCustomControlScsiCommand(HA_CHANNEL_TYPE_HEARTBEAT) failed, error code 1168, scsi status = 0!
2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendForwardClientDataOutCommand: EXITing with failure, partner node version 0x0 is not supported or invalid. Nothing will be sent!
That basically repeats over and over.
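Since the same block repeats, it can help to count which functions are reporting failures rather than scrolling through the log. Below is a minimal Python sketch that does this for lines shaped like the excerpt above. The line layout (date, time, thread ID, module, optional `***`, function name, message) is inferred from these four lines only, and treating `***` as a failure marker is an assumption; `summarize_failures` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above (an assumption, not a documented format):
#   <date> <time> <thread id> <module>: [*** ]<Class::Function>: <message>
LINE_RE = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+\.\d+) (?P<thread>[0-9a-fA-F]+) "
    r"(?P<module>\w+): (?P<flag>\*\*\* )?(?P<source>[\w:]+): (?P<message>.*)$"
)


def summarize_failures(lines):
    """Count lines flagged with '***', grouped by the reporting function."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Assumption: '***' marks a warning or failure line.
        if match and match.group("flag"):
            counts[match.group("source")] += 1
    return counts


# The four lines quoted in the post, used as sample input.
SAMPLE = [
    "2/9 16:51:36.087 1af0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: "
    "Try to get partner node version through heartbeat channel.",
    "2/9 16:51:36.087 1af0 HA: *** CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: "
    "Valid channel not found!",
    "2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendPartnerNodeVersionRequestCommand: "
    "EXITing with failure, SendCustomControlScsiCommand(HA_CHANNEL_TYPE_HEARTBEAT) failed, "
    "error code 1168, scsi status = 0!",
    "2/9 16:51:36.087 1af0 HA: *** CHAPartnerNode::SendForwardClientDataOutCommand: "
    "EXITing with failure, partner node version 0x0 is not supported or invalid. "
    "Nothing will be sent!",
]

if __name__ == "__main__":
    for source, count in summarize_failures(SAMPLE).most_common():
        print(count, source)
```

To run it against a real log, pass the open file object (or a list of its lines) to `summarize_failures` instead of `SAMPLE`.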
I will admit that, before the reboot, patches were applied to the system (stupid mistake on my part), and I toyed with the idea of running a StarWind update. However, I read that the software should be fully synchronized before running the patch, so I dropped out of the installer. Perhaps that is causing the sync block?
Any assistance would be great.
-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Mon Feb 11, 2019 3:27 pm
The more I look at this the more I'm thinking that something happened to one of my nodes and caused all of the HA devices on it to become corrupt.
Is there a way that I can remove the HA from the existing volumes and then recreate HA by adding a replica partner? Or do I need to create all new HA_LSFS devices and migrate to them?
-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Mon Feb 11, 2019 7:15 pm
So having done absolutely nothing other than letting both systems sit... I'm now seeing a status of Synchronizing on one drive and the others are waiting. Not sure what happened for 24h but I guess it was courting the HA replica? Maybe they took a night out to get re-acquainted? I'm not sure...
-
Oleg(staff)
- Staff
- Posts: 568
- Joined: Fri Nov 24, 2017 7:52 am
Wed Feb 13, 2019 10:47 am
Could you please collect the logs from the nodes and log a support case using this form?
Please use this tool for log collection.
Please refer to this thread.
-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Wed Feb 13, 2019 5:31 pm
I will upload the logs now. I grabbed logs when the problem first happened. I will upload those. One drive was not able to come back, but the rest of the drives are online and HA now.
-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Thu Feb 14, 2019 1:29 pm
The logs are uploaded. Hoping that this was some kind of hardware issue.
I will comment that I updated to the latest build of VSAN while we were out (because I like to play fast and loose like that) and it seems to have resolved a memory leak issue I was experiencing.
-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Thu Feb 14, 2019 8:24 pm
Hi Oleg - I uploaded the logs to the support site as requested. I actually did this on Wednesday when you asked for the logs but I didn't post a reply here until this morning. I named the support case the same as this forum post in my subject.
-
Oleg(staff)
- Staff
- Posts: 568
- Joined: Fri Nov 24, 2017 7:52 am
Fri Feb 15, 2019 10:06 am
Unfortunately, we did not get any emails from you. That is why I am asking.
Could you please send us the logs one more time?
-
xpystchrisx
- Posts: 26
- Joined: Tue Jun 05, 2018 6:20 pm
Fri Feb 15, 2019 1:01 pm
I attempted to post the case this morning, and after pushing submit I got a 500 error. Can you tell me if you received the logs? If not, I'll post them again.
-
Oleg(staff)
- Staff
- Posts: 568
- Joined: Fri Nov 24, 2017 7:52 am
Fri Feb 15, 2019 5:06 pm
Could you please check the size of the zip archive with the logs?
If it is larger than 20 MB, please upload it to a file-sharing service and send us the link.
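A quick way to check the archive against that limit before submitting is sketched below in Python. It assumes "20 MB" means 20 × 1024 × 1024 bytes, which the post does not specify; the function names and the `logs.zip` path are hypothetical.

```python
import os

# Assumption: the 20 MB limit is counted in binary megabytes (1024 * 1024 bytes).
BYTES_PER_MB = 1024 * 1024


def exceeds_limit(size_bytes, limit_mb=20):
    """Return True if a file of size_bytes is larger than limit_mb megabytes."""
    return size_bytes > limit_mb * BYTES_PER_MB


def archive_too_large(path, limit_mb=20):
    """Check the size of the archive at path against the limit."""
    return exceeds_limit(os.path.getsize(path), limit_mb)


if __name__ == "__main__":
    archive = "logs.zip"  # hypothetical path to the collected log archive
    if os.path.exists(archive):
        if archive_too_large(archive):
            print("Archive is over 20 MB: upload it to a file-sharing service.")
        else:
            print("Archive is within the 20 MB limit.")
```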