Replication and Hearthbeat channel down?

Posted: Wed Oct 11, 2017 2:17 pm
by jortie
Hi,

I have a two node cluster. Node 2 reports that both sync and heartbeat channels are down to node 1.
The replication manager on Node 1 shows the sync and heartbeat channel towards node 2 as up.

I can ping the IP addresses on both sides for sync and heartbeat. Also telnet to port 3260 succeeds in all directions.
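For readers who want to repeat the telnet check without a telnet client, here is a minimal Python sketch. The function name `tcp_port_open` and the addresses in the usage comment are illustrative assumptions, not values from this thread; 3260 is the port mentioned above.

```python
import socket


def tcp_port_open(host: str, port: int = 3260, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, timed out) when it cannot connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (hypothetical addresses, replace with your own sync and
# heartbeat IPs and run it from both nodes):
#   for name, ip in {"sync": "172.16.20.1", "heartbeat": "172.16.10.1"}.items():
#       print(name, ip, "open" if tcp_port_open(ip) else "closed")
```

Note that a successful connection only shows that the port is reachable. It does not show that an iSCSI login over that connection will succeed.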

This is the relevant part of the log on node 2.

10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: SendCustomControlScsiCommand(HA_CHANNEL_TYPE_SYNC) failed, error code 1168, scsi status = 0!
10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: Try to get partner node version through heartbeat channel.
10/11 15:20:45.265 2fe0 HA: CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!
10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: EXITing with failure, SendCustomControlScsiCommand(HA_CHANNEL_TYPE_HEARTBEAT) failed, error code 1168, scsi status = 0!
10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendGetPartnerNodeInfoCommand: EXITing with failure, partner node version update failed!
10/11 15:20:47.459 1b98 Sw: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: socket error, connection lost.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-ssd
10/11 15:20:47.459 1b98 Sw: *** MountTarget: Failed to log in to target(iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-ssd). Error message: iscsi_service: socket error, connection lost..
10/11 15:20:47.459 1b98 HA: CNIXInitiator::MountTarget: unable to mount the target (1223)!
10/11 15:20:47.640 5d8 Sw: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: socket error, connection lost.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:192.168.0.100-witness
10/11 15:20:47.640 5d8 Sw: *** MountTarget: Failed to log in to target(iqn.2008-08.com.starwindsoftware:192.168.0.100-witness). Error message: iscsi_service: socket error, connection lost..
10/11 15:20:47.640 5d8 HA: CNIXInitiator::MountTarget: unable to mount the target (1223)!
10/11 15:20:47.862 2044 Sw: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: socket error, connection lost.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-coldstorage
10/11 15:20:47.862 2044 Sw: *** MountTarget: Failed to log in to target(iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-coldstorage). Error message: iscsi_service: socket error, connection lost..
10/11 15:20:47.862 2044 HA: CNIXInitiator::MountTarget: unable to mount the target (1223)!
10/11 15:20:48.279 2fe0 HA: CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!
Any thoughts on what is going on?
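To get a quick overview of a log excerpt like the one above, a small Python sketch can count the error codes and list the targets whose login failed. The line layout is inferred only from the excerpt in this post, the helper name `summarize` is made up for illustration, and the mapping of 1168 and 1223 to Win32 error names assumes the service reports standard Windows error codes, which this thread does not confirm.

```python
import re
from collections import Counter

# Layout inferred from the excerpt: "MM/DD HH:MM:SS.mmm <thread> <component>: <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<thread>[0-9a-fA-F]+) (?P<component>\w+): (?P<message>.*)$"
)
# Codes appear either as "error code 1168" or as "(1223)!" at the end of a message.
CODE_RE = re.compile(r"error code (\d+)|\((\d+)\)!")
TARGET_RE = re.compile(r"Failed to log in to target\(([^)]+)\)")

# Assumption: these are plain Win32 error codes.
WIN32_NAMES = {1168: "ERROR_NOT_FOUND", 1223: "ERROR_CANCELLED"}


def summarize(lines):
    """Return (Counter of error codes, list of targets with failed logins)."""
    codes = Counter()
    targets = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log line
        message = match.group("message")
        for first, second in CODE_RE.findall(message):
            codes[int(first or second)] += 1
        target = TARGET_RE.search(message)
        if target and target.group(1) not in targets:
            targets.append(target.group(1))
    return codes, targets
```

Calling `summarize(open("node2.log"))` (the file name is a placeholder) returns the counts and the target list; `WIN32_NAMES.get(code, "unknown")` gives a readable name for each code under the assumption stated above.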

Re: Replication and Hearthbeat channel down?

Posted: Wed Oct 11, 2017 3:19 pm
by Boris (staff)
jortie,

If you select a device, right-click it and go to Replication Node Interfaces, try deleting one heartbeat interface that shows a red cross and then re-adding it. Also, what are your NICs (vendor, model), and are Jumbo Frames (MTU) enabled on the iSCSI/Sync network?

Re: Replication and Hearthbeat channel down?

Posted: Wed Oct 11, 2017 6:22 pm
by jortie
Re-adding didn't make a difference. As soon as I re-add the interface, it immediately shows a red cross. I have jumbo frames enabled on the replication NICs. The heartbeat NIC doesn't have jumbo frames enabled.

Replication NICs are Intel X540.
Heartbeat is Broadcom NetXtreme gigabit. I just disabled Ethernet@WireSpeed. It doesn't make a difference, though.

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 8:08 am
by Boris (staff)
In certain cases, disabling Jumbo frames helps to solve this issue. Please try disabling Jumbo frames on both nodes and report how the links behave.

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 8:50 am
by jortie
On one of the Intel NICs jumbo frames were enabled; on the other they were not. Jumbo frames are now disabled on both nodes. Both Broadcom NICs already had an MTU of 1500.
I deleted all interfaces and added them again. They instantly showed up with a red cross.
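An MTU mismatch like the one found here can be checked end to end with a don't-fragment ping. The sketch below only computes the ICMP payload size for a given MTU (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header) and builds the matching Windows `ping -f -l` command line. The function names and the example address are illustrative assumptions, not something posted in this thread.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def df_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def windows_df_ping(host: str, mtu: int) -> str:
    """Build a Windows ping command: -f sets Don't Fragment, -l sets payload size."""
    return f"ping -f -l {df_ping_payload(mtu)} {host}"


# Example (hypothetical partner address):
#   windows_df_ping("172.16.20.2", 9000)  -> command to test a 9000-byte MTU path
#   windows_df_ping("172.16.20.2", 1500)  -> command to test a standard MTU path
```

If the largest payload for the configured MTU does not get a reply while a smaller one does, some hop or NIC on that path is using a smaller MTU.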

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 10:41 am
by Boris (staff)
Could you show me some screenshots of how the same device looks on both servers in terms of Replication Node Interfaces?

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 11:43 am
by jortie
Here we go

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 11:43 am
by jortie
And the last one

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 12:04 pm
by Boris (staff)
Thank you.
In fact, I asked for Replication Node Interfaces, not Replication Manager. Could you post that for at least HAimage1 from both servers?

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 3:04 pm
by jortie
Sorry about that.

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 6:44 pm
by Ivan (staff)
Hello jortie,
Could you please share a screenshot of the "Network" tab on each StarWind node, like in the picture below?
Thank you.

Re: Replication and Hearthbeat channel down?

Posted: Thu Oct 12, 2017 11:21 pm
by jortie
Here you go!

Re: Replication and Hearthbeat channel down?

Posted: Fri Oct 13, 2017 11:53 am
by Ivan (staff)
Hello jortie,
Thank you for the screenshots.
Could you please check "Windows features and Roles", remove the features shown in the screenshot below (if they exist), and reboot the server?
Thank you.

Re: Replication and Hearthbeat channel down?

Posted: Fri Oct 13, 2017 4:31 pm
by jortie
They are not installed on either node.

Re: Replication and Hearthbeat channel down?

Posted: Fri Oct 13, 2017 4:40 pm
by Ivan (staff)
Hello jortie,
Thank you for your reply.
Did you try rebooting the second node?
Could you please collect the logs and share them with us?
For quicker and easier log collection from StarWind nodes, please use the script from our knowledge base article below:
https://knowledgebase.starwindsoftware. ... collector/
You can upload the collected logs to any cloud storage (Dropbox, Google Drive, OneDrive, etc.) and share the download link.