Resync failing after a while
Posted: Tue Mar 14, 2023 11:29 pm
I have a two node vsan free cluster. One node is synchronised and exposing it's targets.
The other node is not synchronised, and I can't get it to sync again after it crashed.
It has been working just fine before.
Sync starts, I see 5 Gbps of traffic over the synchronisation link and the sync percentage increases. One time I have seen it go over 60% before failing.
The node that is in sync logs the following:
It seems to me the connection drops between the two. I checked the crosslink for errors, to find nothing.
Underlying storage is healthy and performing normal (e.g. over 10k iops and just south of 1 Gbps of throughput).
I have tried to eleminate the crosslinks by taking one of two out, same result. Rebooted the failing node.
Can someone point me in a direction where to troubleshoot further?
The other node is not synchronised, and I can't get it to sync again after it crashed.
It has been working just fine before.
Sync starts, I see 5 Gbps of traffic over the synchronisation link and the sync percentage increases. One time I have seen it go over 60% before failing.
The node that is in sync logs the following:
Code: Select all
3/14 22:22:38.189654 208 HA: HANode::handleCompletionCustomWriteSendRequest: Information about partner node('iqn.2008-08.com.starwindsoftware:stw04-partnercv4') was obtained: active = 0, syncStatus = 2, synchronizerPriority = 0x0, syncCount = 1, hbCount = 1, storageTaskTimeout = 0, lastPerfDegEventElapsedTimeMs = -1, priority = 1.
3/14 22:22:38.189674 2950 func: >>> AsyncTaskCallback::execute
3/14 22:22:38.189681 1748 Common: *** `anonymous-namespace'::iSCSIAsyncCommandCallback: User request 000001640EF9E898 was cancelled.
3/14 22:22:38.189709 1748 HA: *** CReadWriteSyncTask::partner_request_completed: Remote write(async) failed, SystemError = 0x4C7, ScsiStatus = 0x0, SenseCode = 0x0, AdSenseCode = 0x0, AdSenseQualifier = 0x0
3/14 22:22:38.189725 1748 HA: CReadWriteSyncTask::complete_task: The 0x00000165BB8F46A0 synchronization task has been failed!
3/14 22:22:38.189692 2950 HA: HANode::handleCompletionCustomWriteSendRequest: Replication write to partner(iqn.2008-08.com.starwindsoftware:stw04-partnercv4) has been failed, will try to check if we have storage performance degradation on the partner side.
3/14 22:22:38.189747 1748 Common: *** `anonymous-namespace'::iSCSIAsyncCommandCallback: User request 0000016343173B68 was cancelled.
3/14 22:22:38.189787 1748 HA: *** CReadWriteSyncTask::partner_request_completed: Remote write(async) failed, SystemError = 0x4C7, ScsiStatus = 0x0, SenseCode = 0x0, AdSenseCode = 0x0, AdSenseQualifier = 0x0
3/14 22:22:38.189803 1748 HA: CReadWriteSyncTask::complete_task: The 0x00000165BB8F5500 synchronization task has been failed!
3/14 22:22:38.189931 1748 Common: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: partially sent PDU has expired, terminating connection.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:stw04-partnercv4
3/14 22:22:38.189940 3060 HA: HANode::handleCompletionCustomWriteSendRequest: Information about partner node('iqn.2008-08.com.starwindsoftware:stw04-partnercv4') was obtained: active = 0, syncStatus = 2, synchronizerPriority = 0x0, syncCount = 1, hbCount = 1, storageTaskTimeout = 0, lastPerfDegEventElapsedTimeMs = -1, priority = 1.
3/14 22:22:38.189978 3060 HA: *** HA_Partner_iScsi_Device::send_sync_custom_write_scsi_command: EXITing with failure(failed), target is not mounted!
3/14 22:22:38.189978 3488 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190024 3060 func: >>> HASyncNode::solvePartnerCmdSendFailure
3/14 22:22:38.190062 3530 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190092 a80 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190113 a80 HA: *** CNIXInitiator::SendCommand: unable to send a command (1627)!
3/14 22:22:38.190135 4670 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190158 2950 HA: HANode::handleCompletionCustomWriteSendRequest: Information about partner node('iqn.2008-08.com.starwindsoftware:stw04-partnercv4') was obtained: active = 0, syncStatus = 2, synchronizerPriority = 0x0, syncCount = 1, hbCount = 1, storageTaskTimeout = 0, lastPerfDegEventElapsedTimeMs = -1, priority = 1.
3/14 22:22:38.190030 b6c Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190080 29d8 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190099 3530 HA: *** CNIXInitiator::SendCommand: unable to send a command (1627)!
3/14 22:22:38.190046 51b8 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
Underlying storage is healthy and performing normal (e.g. over 10k iops and just south of 1 Gbps of throughput).
I have tried to eleminate the crosslinks by taking one of two out, same result. Rebooted the failing node.
Can someone point me in a direction where to troubleshoot further?