Resync failing after a while

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

Post Reply
Tux-87
Posts: 4
Joined: Tue Mar 14, 2023 11:11 pm

Tue Mar 14, 2023 11:29 pm

I have a two node vsan free cluster. One node is synchronised and exposing it's targets.
The other node is not synchronised, and I can't get it to sync again after it crashed.
It has been working just fine before.

Sync starts, I see 5 Gbps of traffic over the synchronisation link and the sync percentage increases. One time I have seen it go over 60% before failing.
The node that is in sync logs the following:

Code: Select all

3/14 22:22:38.189654 208 HA: HANode::handleCompletionCustomWriteSendRequest: Information about partner node('iqn.2008-08.com.starwindsoftware:stw04-partnercv4') was obtained: active = 0, syncStatus = 2, synchronizerPriority = 0x0, syncCount = 1, hbCount = 1, storageTaskTimeout = 0, lastPerfDegEventElapsedTimeMs = -1, priority = 1.
3/14 22:22:38.189674 2950 func: >>> AsyncTaskCallback::execute
3/14 22:22:38.189681 1748 Common: *** `anonymous-namespace'::iSCSIAsyncCommandCallback: User request 000001640EF9E898 was cancelled.
3/14 22:22:38.189709 1748 HA: *** CReadWriteSyncTask::partner_request_completed: Remote write(async) failed, SystemError = 0x4C7, ScsiStatus = 0x0, SenseCode = 0x0, AdSenseCode = 0x0, AdSenseQualifier = 0x0
3/14 22:22:38.189725 1748 HA: CReadWriteSyncTask::complete_task: The 0x00000165BB8F46A0 synchronization task has been failed!
3/14 22:22:38.189692 2950 HA: HANode::handleCompletionCustomWriteSendRequest: Replication write to partner(iqn.2008-08.com.starwindsoftware:stw04-partnercv4) has been failed, will try to check if we have storage performance degradation on the partner side.
3/14 22:22:38.189747 1748 Common: *** `anonymous-namespace'::iSCSIAsyncCommandCallback: User request 0000016343173B68 was cancelled.
3/14 22:22:38.189787 1748 HA: *** CReadWriteSyncTask::partner_request_completed: Remote write(async) failed, SystemError = 0x4C7, ScsiStatus = 0x0, SenseCode = 0x0, AdSenseCode = 0x0, AdSenseQualifier = 0x0
3/14 22:22:38.189803 1748 HA: CReadWriteSyncTask::complete_task: The 0x00000165BB8F5500 synchronization task has been failed!
3/14 22:22:38.189931 1748 Common: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: partially sent PDU has expired, terminating connection.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:stw04-partnercv4
3/14 22:22:38.189940 3060 HA: HANode::handleCompletionCustomWriteSendRequest: Information about partner node('iqn.2008-08.com.starwindsoftware:stw04-partnercv4') was obtained: active = 0, syncStatus = 2, synchronizerPriority = 0x0, syncCount = 1, hbCount = 1, storageTaskTimeout = 0, lastPerfDegEventElapsedTimeMs = -1, priority = 1.
3/14 22:22:38.189978 3060 HA: *** HA_Partner_iScsi_Device::send_sync_custom_write_scsi_command: EXITing with failure(failed), target is not mounted!
3/14 22:22:38.189978 3488 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190024 3060 func: >>> HASyncNode::solvePartnerCmdSendFailure
3/14 22:22:38.190062 3530 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190092 a80 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190113 a80 HA: *** CNIXInitiator::SendCommand: unable to send a command (1627)!
3/14 22:22:38.190135 4670 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190158 2950 HA: HANode::handleCompletionCustomWriteSendRequest: Information about partner node('iqn.2008-08.com.starwindsoftware:stw04-partnercv4') was obtained: active = 0, syncStatus = 2, synchronizerPriority = 0x0, syncCount = 1, hbCount = 1, storageTaskTimeout = 0, lastPerfDegEventElapsedTimeMs = -1, priority = 1.
3/14 22:22:38.190030 b6c Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190080 29d8 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
3/14 22:22:38.190099 3530 HA: *** CNIXInitiator::SendCommand: unable to send a command (1627)!
3/14 22:22:38.190046 51b8 Common: *** SendCommand: unable to execute iscsi command. Error message: Trying to send command while not logged in..
It seems to me the connection drops between the two. I checked the crosslink for errors, to find nothing.
Underlying storage is healthy and performing normal (e.g. over 10k iops and just south of 1 Gbps of throughput).

I have tried to eleminate the crosslinks by taking one of two out, same result. Rebooted the failing node.

Can someone point me in a direction where to troubleshoot further?
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Wed Mar 15, 2023 12:29 pm

Can I have the logs from both servers collected with StarWind log collector, please?
See more info at https://knowledgebase.starwindsoftware. ... collector/.
Tux-87
Posts: 4
Joined: Tue Mar 14, 2023 11:11 pm

Wed Mar 15, 2023 1:36 pm

Sure, where can I send them?
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Wed Mar 15, 2023 1:50 pm

Hi,

You can use any convenient file transfer method, e.g., Google Drive, Sharepoint, Onedrive, etc.
Tux-87
Posts: 4
Joined: Tue Mar 14, 2023 11:11 pm

Wed Mar 15, 2023 3:21 pm

I PM'ed you the links with the full logs
Tux-87
Posts: 4
Joined: Tue Mar 14, 2023 11:11 pm

Thu Mar 16, 2023 9:23 am

I got the nodes back in sync.

Let me post the findings, in case someone hits the same issue. It turns out to be network related.
We are using the HPE 562flr-sfp+ flexLOM NIC, which is Intel X710 based. We should have used a Mellanox card, but PCI slots are limited on the Gen10 platform.
The MTU was left to default, because the opinions about Jumbo frames tend to differ. The driver has a default MTU of 1534 bytes. This gives a net MTU of 1492 bytes.

I increased the MTU on the crosslink to 9000 on both nodes (both on the driver, and in Windows using netsh). This gives a net MTU of 8958 bytes on the crosslink.
The sync went through just fine after this change in just one go.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Thu Mar 16, 2023 9:49 am

Thanks for getting back to me :)
Just a small remark from my end: header size is 14B. That's why the default values for Windows are 9014 and 1514, while, Linux has 9000 and 1500. Yes, they might vary depending on the drivers in Windows (I saw 9000 and 1500 a while ago to be default values).
Based on my experience Intel x710, x720, x810, and Broadcom NICs are more likely to throw troubles under MTU 9000 or 9014. Reason I believe: Firmware and drivers do not play well together.

P.S. Look for C2P events in service logs if you suspect the networking. See the sample event below.
HA: HAPartneriScsiChannel::permanentConnectionPollingThread: Event: Failed connection(C2P) - Device(0x0000000002817D98), IP(10.163.115.3), InitiatorName(iqn.2008-08.com.starwindsoftware:<target-name>), Target(iqn.2008-08.com.starwindsoftware:<target-name>)!
Post Reply