Corruption?

janderson133 · Sun Nov 13, 2022 7:47 am

Hello,

I have a 3-node cluster using the free VSA (Node 3 is the witness node). The nodes are running the latest version of StarWind. There were some networking issues that caused the nodes to go offline. Node 2 shows synchronized. When it tries to synchronize with node 1, it fails at 6%. I also noticed that when I try to vMotion VMs off the iSCSI target, the task fails with the same message as the synchronization failure. Here are the logs:

11/12 22:57:05.677468 2f debug: *** Swn_CheckCompletions: io_event for ov 0000000001ED5C10, res 5, err 31!
11/12 22:57:05.677502 2f IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: C:\StarWind\storage\mnt\disk0\vmhost1-vmhost2-4TB-SSD-1.img. Error code: (31).
11/12 22:57:05.677513 2f func: >>> iScsiServer::sendNotification
11/12 22:57:05.677517 2f func: >>> CEventDataBase::AddRecord
11/12 22:57:05.677523 2f EventDB: CEventDataBase::AddRecord: Code 520, severity 3, additional strings: 2
11/12 22:57:05.677537 2f func: <<< CEventDataBase::AddRecord
11/12 22:57:05.677540 2f Srv: iScsiServer::sendNotification: Send reaction start (reaction type 2, msg code 520).
11/12 22:57:05.677552 2f Srv: iScsiServer::sendNotification: Send reaction finish.
11/12 22:57:05.677555 2f Srv: iScsiServer::sendNotification: Send reaction start (reaction type 1, msg code 520).
11/12 22:57:05.677658 2f Srv: iScsiServer::sendNotification: Send reaction finish.
11/12 22:57:05.677663 2f func: <<< iScsiServer::sendNotification
11/12 22:57:05.677666 2f IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
11/12 22:57:05.677672 2f IMG: *** ImageFile_ReadWriteWithCacheCompleted: Error (0xC0000001) returned to cache request completion!
11/12 22:57:05.677677 2f Common: CStarWindStorageDevice::AsyncReadWriteCompleted: IO operation failed, scsi status 2!
11/12 22:57:05.677680 2f Common: CStarWindStorageDevice::AsyncReadWriteCompleted: SenseCode = 3, AdSenseCode = 0, AdSenseQualifier = 0
11/12 22:57:05.677684 2f HA: *** HANode::sscRequestScsiOpRead_CompleteCallback: There's no synchronized partner and read error(senseKey = 3)
11/12 22:57:05.677688 2f HA: HASyncNode::sscRequestScsiOpRead_Complete: Native read returned: ScsiStatus = 0x2, SenseCode = 0x3, AdSenseCode = 0x0, AdSenseQualifier = 0x0
11/12 22:57:05.677692 2f HA: HANode::completeSscCommandRequest: (0x28) CHECK_CONDITION , sense: 0x3 0x0/0x0 returned.
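For anyone reading these lines later, the SCSI fields in the HASyncNode entry can be decoded with the standard SCSI status and sense key tables. The sketch below is only an illustration: the function name `decode_scsi_result` is hypothetical, and it assumes the `SenseCode` field in the log carries the SCSI sense key (the HANode line above prints the same value as `senseKey = 3`).

```python
import re

# Standard SCSI values; only a subset is listed here.
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Matches the field layout of the HASyncNode line quoted above.
# Assumption: "SenseCode" is the sense key.
PATTERN = re.compile(
    r"ScsiStatus = (0x[0-9A-Fa-f]+), SenseCode = (0x[0-9A-Fa-f]+), "
    r"AdSenseCode = (0x[0-9A-Fa-f]+), AdSenseQualifier = (0x[0-9A-Fa-f]+)"
)


def decode_scsi_result(line):
    """Return the decoded SCSI fields of one log line, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    status, key, asc, ascq = (int(group, 16) for group in match.groups())
    return {
        "status": SCSI_STATUS.get(status, "UNKNOWN"),
        "sense_key": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": asc,
        "ascq": ascq,
    }
```

Under that assumption, the entry above decodes to CHECK CONDITION with sense key MEDIUM ERROR, returned for opcode 0x28, which is READ(10). That describes how the read failed at the SCSI layer; it does not by itself say which layer underneath caused it.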

Any help appreciated!

yaroslav · Mon Nov 14, 2022 2:29 pm

This is a storage delay, not corruption. Could you please check the underlying storage health?
It might also help to check the network cables and make sure the MTU is set to 1500 across the entire synchronization network stack.
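As a rough way to do the MTU part of that check on the Linux side, something like the sketch below could be run on each VSA. It reads the MTU that Linux exposes under `/sys/class/net/<interface>/mtu`. The function names and the interface names in the usage note are hypothetical, and this covers only the guest interfaces: the ESXi vSwitches, VMkernel ports, and any physical switches in the path still need to be checked separately.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU that Linux reports for one network interface."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def find_mtu_mismatches(ifaces, expected=1500, sysfs_root="/sys/class/net"):
    """Return {interface: mtu} for every interface whose MTU differs from expected."""
    mismatches = {}
    for iface in ifaces:
        mtu = read_mtu(iface, sysfs_root)
        if mtu != expected:
            mismatches[iface] = mtu
    return mismatches
```

For example, `find_mtu_mismatches(["ens192", "ens224"])` would return an empty dict if both (hypothetical) synchronization interfaces are at 1500.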

janderson133 · Mon Nov 14, 2022 5:27 pm

Thanks yaroslav

I don't think this is a storage issue. The VSA is running on an SSD inside the ESXi host. All the other VMs running on that SSD are fine. This only started happening after the networking issues. Maybe the image is damaged inside the VSA, but this would be self-inflicted by the StarWind software because the VSA never lost connection to its storage. I have since rebooted the VSA to see if it would come back to life. The VSA storage for the image is XFS and I haven't tried doing a repair on it yet - but I could if you think that would be worth trying.

I'm trying the sync again - storage latency is 2ms peak on the read.

Also, I see this same storage image disconnect message when doing a sync between the HA nodes, and also when the other HA node is powered off and I just try copying a file from the datastore in VMware.

Any additional help would be greatly appreciated.

Thanks!

Jeff

janderson133 · Mon Nov 14, 2022 10:01 pm

yaroslav,

You might be right - I see block errors on the VSA. I'm trying to determine if this is within the VSA or the vmware SSD.
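A minimal sketch of how block errors like these could be tallied per device from the VSA's kernel log, to help narrow down which disk is affected. It assumes the errors appear in the usual Linux form `I/O error, dev <name>, sector <n>` (for example in `dmesg` output); the function names are hypothetical, and the exact message text can differ between kernel versions.

```python
import re
from collections import Counter

# Assumed message shape: "... I/O error, dev <name>, sector <n> ..."
IO_ERROR = re.compile(r"I/O error, dev (\S+), sector (\d+)")


def find_block_errors(kernel_log_text):
    """Return a (device, sector) pair for each block I/O error found in the text."""
    return [
        (match.group(1), int(match.group(2)))
        for match in IO_ERROR.finditer(kernel_log_text)
    ]


def errors_per_device(kernel_log_text):
    """Count block I/O errors per device name."""
    return Counter(device for device, _ in find_block_errors(kernel_log_text))
```

Errors that show up only on the virtual disk holding the image, while the ESXi host reports nothing for the SSD itself, would point more toward the VSA side than toward the hardware, though that is only a heuristic.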

I'll keep you posted - thanks!

Jeff

Mon Nov 14, 2022 10:43 pm

Hi,

Thanks for your update. If you are using eager or lazy zeroed drives, check this KB https://knowledgebase.starwindsoftware. ... ontroller/.