Flood of errors: section is locked by dist sharer ID 0xD2


nicholas.dale
Posts: 7
Joined: Mon Jun 21, 2021 9:02 am

Mon Jun 21, 2021 9:19 am

I am using StarWind Virtual SAN for vSphere in a two-node setup. Both nodes are running as appliances, with RAID'd SSDs and a VDO volume configured on top. The servers are connected over 10 Gb fibre for synchronization and Ethernet links for the heartbeat.
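For context, the storage stack on each appliance was built roughly like this (a sketch only; the device names and the mdadm/vdo invocations below are illustrative, not our exact commands):

Code: Select all

# Illustrative only: software RAID of SSDs with a VDO volume on top.
# Device names (/dev/sd[b-e], /dev/md0) are placeholders.
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
vdo create --name=vdo_vsan --device=/dev/md0
# The StarWind device then sits on /dev/mapper/vdo_vsan.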
Last night, there was a flood of errors across the setup just after midnight, beginning with:

Code: Select all

DistCs::Enter: (CSid = 0xCCC5) section is locked by dist sharer ID 0xD2.
Following this, there were many instances of:

Code: Select all

DistCs::Enter: (CSid = 0xCCC5) retry #1 after sleep 152 ms...
and various iSCSI disconnections and reconnections.
This caused the CPU usage of one of the nodes to skyrocket and caused issues with the VMs, as the SAN appeared to flip on and off.
The other node also logged errors, but those looked to be knock-on effects of the primary issue.
I have checked the health of the underlying storage.
After 5 hours this behaviour appeared to resolve itself, with both nodes and VSAN performance returning to normal.
Does anyone here have an idea as to why this might occur?
I have log files available for further investigation.
yaroslav (staff)
Staff
Posts: 2355
Joined: Mon Nov 18, 2019 11:11 am

Tue Jun 22, 2021 4:20 am

Welcome to StarWind Forum.
Please note that insufficient CPU and/or RAM resources can result in service hangs. Could you please tell me which build you are running?
Also, how were you able to determine the iSCSI disconnects?

Finally, please share the logs with me. They need to be collected via the StarWind Log Collector: https://knowledgebase.starwindsoftware. ... collector/
nicholas.dale
Posts: 7
Joined: Mon Jun 21, 2021 9:02 am

Sun Jun 27, 2021 3:16 pm

Hi there,
The build on both nodes is 13861. The iSCSI issues were diagnosed from the vCenter logs:

Code: Select all

Lost access to volume 609b9cb5-0c006c2e-9af7-74867aee68f6 (VDISK01) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
06/21/2021, 5:12:05 AM
Lost path redundancy to storage device eui.8be9275f38627186. Path vmhba66:C0:T0:L0 is down. Affected datastores: VDISK01.
06/21/2021, 5:12:05 AM
Lost connectivity to storage device eui.8be9275f38627186. Path vmhba66:C0:T1:L0 is down. Affected datastores: VDISK01.
In particular, the following messages kept looping:

Code: Select all

Lost access to volume 609b9cb5-0c006c2e-9af7-74867aee68f6 (VDISK01) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly
Successfully restored access to volume 609b9cb5-0c006c2e-9af7-74867aee68f6 (VDISK01) following connectivity issues.
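For completeness, the disconnects can also be confirmed on the ESXi hosts themselves; a quick sketch (standard esxcli namespaces and the default VMkernel log location assumed):

Code: Select all

# List the active iSCSI sessions on the host.
esxcli iscsi session list

# Search the VMkernel log for iSCSI connect/disconnect events.
grep -i iscsi /var/log/vmkernel.log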
Both nodes are running as VMs with 4 vCPUs and 8 GB of RAM each.

I could not attach the files directly as they were too big, so I'm providing links below:
https://cloud.riplad.com/index.php/s/tc94LA9wTdd9ng2
https://cloud.riplad.com/index.php/s/TGc3aA2qaRdPeoE

Let me know if there is anything else you need.
Thanks for the assistance.
yaroslav (staff)
Staff
Posts: 2355
Joined: Mon Nov 18, 2019 11:11 am

Tue Jun 29, 2021 1:13 pm

Thank you. Will check the logs tomorrow.
nicholas.dale
Posts: 7
Joined: Mon Jun 21, 2021 9:02 am

Tue Jun 29, 2021 11:34 pm

Thank you. As an update, the issue happened again, with similar error messages and increasing IOWait and alerts on the VMs. It was again around 12 AM; I'm not sure what could trigger it at that time, though.
yaroslav (staff)
Staff
Posts: 2355
Joined: Mon Nov 18, 2019 11:11 am

Thu Jul 01, 2021 6:25 am

Thanks for your patience.
There could be network connectivity issues. Make sure MTU=1500 is set in ESXi for the VMkernel interfaces (VMKs) and vSwitches. Also, please be aware that teaming or any other kind of network aggregation is not supported: https://www.starwindsoftware.com/best-p ... practices/
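A sketch of how to check and set this from the ESXi shell (standard esxcli commands; vmk1 and vSwitch1 are placeholder names):

Code: Select all

# Check the current MTU on VMkernel interfaces and standard vSwitches.
esxcli network ip interface list
esxcli network vswitch standard list

# Set MTU=1500 on a VMkernel interface and its vSwitch (names are examples).
esxcli network ip interface set -i vmk1 -m 1500
esxcli network vswitch standard set -v vSwitch1 -m 1500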
Node2 (these warnings show SCSI commands taking over three seconds to complete):

Code: Select all

6/21 5:00:41.382341 8b HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x0000000025CA2A90) Funciton(Execute SCSI Command) OpCode(0x42) ex(0x0) DiffTimeCompleteINIT = 3431 ms, DiffTimeCompleteEXEC = 3431 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.0.1.61-vsan1'
6/21 5:00:42.842221 61 HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x000000002612B130) Funciton(Execute SCSI Command) OpCode(0xF3) ex(0x50) DiffTimeCompleteINIT = 3562 ms, DiffTimeCompleteEXEC = 3562 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.0.1.61-vsan1'
6/21 5:00:44.572639 8b HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x0000000025CA2A90) Funciton(Execute SCSI Command) OpCode(0x42) ex(0x0) DiffTimeCompleteINIT = 3190 ms, DiffTimeCompleteEXEC = 3190 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.0.1.61-vsan1'
6/21 5:00:45.869353 61 HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x000000002612B130) Funciton(Execute SCSI Command) OpCode(0xF3) ex(0x50) DiffTimeCompleteINIT = 3020 ms, DiffTimeCompleteEXEC = 3020 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.0.1.61-vsan1'

Node2 again.
These events are fine:

Code: Select all

6/21 5:12:07.315933 b2 HA: DistSrwLock::EnterWriterOnTheDistributeSharersSide: (SWMRid = 0xCCCE) ENTER command on dist sharer(Id = 0xD2) can't be complete. Sharer return BUSY.
6/21 5:12:07.315953 b2 HA: DistSrwLock::EnterWriter: (SWMRid = 0xCCCE) retry #35 after sleep 63 ms..

They just mean that the destination is busy doing something.

Node1
This may be VAAI CAW hangs:

Code: Select all

6/21 5:11:22.263999 1d5 HA: HANode::completeSscCommandRequest: (0x89) CHECK_CONDITION , sense: 0xe 0x1d/0x0 returned.

Disable VAAI CAW and ATS inside ESXi: https://kb.vmware.com/s/article/2146451
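A sketch of the corresponding ESXi advanced settings (option names taken from VMware's VAAI/ATS advanced settings; please verify against the KB above before applying):

Code: Select all

# Check the current values.
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

# Disable hardware-accelerated locking (ATS, which is implemented via the
# SCSI COMPARE AND WRITE 0x89 opcode seen above) and the ATS heartbeat.
# Set the values back to 1 to re-enable.
esxcli system settings advanced set -i 0 -o /VMFS3/HardwareAcceleratedLocking
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5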
If that does not help, disable it by editing the config file (a shell sketch of the same steps follows the list):
1. Make sure the HA devices are synchronized and the storage is available.
2. Stop the StarWindVSA service on one VM.
3. Go to /opt/StarWind/StarWindVSA/drive_c/StarWind.
4. Make a backup copy of StarWind.cfg.
5. Open StarWind.cfg.
6. Locate <VaaiCawEnabled value="yes"/>
7. Change it to <VaaiCawEnabled value="no"/>
8. Save and exit.
9. Start the service.
10. Repeat all the steps on the other VM.
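For reference, the same procedure as a shell sketch (the StarWindVSA systemd unit name and the sed edit are assumptions; adjust them to match your appliance):

Code: Select all

# Run on one VSA at a time, only after the HA devices are synchronized.
# The unit name "StarWindVSA" is assumed; verify with: systemctl list-units
systemctl stop StarWindVSA

cd /opt/StarWind/StarWindVSA/drive_c/StarWind
cp StarWind.cfg StarWind.cfg.bak    # keep a backup before editing

# Flip VaaiCawEnabled from "yes" to "no" in place.
sed -i 's|<VaaiCawEnabled value="yes"/>|<VaaiCawEnabled value="no"/>|' StarWind.cfg

systemctl start StarWindVSA
# Then repeat on the second VM once the first node is healthy again.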