CentOS 6.6 - LUN dropouts

craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Mon Jun 22, 2015 10:14 am

We have moved several CentOS 6.6 servers acting as Xen hosts from a Nexenta storage server to a StarWind server, but are experiencing serious issues with dropouts when we put a certain load on the storage system.

We use 10GbE NICs with multipath to StarWind, and the storage generally connects fine and performs OK, but once we push load, like migrating a VM's disk or provisioning a new VM, all the CentOS servers lose connection to the storage server.
Eventually the storage comes back after a few minutes, or after restarting the iSCSI service on the CentOS servers.

Everything is configured correctly end to end with jumbo frames, and the same iSCSI setup worked flawlessly for 3 years on Nexenta.
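
A quick sanity check that jumbo frames really pass end to end is a don't-fragment ping sized for a 9000-byte MTU (just a sketch; the address below is a placeholder for one of the StarWind portal IPs):

# 8972 bytes of payload = 9000-byte MTU minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation
ping -M do -s 8972 -c 4 10.10.10.10
# If this fails while a normal ping works, something in the path is not actually set to MTU 9000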

One thing to note is that the same StarWind server is also presenting storage to about 10 VMs on ESXi, and none of the VMware hosts or VMs experience any dropouts at the same time as the CentOS servers do.

Where would we even start with troubleshooting this?
Last edited by craggy on Sat Jul 04, 2015 12:03 am, edited 2 times in total.
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Mon Jun 22, 2015 11:27 am

We also see lots of these in the CentOS logs:

Jun 22 12:21:16 hv02 multipathd: sda: alua not supported
Jun 22 12:21:16 hv02 multipathd: sdb: alua not supported
Jun 22 12:21:56 hv02 multipathd: sda: alua not supported
Jun 22 12:21:56 hv02 multipathd: sdb: alua not supported
Jun 22 12:22:36 hv02 multipathd: sda: alua not supported
Jun 22 12:22:36 hv02 multipathd: sdb: alua not supported

Maybe we need to change our multipath.conf settings a bit for StarWind, as the defaults just worked out of the box for Nexenta.
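
Something like the device stanza below is what I have in mind for /etc/multipath.conf. This is only a sketch, not an official StarWind recommendation, and the vendor/product strings are assumptions that should be checked against what multipath -ll actually reports for the LUNs:

devices {
    device {
        # vendor/product are assumptions - confirm them with multipath -ll
        vendor                  "STARWIND"
        product                 "STARWIND"
        path_grouping_policy    multibus
        path_checker            tur
        path_selector           "round-robin 0"
        # only use the ALUA handler/prio if ALUA is enabled on the StarWind device
        hardware_handler        "1 alua"
        prio                    alua
        failback                immediate
        no_path_retry           12
        rr_min_io               100
    }
}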
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Mon Jun 22, 2015 1:03 pm

Cheers.

Does it matter that this guide is for XenServer as opposed to Xen on top of CentOS?
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Mon Jun 22, 2015 4:45 pm

I've followed the guide but it didn't help.

It turns out that even with multipathing disabled we experience the same dropouts.

We see a lot of these in the StarWind logs:

6/22 17:35:44.333 2010 PR: Set Unit attention 0x29/0x3 (0x0) for session 0xe6 from iqn.1994-05.com.redhat:962c39873de,00023D050000.
6/22 17:35:44.333 2010 Tgt: iqn.2008-08.com.starwindsoftware:win-pvc0ijv0c1n-onapp LUN 0: abort tasks - session [e7] not found!
6/22 17:35:44.333 2010 PR: Set Unit attention 0x29/0x3 (0x0) for session 0xe7 from iqn.1994-05.com.redhat:962c39873de,00023D060000.
6/22 17:35:44.333 2010 Tgt: iqn.2008-08.com.starwindsoftware:win-pvc0ijv0c1n-onapp LUN 0: abort tasks - session [114] not found!
6/22 17:35:44.333 2010 PR: Set Unit attention 0x29/0x3 (0x0) for session 0x114 from iqn.1994-05.com.redhat:727a27c9133a,00023D070000.
6/22 17:35:44.333 2010 Tgt: iqn.2008-08.com.starwindsoftware:win-pvc0ijv0c1n-onapp LUN 0: abort tasks - session [111] not found!
darklight
Posts: 185
Joined: Tue Jun 02, 2015 2:04 pm

Fri Jun 26, 2015 10:21 am

Hi craggy!

Jun 22 12:21:16 hv02 multipathd: sda: alua not supported

This string reminds me of the ALUA setting during the initial creation of the StarWind device. Is it possible that you changed something there while creating your devices? ALUA should definitely be enabled, I believe.
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Tue Jun 30, 2015 9:51 pm

Hi Darklight

I just created a standard target with clustering enabled.
Then I created a thick volume and added it to the target; nothing unusual other than that.

One thing I did do is set access rights so that the target is exposed to specific servers.
darklight
Posts: 185
Joined: Tue Jun 02, 2015 2:04 pm

Wed Jul 01, 2015 12:42 pm

Looks like the initiator (in your case Xen or CentOS) is dropping the connections to the StarWind targets, so you have to check the logs on that side and maybe play with the multipath policies (Round Robin, Last Active, and so on).

Also consider temporarily disabling the access rights for testing.
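
If the initiator side is giving up too quickly, the open-iscsi timeouts in /etc/iscsi/iscsid.conf are also worth a look. A sketch with example values only, not a recommendation (existing sessions need a logout/login or re-discovery before the node records pick these up):

# How long the initiator queues I/O waiting for a dead path before failing it up to multipath
node.session.timeo.replacement_timeout = 15
# iSCSI NOP-Out pings used to detect a dead connection
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10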
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Wed Jul 01, 2015 10:41 pm

I've disabled the access rights but it didn't help at all. I've also tried disabling multipath and playing around with timings, policies, etc., but it made no difference.

Interestingly, what I have found is that when L2 cache is enabled on the LUN in StarWind, the LUN drops out on the HV side during the provisioning of a Windows VM.
If I just leave L1 cache enabled and no L2, it works perfectly with no drops.

I've tried three different types of L2 cache (two SSDs and a PCIe mSATA card with an SSD) to rule out the SATA ports. Same issue with all three devices.
Worth mentioning that with the same StarWind server and L2 cache enabled, our ESXi hosts have no issues and performance is excellent.

Also, we have determined that this dropout issue occurs specifically when a Windows VM is either being provisioned or having its storage migrated from one LUN to another.
We have not been able to cause any dropouts by pushing heavy continuous load from a Windows VM or by provisioning a Linux VM. It's specifically related to Windows VMs.
Last edited by craggy on Sat Jul 04, 2015 12:06 am, edited 1 time in total.
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Thu Jul 02, 2015 9:50 am

OK, after watching many log files, we have found the specific point at which the LUN drops.

When we are provisioning a Windows VM, this command is issued to clone a .img file to the NTFS-formatted logical volume:
ntfsclone -q -r -O /dev/mapper/onapp--rb6q3prso1b17x-m29r7t1vde03rsX1 /onapp/templates/win12_x64_std_r2-ver3.6.img

The copy works perfectly at around 110 MB/s with no LUN drops.

ntfsclone v2013.1.13 (libntfs-3g)
Ntfsclone image version: 10.0
Cluster size : 4096 bytes
Image volume size : 21378363392 bytes (21379 MB)
Image device size : 21378367488 bytes
Space in use : 16785 MB (78.5%)
Offset to image data : 56 (0x38) bytes
Restoring NTFS from image ...
Warning : no alternate boot sector in image


Once the copy completes, ntfsclone does a sync and we see this in the log:

Syncing ...

At this specific point the LUN drops on the HVs (or the LUN gets locked, I'm not sure).

Running pvscan or similar on the HVs results in a hang, and other VMs on the HVs also hang.
After a few minutes the syncing completes, and usually restarting the iSCSI service on the HVs brings the LUN back online.

As I said above, disabling L2 cache on the LUN resolves the issue, but we need the cache because of the number of VMs that will be running.
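
For anyone who wants to reproduce this without provisioning a whole Windows VM, the pattern looks like "large buffered write, then one big flush", which can be mimicked with something like the following. This is only a sketch; /dev/mapper/onapp-scratchtest is a hypothetical scratch volume on the same StarWind LUN, so do not point it at anything holding data:

# Write ~16 GB of buffered data to the iSCSI-backed volume, then force the flush that ntfsclone does
dd if=/dev/zero of=/dev/mapper/onapp-scratchtest bs=1M count=16384
time sync
# In another terminal, watch the path state while the sync runs
watch -n1 multipath -ll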
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Thu Jul 02, 2015 11:14 am

Update:

After more testing we have found that enabling L2 cache but switching it to Write-Through instead of Write-Back also fixes the dropout issue.

So to summarize: when the "syncing" stage of ntfsclone happens, if the L2 cache is in WB mode the LUN drops for however long the sync takes. After the sync completes, restarting the iSCSI service on the HVs brings the LUN back online.
If I try to restart the iSCSI service while the sync is in progress, the LUN is blocked and the service restart hangs.
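
For reference, the recovery on the HVs boils down to something like this once the sync has finished (a rough sketch using the stock CentOS 6 open-iscsi and multipath tools):

# Check session and path state
iscsiadm -m session -P 1
multipath -ll
# Restart the initiator to bring the LUN back; this hangs if the sync is still in progress
service iscsi restart
# Re-read LVM metadata once the paths are back
pvscan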
craggy
Posts: 55
Joined: Tue Oct 30, 2012 3:33 pm

Sat Jul 04, 2015 12:02 am

Any SW staff want to chime in on this one?
Vladislav (Staff)
Staff
Posts: 180
Joined: Fri Feb 27, 2015 4:31 pm

Tue Jul 07, 2015 12:10 pm

Hello craggy,

We do not recommend using L2 cache in write-back (WB) mode in any case.

Our devs are going to change the cache flush algorithm, and we expect this issue to be resolved in the upcoming release.

Thank you for your contribution.