VSAN falling over every 12 hours. Maxed CPU

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Anatoly (staff), Max (staff)

Blowfish-IT
Posts: 12
Joined: Mon Aug 22, 2016 12:39 pm

Sat Feb 03, 2018 10:45 am

Hi,

I wonder if this is a known issue or something new. We recently rebuilt our environment and, in doing so, downloaded the latest version of StarWind.
This particular environment is a bit older than the others and runs on HP DL320 G6s; we have three of them in the cluster.

We originally had XenServer 7.2 but, thinking we had a problem with it, have downgraded to 7.1 LTSR.
We have built both Server 2016 and Server 2012 R2 SAN nodes and set up replication between two of them. Everything syncs fine to start with, but after about 12 hours one of the CPU cores on one of the nodes goes to 100% (we can see this in the Performance tab in XenCenter) and the system pretty much stops responding.
If we have live disks on there, we start seeing multipath errors.

This only appears to have started with the latest release of StarWind, and we may go back to the previous version. Is anyone else having the same problem?
I also see there is currently a CPU compatibility problem in the Linux VSA port and wondered whether the two are related.

Any help would be very much appreciated. It's driving us all nuts here!

Kind Regards,
Rob Charlton
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Mon Feb 05, 2018 3:00 pm

Rob,

Is there any information on which process consumes that core at 100% in Performance Monitor? Are there any events in the Windows Event Viewer at approximately that time?
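For reference, a quick way to answer that from inside the storage VM is to sample per-process CPU usage for a few seconds and list the top consumers. The sketch below uses the third-party psutil Python package rather than any StarWind tooling, so treat it as a generic example, not an official procedure.

```python
# Rough sketch: find which process is pinning a core inside the SAN node.
# Requires the third-party "psutil" package (pip install psutil); not StarWind-specific.
import psutil
import time

# Prime the per-process CPU counters, then sample for a few seconds.
for p in psutil.process_iter():
    try:
        p.cpu_percent(None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(5)

samples = []
for p in psutil.process_iter(['pid', 'name']):
    try:
        samples.append((p.cpu_percent(None), p.info['pid'], p.info['name']))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

# psutil does not normalise by core count, so a single pegged core shows up
# as roughly 100% for the offending process; print the top five either way.
for cpu, pid, name in sorted(samples, reverse=True)[:5]:
    print(f"{cpu:6.1f}%  pid={pid:<6}  {name}")
```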
KevinR
Posts: 8
Joined: Fri Sep 25, 2015 3:52 am

Mon Feb 05, 2018 8:15 pm

While I haven't experienced a complete failure of the storage system as described by Blowfish-IT, I have noticed that both nodes in my HA setup, running build 11818 for the last month, continuously read around 200-500 MB/s, heavily skewed towards the LSFS device files, as you can see from the screenshot below from one of the nodes. This read activity occurs even when all iSCSI workloads are disconnected from the cluster (I think another poster recently described similar activity, but I can't find the post now).

This was a clean installation (archive data, full uninstall of 11456, cleanup, install 11818) with all new LSFS devices, and the data was migrated as per the installation notes. I did not see this behaviour under 11456, and the only thing I changed besides the VSAN build is that I enabled deduplication (I had never used it before). After restarting both nodes everything functions fine for a while, but after several hours the continuous reads come back and consume more and more disk I/O.

The only two StarWind errors since the last restart a week ago that I can find in the Windows Application log are: GetQueuedCompletionStatus returned Error 6, File = '0x000000000000ADCC', and GetQueuedCompletionStatus returned Error 6, File = '0x000000000000AA88'. This cluster is running on Windows Server 2012 R2 with all the latest updates from Microsoft.
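In case it helps narrow things down, a rough way to confirm which process is actually generating those reads (and that it really is the StarWind service rather than something else on the box) is to diff per-process I/O counters over a short interval. This is a generic sketch using the third-party psutil package, not an official diagnostic.

```python
# Rough sketch: identify the process generating the constant background reads.
# Requires the third-party "psutil" package; not a StarWind-provided tool.
import psutil
import time

INTERVAL = 10  # seconds between the two samples

def snapshot():
    counters = {}
    for p in psutil.process_iter(['pid', 'name']):
        try:
            io = p.io_counters()
            counters[p.info['pid']] = (p.info['name'], io.read_bytes)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return counters

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

rates = []
for pid, (name, read_after) in after.items():
    if pid in before:
        read_before = before[pid][1]
        rates.append(((read_after - read_before) / INTERVAL / 1024 / 1024, name, pid))

# Top five readers over the interval, in MB/s.
for mbps, name, pid in sorted(rates, reverse=True)[:5]:
    print(f"{mbps:8.1f} MB/s read  pid={pid:<6}  {name}")
```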

I should point out that, from a workload standpoint, everything on the LSFS devices appears to be OK (nothing corrupt) apart from sluggish performance. However, a few weeks ago node 'B' ran out of space on the disk holding the VSAN device files, which caused VSAN on both nodes to grind to a halt and resulted in corruption of some of the LSFS devices on the 'surviving' node (after restart), which I had to restore from backups. The host OSs on both nodes were still operational and no power outage was involved. All VSAN devices are running with write-through L1 cache (no L2/flash cache), and the array controllers hosting the LSFS files have 1GB of flash/super-capacitor-backed write-back cache. I've never seen that happen before.
[Attachment: sw11818-diskusage-rs.png (disk usage on one of the nodes)]
Kevin
Blowfish-IT
Posts: 12
Joined: Mon Aug 22, 2016 12:39 pm

Tue Feb 06, 2018 10:11 pm

So,

We rebuilt both SAN nodes with Server 2012 R2, just to rule out any issues with 2016 (we had recently upgraded them), and re-created all the XenServer hypervisor nodes with 7.1 LTSR.
The issue occurred again.
We rebuilt them once more and configured the HAImage disk with an 8GB RAM cache, 200GB of flash cache on SSD and 2TB of SATA storage.
We created the devices as thick provisioned and allowed the six or so hours for the data to replicate between the nodes.

This time, however, we gave each storage VM 16GB of memory and the issue did not recur. Earlier today I checked the two nodes: both were running with about 10GB of committed RAM and only about 20% of the page file in use (which seems to be normal for Windows).
I decided to drop one of the nodes from 16GB to 12GB of RAM, and the issue recurred almost instantly once the node restarted.
I then increased this node to 14GB of RAM, and since then I haven't seen the issue recur, nor any multipath errors.
So node 1 currently has 14GB, node 2 still has the 16GB allocated, and there are no performance issues.

I will see how node 1 performs over the next week or so and, if there are no issues, will probably bring node 2 down to the same 14GB.

It seems the issue occurs when the system doesn't have enough free RAM; my guess is that it starts paging to disk, which causes the system to freeze and eventually the StarWind SAN service to crash.
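For anyone else chasing this, a crude way to catch the condition before the node locks up is to watch available RAM and page file usage on each storage VM. The sketch below uses the third-party psutil package, and the 1 GiB low-water mark is an arbitrary value of my own choosing rather than a StarWind recommendation.

```python
# Rough sketch: warn when a storage VM is getting close to paging.
# Requires the third-party "psutil" package; the threshold is an example only.
import psutil
import time

LOW_WATERMARK = 1 * 1024 ** 3  # bytes of available RAM considered "too low"

while True:
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    line = (f"available={vm.available / 1024**3:5.2f} GiB  "
            f"used={vm.percent:5.1f}%  pagefile_used={swap.percent:5.1f}%")
    if vm.available < LOW_WATERMARK:
        print("WARNING, node close to paging: " + line)
    else:
        print(line)
    time.sleep(60)
```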

This is probably all down to us being stingy with the RAM and not giving it enough; lesson learnt after many nights looking at this problem!

Hope this helps others.

Kind Regards,
Blowfish-IT
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Wed Feb 07, 2018 8:40 am

Kevin, the behavior you describe looks like LSFS volume defragmentation. Although LSFS itself is designed to turn random writes into sequential ones, it still requires some defragmentation, during which the outdated journal files are processed and reclaimed once they are no longer required by the system.
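To illustrate why that can show up as sustained reads even with no client I/O: in a log-structured layout, reclaiming space means reading old segments back in and copying any still-live blocks forward to the head of the log before the stale data can be dropped. The toy sketch below is a generic illustration of that copy-forward pass, not StarWind's actual implementation.

```python
# Toy illustration of log-structured cleaning (NOT StarWind's code): new writes
# always append to the head of the log, so reclaiming space requires reading old
# segments back in and re-appending their still-live blocks. That copy-forward
# pass is read-heavy even when no client workload is connected.
from dataclasses import dataclass

@dataclass
class Block:
    id: int
    data: bytes

def clean_segment(segment, live_ids, log):
    """Read one old segment and re-append any blocks that are still live."""
    reclaimed = 0
    for block in segment:          # this loop is the background read load
        if block.id in live_ids:
            log.append(block)      # copy-forward: live data moves to the log head
        else:
            reclaimed += 1         # overwritten/stale journal data is simply dropped
    return reclaimed

# Example: two old segments, only blocks 1 and 3 are still referenced.
old_segments = [[Block(1, b"a"), Block(2, b"b")], [Block(3, b"c"), Block(4, b"d")]]
live_ids = {1, 3}
log = []
freed = sum(clean_segment(seg, live_ids, log) for seg in old_segments)
print(f"copied {len(log)} live blocks forward, reclaimed {freed} stale blocks")
```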
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Wed Feb 07, 2018 12:39 pm

Blowfish-IT,

Thank you for sharing your experience and the results of your troubleshooting.
In fact, a StarWind VM or physical host needs at least 4GB of RAM, and 8GB is preferable. This figure does not include the size of the RAM cache. It therefore looks like your storage VMs were running out of RAM, which caused the delays in responding to management commands. I have seen similar behavior with storage VMs in ESXi environments: when these VMs lacked RAM, the whole system could freeze, and once enough RAM was provisioned the issues went away. Storage VMs simply need enough RAM to keep the whole system performing stably.
I suppose the same is true for Xen hyperconverged scenarios, which, by the way, are not supported for production environments.
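Putting the figures from this thread together gives a rough back-of-the-envelope sizing for the storage VMs. Treating the 4-8GB baseline as covering the whole VM (guest OS plus StarWind service, excluding cache) is one reading of the guidance above rather than an official formula, but it lines up with the 12GB failure and the stable 14-16GB configurations reported earlier.

```python
# Back-of-the-envelope RAM sizing for a StarWind storage VM, using only the
# numbers mentioned in this thread. Treating the 4-8 GB baseline as the whole
# VM (guest OS plus StarWind service, excluding cache) is an assumption.
BASELINE_MIN_GB = 4        # minimum RAM for the StarWind VM, excluding RAM cache
BASELINE_PREFERRED_GB = 8  # preferred RAM, excluding RAM cache
RAM_CACHE_GB = 8           # L1 (RAM) cache configured on the HA device above

minimum_vm_ram = BASELINE_MIN_GB + RAM_CACHE_GB          # 12 GB total
preferred_vm_ram = BASELINE_PREFERRED_GB + RAM_CACHE_GB  # 16 GB total

print(f"minimum   : {minimum_vm_ram} GB")    # 12 GB: the size that crashed in practice
print(f"preferred : {preferred_vm_ram} GB")  # 16 GB: 14-16 GB ran without issues
```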