VSAN L1 Write-Back Cache Behavior

glecit
Posts: 5
Joined: Tue Jul 29, 2014 4:07 pm

Tue May 19, 2020 4:41 pm

We're running Starwind VSAN 8 build 13182 on a 2-node HA cluster. We have 4 CSVs on separate flat images, all of which have an L1 write-back cache of 1-5GB and no L2 cache.

CSV4 is backed by a slow spinning array and has a 5GB L1 write-back cache. Our workload for CSV4 consists of bursts of 1-2GB of sequential writes every 15-30 minutes, plus mostly sequential reads of 1-4GB at random times, with 25-50% of those reads hitting the last 5GB written.

Based on performance, it appears that all writes to CSV4 happen at the underlying array's (slow) speed, as if the cache were full or in write-through mode. In the StarWind console, the cache "Usage" figure for this image (and for all CSVs) always shows 100%. I'm wondering whether the cache is never being flushed, or whether it is being flushed but the memory is never marked as free for future writes, so it only flushes when forced to. It looks like it's configured to flush after 5 seconds. I checked the StarWind service logs but didn't see any problems related to caching. I've not yet tested whether the other CSVs also perform as if in write-through mode.

My questions are:
  1. Is this a known issue or might we be doing something wrong? I'm happy to pm a Starwind log collector zip if this requires investigation.
  2. What are the criteria for the lazy cache writer to flush?
  3. When write-back caching is working correctly, how are flushing writes prioritized relative to reads when the cache is not full?
    a. Example: If I have a free 5GB cache, write 2GB to it (which is almost immediately acked to the client), and then try to read non-cached data while the 2GB is still flushing to my slow array, will VSAN prioritize my read?
    b. Will it pause the flush while disk activity is high?
    c. How does it prioritize flushing when the cache is full?
Thanks very much for any insight.
yaroslav (staff)
Staff
Posts: 2339
Joined: Mon Nov 18, 2019 11:11 am

Wed May 20, 2020 9:18 am

Hi,

Thank you for your question.
Could you please update to the latest build and see if the problem persists? Download it at https://www.starwindsoftware.com/tmplin ... ind-v8.exe. See the general update procedure at https://knowledgebase.starwindsoftware. ... d-version/. Please flush the cache before stopping the service.
You can flush the cache with the FlushCacheAll script in StarWindX.

Could you share the logs with me here before you start the update procedure? Please use Google Drive for that purpose.

What you can do is set the disk shares to 4000 for the VMDK.

Just a couple of questions to better understand the setup:
1. Is it a thick-provisioned eager-zeroed VMDK?
2. What is the underlying storage configuration?
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Thu Nov 05, 2020 5:32 pm

A related question on this topic: what is the dirty cache flushing behaviour for the L1 algorithm in write-back mode?

Specifically, is there any sort of automated timeout for dirty cache blocks, such that they are flushed out of the cache after a certain maximum time period? Or can dirty cache blocks sit there forever unless someone manually triggers a flush?

If there is an auto-flush, what is that time period, and is it configurable, e.g., via PowerShell or by editing files? Or is all dirty cache simply flushed as soon as the related physical storage device's write queue drops below a certain level?

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2339
Joined: Mon Nov 18, 2019 11:11 am

Fri Nov 06, 2020 11:53 am

Good question.
The cache is flushed if there has been no I/O to the device for some time and there are dirty blocks.
The cache is flushed on a write if all of its blocks are dirty.
The cache is flushed if the device is removed or the StarWind Service is stopped.
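For what it's worth, those three triggers can be sketched as a toy model. This is only an illustration of the policy as described, not StarWind's actual implementation; the class name, the block-level granularity, and the dictionary "backing store" are all my own assumptions:

```python
import time

class WriteBackCache:
    """Toy model of the flush policy described above (not StarWind's code):
    1) flush when the device has been idle for a while and dirty blocks exist,
    2) flush on a write that finds every cache block dirty,
    3) flush when the device is removed or the service stops."""

    def __init__(self, capacity_blocks, idle_flush_secs=5.0):
        self.capacity = capacity_blocks
        self.idle_flush_secs = idle_flush_secs  # the "5s" mentioned later in the thread
        self.dirty = {}             # block_id -> data not yet on the backing store
        self.backing_store = {}     # stands in for the slow underlying array
        self.last_io = time.monotonic()

    def write(self, block_id, data):
        # Trigger 2: every block is dirty -> flush before accepting the new write.
        if len(self.dirty) >= self.capacity:
            self.flush()
        self.dirty[block_id] = data         # ack to the client happens here
        self.last_io = time.monotonic()

    def tick(self):
        # Trigger 1: idle for idle_flush_secs with dirty blocks -> flush.
        if self.dirty and time.monotonic() - self.last_io >= self.idle_flush_secs:
            self.flush()

    def close(self):
        # Trigger 3: device removal or service stop -> flush.
        self.flush()

    def flush(self):
        self.backing_store.update(self.dirty)   # the slow array writes happen here
        self.dirty.clear()
```

Note that in this model a full cache never gets ahead of the slow array: once all blocks are dirty, each new write first pays the cost of a flush, which would look exactly like the write-through behaviour described at the top of the thread.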

Is your setup still doing fine, btw?
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Fri Nov 06, 2020 10:48 pm

Thanks, Yaroslav. I was kind of hoping for a better idea of what "some time" is in "no I/O to the device for some time". I mean, are we talking one second, one minute...?

I thought it was running well, so I enabled Windows Backup in Task Scheduler on one host. Things did not go well. Not sure what happened; still investigating. Interestingly, all 5 channels in both directions, except for one channel on one witness, reported down (I took screenshots). After restarting both VSAN services, the channels came back up, but I had to force synchronization on one host.

I'll pull snapshots.

Oh, just to be clear, I did NOT enable any VSAN caching. Just thinking.

-- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2339
Joined: Mon Nov 18, 2019 11:11 am

Sat Nov 07, 2020 10:42 am

Hi,
We are talking about 5s. Thanks for the additional info. Please add logs too.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Sat Nov 07, 2020 6:21 pm

Thank you Yaroslav, that's helpful!

The logs are there, have PMed you.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Mar 04, 2021 1:24 pm

Have you resolved the issue? Was our support helpful?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

yaroslav (staff)
Staff
Posts: 2339
Joined: Mon Nov 18, 2019 11:11 am

Thu Mar 04, 2021 2:02 pm

Hello Anton,

Yes, the problem was resolved: there was a key issue, and both nodes had been paused in the cluster. Everything looked good by the time we closed the case.