New Cluster Slow and Can't Sync


MeCJay12
Posts: 16
Joined: Fri May 07, 2021 1:25 pm

Tue Jul 20, 2021 12:39 am

Hello! Following my recent support request, I'm setting up new servers for two reasons: a speed bump, and to reconfigure my networking to vSAN's recommendations. The issue I'm having shows up in testing and is preventing me from promoting the new cluster to production.

I'm using a pair of Dell R720xd servers. Each server has Windows Server Datacenter 2016, an 8-core CPU, 128GB of RAM, 4x10G NICs (two bonded for general network/heartbeat and two bonded directly between the servers for sync), a PERC H710 Mini, two 120GB SSDs in RAID 1 for the OS, and five 4TB SSDs in RAID 0 for data.

The issue I'm having is that when the cluster is put under any kind of load, my vSAN HA disks become desynced and cycle between syncing and not synced every ~10 minutes. On top of that, the performance of all 3 HA disks is very poor (VMs take seconds to register a click and the file server takes minutes to show a folder). When the system was unloaded, I could run a disk benchmark on the raw drive or the vSAN drive and get 2200-2600MB/s sequential reads. Once loaded, the raw drive is still at the same speed but the vSAN drive drops to 800MB/s (which should be fine but obviously isn't).

The three HA disks I have are: one for VMs, one for a file server, and one for a remote machine, all stored on the RAID 0. The test load I'm using is an older checkpoint of my production data/VMs, so I know that the system should be able to handle it.

The production system is two Dell R510s, each with WS DC 2016, dual 6-core CPUs, 128GB of RAM, 2x10G NICs (bonded for network and sync, with 1x1G directly between nodes for heartbeat), a Dell RAID card whose model I can't remember, four 500GB SSDs in RAID 5, and six 6TB HDDs in RAID 6. This cluster has the same 3 HA disks, with the VMs and OS on the SSD RAID and the other two HA disks on the HDDs. This cluster is working perfectly other than a recent VSS issue with vSAN.

I have logs from the new cluster available on request. Thanks in advance!
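As a rough sanity check on the figures above, here is a minimal sketch that compares the quoted benchmark numbers with the ideal ceiling of the two-link sync bond. It assumes 10 Gbit/s per link, perfect aggregation across the bond, and no protocol overhead, none of which is confirmed in the post. It only bounds traffic that actually crosses the sync link (mirrored writes); sequential reads may be served locally and never touch it. The function and variable names are illustrative.

```python
def link_ceiling_mb_s(links: int, gbit_per_link: float = 10.0) -> float:
    """Ideal throughput of a bonded link in MB/s (decimal megabytes).

    Assumes perfect aggregation and zero protocol overhead.
    """
    bits_per_second = links * gbit_per_link * 1e9
    return bits_per_second / 8 / 1e6


# Two 10G NICs bonded directly between the servers for sync.
sync_ceiling = link_ceiling_mb_s(links=2)

# Figures quoted in the post (MB/s, sequential reads).
unloaded_low, unloaded_high = 2200, 2600
loaded = 800

print(f"Ideal sync-link ceiling: {sync_ceiling:.0f} MB/s")
print(f"Unloaded benchmark:      {unloaded_low}-{unloaded_high} MB/s")
print(f"Loaded vSAN benchmark:   {loaded} MB/s "
      f"({loaded / sync_ceiling:.0%} of the ideal ceiling)")
```

Under these assumptions the loaded figure sits well below the link ceiling, which suggests (but does not prove) that raw sync bandwidth alone is not the limiting factor.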
yaroslav (staff)
Staff
Posts: 2347
Joined: Mon Nov 18, 2019 11:11 am

Tue Jul 20, 2021 3:47 am

RAID 0 for data is not recommended. See the recommended RAID settings here: https://knowledgebase.starwindsoftware. ... ssd-disks/.
"The issue I'm having is that when the cluster is put under any kind of load, my vSAN HA disks become desynced and cycle between syncing and not synced every ~10 minutes"
What type of load exactly are you running? Can I have the logs from your system? See this article on how to collect the logs: https://knowledgebase.starwindsoftware. ... collector/.
Here is how performance should be benchmarked: https://www.starwindsoftware.com/best-p ... practices/. Please repeat the tests as described there.

Poor performance inside the VMs may be related to VMQ, RSS, and RSC settings. Please disable those on the hardware level and inside the VMs. Finally, make sure to have Fixed disks in your VMs.
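For the adapter-level part of this advice, a minimal sketch of the commands involved. It only builds and prints PowerShell command lines and runs nothing. `Disable-NetAdapterVmq`, `Disable-NetAdapterRss`, and `Disable-NetAdapterRsc` are cmdlets from the Windows NetAdapter module; the adapter names are hypothetical placeholders (the real ones can be listed with `Get-NetAdapter`).

```python
# Hypothetical adapter names; replace with the names shown by Get-NetAdapter.
ADAPTERS = ["Sync1", "Sync2"]

# One NetAdapter-module cmdlet per offload feature mentioned above.
CMDLETS = {
    "VMQ": "Disable-NetAdapterVmq",
    "RSS": "Disable-NetAdapterRss",
    "RSC": "Disable-NetAdapterRsc",
}


def build_commands(adapters):
    """Return one PowerShell command per (adapter, feature) pair."""
    return [
        f'{cmdlet} -Name "{name}"'
        for name in adapters
        for cmdlet in CMDLETS.values()
    ]


# Print the commands for review; they are not executed here.
for command in build_commands(ADAPTERS):
    print(command)
```

The printed lines would be run from an elevated PowerShell session on each host, and the same cmdlets could be used inside a Windows guest. Printing rather than executing keeps the step reviewable before anything changes on a live adapter.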

Also, we do not recommend running VMs from snapshots for a long time. Please see the articles below regarding VM snapshots:
Checkpoints and Snapshots Overview
https://docs.microsoft.com/en-us/previo ... v%3Dws.11)
Disadvantages in Hyper-V Snapshotting
https://social.technet.microsoft.com/wi ... tting.aspx
Avoid using checkpoints on a virtual machine that runs a server workload in a production environment
https://docs.microsoft.com/en-us/window ... production

Considerations:
We do not recommend using virtual machine checkpoints as a permanent data or system recovery solution. A backup solution helps provide protection that is not provided by checkpoints. Even though virtual machine checkpoints provide a convenient way to store different points of system state, data, and configuration, there are some inherent risks of unintended data loss if they are not managed appropriately. Checkpoints do not protect against problems that may occur on the host, such as a hardware malfunction on the physical computer or a software-related issue in the management operating system. Also, applications that run in a virtual machine are not aware of the snapshot, and will not be able to adjust appropriately.
Keep the following considerations in mind, especially if you plan to use checkpoints on a virtual machine in a production environment:
· The presence of a virtual machine checkpoint reduces the disk performance of the virtual machine.
· We do not recommend using checkpoints on virtual machines that provide time-sensitive services, or when performance or the availability of storage space is critical.
How could Hyper-V snapshots impact a virtual machine's performance?
As shown previously, when a Hyper-V administrator takes snapshots of a virtual machine, the virtual machine needs to read its data from more and more files. Its performance then starts to degrade and may eventually become very poor.
Impact
Available space may run out on the physical disk that stores the checkpoints files. If this occurs, no additional disk operations can be performed on the physical storage. Any virtual machine that relies on the physical storage could be affected.
If physical disk space runs out, any running virtual machine that has checkpoints or virtual hard disks stored on that disk may be paused automatically. Hyper-V Manager shows the status of these virtual machines as "paused-critical".
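Since running out of space on the volume that holds checkpoint files can pause VMs, a simple free-space check is one way to catch this early. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the 15% threshold and the path are arbitrary assumptions, not values from the articles above.

```python
import shutil


def is_low(total_bytes: int, free_bytes: int, threshold: float = 0.15) -> bool:
    """True if the free fraction of the volume is below the threshold."""
    if total_bytes <= 0:
        return True  # treat an unreadable or empty volume as a problem
    return free_bytes / total_bytes < threshold


def check_volume(path: str, threshold: float = 0.15) -> bool:
    """Check the volume containing `path` for low free space."""
    usage = shutil.disk_usage(path)
    return is_low(usage.total, usage.free, threshold)


if __name__ == "__main__":
    # Assumed path; point this at the volume that stores checkpoint files.
    path = "."
    state = "LOW" if check_volume(path) else "ok"
    print(f"Free space on volume containing {path!r}: {state}")
```

Splitting the threshold logic from the disk query keeps the comparison testable without touching a real volume.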

Will be waiting for your logs.