Tue Aug 15, 2017 6:32 am
Sergey, thanks for being willing to look into this, but I think I'm going to give VSAN a break. It just doesn't seem entirely stable in my environment at the moment. I've set it up on two nodes and have been evaluating it over the past few days. Even with 4x 480GB SSDs in a RAID 0 on each node and 10GbE between them, with HA replication enabled and iSCSI set up for MPIO round-robin, total throughput rarely exceeded 1 Gbit/sec and only occasionally hit 2 Gbit/sec. Task Manager on each node showed low CPU and disk utilization, which the disk activity lights confirmed by only coming on intermittently. On top of that, the re-sync after rebooting a node would take over 8 hours.
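In case it matters, the multipathing on the ESXi hosts was set up roughly like this (the naa ID below is just a placeholder for the StarWind LUNs, and the iops=1 tweak is something I was experimenting with while chasing throughput, not necessarily what StarWind recommends):

    # show each device and the path selection policy it is currently using
    esxcli storage nmp device list

    # force round robin on the StarWind LUN (placeholder device ID)
    esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

    # switch paths after every IO instead of the default 1000 IOs
    esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type=iops --iops=1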
More recently I tried breaking the HA cluster up into separate nodes, and it has gotten even worse. After creating the new standalone target and Storage vMotioning VMs to it, the target shit itself and started throwing I/O errors such as "cp: can't stat 'Microsoft Active Directory 2016 #2/vmware-13.log': Input/output error" whenever I tried to copy files on it. Rebooting the VSAN node made the data accessible again, but now the management console can't connect to it; it claims the StarWind service isn't running, even though it clearly is, because I'm still evacuating data from it.
I appreciate that you guys give away NFR licenses and have a free version, and the VTL functionality has worked well for me over the past year or so, but it seems like there are still a lot of bugs in VSAN, and right now I just don't have the time to work with you guys to fix them. This is only a homelab, so it's not like a production environment is down, but I'm going to need some time to get everything working and will probably wait until version 9 before evaluating it again. Once I move everything back to plain datastores on the hosts, I'll try uploading the logs so you can take a look at them if you're interested.
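When I do upload them, my plan is just to grab the standard support bundle from each ESXi host and attach the StarWind service logs from the Windows VMs alongside it (I'm assuming that combination is what you'd want to see):

    # run on each ESXi host over SSH; it prints the location of the generated .tgz bundle
    vm-support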
The setup was:
Node 1
Hypervisor: ESXi 6.5
OS: Server 2016
Motherboard: SuperMicro X9DAE
CPU: dual E5-2660v2
RAM: 160GB
NICs: dual Intel X520
RAID: Areca 1883ix-24 with 4x 480GB Seagate 600 Pro SSDs in a RAID 0 with a 64k stripe size. Multiple LUNs were provisioned from this array for the VSAN boot and data volumes.
The VM was configured with 4 vCPUs and 24GB RAM with a 100% reservation. The data disk was passed through as a vRDM. The replication and iSCSI networks were each on their own NIC.
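For what it's worth, the vRDM pointer on node 1 was created with something along these lines (the device ID, datastore, and folder names below are just placeholders; I used virtual compatibility mode rather than physical):

    # create a virtual-compatibility-mode RDM pointer file for the Areca data LUN
    # (placeholder device ID, datastore and folder names)
    vmkfstools -r /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx \
        /vmfs/volumes/datastore1/StarWind-VSAN-01/starwind-data-rdm.vmdk

The pointer .vmdk then gets attached to the StarWind VM as an existing disk.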
Node 2
Hypervisor: ESXi 6.5
OS: Server 2016
Motherboard: SuperMicro X9DRH-iTF
CPU: dual E5-2650
RAM: 128GB
NICs: dual Intel X520
RAID: LSI 9207 with IR firmware, 4x 480GB Seagate 600 Pro SSDs in a RAID 0 with a 64k stripe size. Due to firmware limitations, only a single LUN was provisioned from this array.
The VM was configured with 4 vCPUs and 24GB RAM with a 100% reservation. The data disk was a VMDK. The replication and iSCSI networks were each on their own NIC.
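Node 2's data VMDK was created from the ESXi shell with something like this (I don't remember the exact size, and the datastore and folder names below are just placeholders):

    # create the data disk for the StarWind VM (placeholder size, datastore and folder names;
    # add -d eagerzeroedthick or -d thin to choose the provisioning type)
    vmkfstools -c 1700G /vmfs/volumes/datastore2/StarWind-VSAN-02/starwind-data.vmdk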
The target was configured as a 3TB thin-provisioned LSFS volume with deduplication and synchronous writes enabled.
On a side note, after breaking up a replicated target, it constantly complains that replication partners are not set, and the option to manually defragment the target volume is no longer there.