Starwind Cluster Service wont start

muhfugen · Mon Aug 07, 2017 6:44 pm

I am trying to follow your guide to enable the creation of VVols with Starwind VSAN. Unfortunately I am unable to do pretty much anything which requires using WMIC to access the STARWIND_ClusterService class. running

Code: Select all

wmic /namespace:"\\root\starwind" path STARWIND_ClusterService call CreateCluster name="TestCluster" login="MYLOGIN" password="MYPASSWORD"

gives the error

Code: Select all

Description: Not Found

and simply running

Code: Select all

wmic /namespace:"\\root\starwind" path STARWIND_ClusterService

gives the error

Code: Select all

Description: Invalid Class

.

After looking on the VSAN nodes I noticed that the Starwind Cluster Service is present on both nodes, but it wasn't running. If I try to manually start the service, it starts and then immediately stops. If I look through the log files there are not any log entries during the time when I attempted to manually start the service. The log entries with the password hash lines removed can be found here https://my.mixtape.moe/eyawgd.zip

If any of this is relevant, the Starwind nodes are running on Server 2016 Core, and are currently in the process of syncing a high availability target.

Wed Aug 09, 2017 3:27 pm

Dear muhfugen, could you please collect logs with our special tool:
https://knowledgebase.starwindsoftware. ... ogs-bat-2/
Please upload the archive with logs to any cloud storage and send us the link for further investigation.

muhfugen · Tue Aug 15, 2017 6:32 am

Sergey thanks for being willing to look in to this but I think i'm going to give VSAN a break. It just doesn't seem entirely stable at the moment in my environment, i've setup two nodes with it and have been evaluating it over the past few days. Even with 4x 480GB SSDs in a RAID 0 on each node and 10GbE between them performance rarely exceeded 1gbit/sec total in HA mode and iSCSI setup for MPIO round-robin, occasionally it would hit 2gbit/sec. Looking at the task manager on each of the nodes would show that the CPU and disk utilization were low, this is further verified by the disk activity lights which would just come on intermittently. And the re-sync process after rebooting a node would take over 8 hours.

More recently I tried breaking up the HA cluster in to separate nodes and it has gotten even worse. After creating the new target and storage vMotioning VMs to it, the target shit itself and would give IO errors such as "cp: can't stat 'Microsoft Active Directory 2016 #2/vmware-13.log': Input/output error" if I tried to copy files on it. Rebooting the VSAN node made the data accessible again, but now the management console can't connect to it saying that the starwind service wasnt running even though it is because I am evacuating data from it.

I appreciate that you guys give away NFR licenses and have a free version, and the VTL functions have worked well for me over the year or so, but it seems like there are still a lot of bugs in VSAN. And right now I just dont have the time to work with you guys to fix them. This is only a homelab so its not like it is a production environment is down, but i'm going to need some time to get everything working and will probably wait till version 9 before evaluating it again. Once I move everything back to plain datastores on the host i'll try uploading the logs so you can take a look at them if you're interested.

The setup was:
Node 1
Hypervisor: ESXi 6.5
OS: Server 2016
Motherboard: SuperMicro X9DAE
CPU: dual E5-2660v2
RAM: 160GB
NICs: dual Intel X520
RAID: Areca 1883ix-24, 4x 480GB Seagate 600 Pro SSDs in a RAID 0 with a 64k stripe size. Multiple LUNs were provisioned from this array, for VSAN boot and data
The VM was configured with 4 vCPUs, 24GB RAM with 100% reservation. The data disk was passed through as a vRDM. Both the replication and iSCSI networks were on separate NICs.

Node 2
Hypervisor: ESXi 6.5
OS: Server 2016
Motherboard: SuperMicro X9DRH-iTF
CPU: dual E5-2650
RAM: 128GB
NICs: dual Intel X520
RAID: LSI 9207 with IR firmware, 4x 480GB Seagate 600 Pro SSDs in a RAID 0 with a 64k stripe size. Due to firmware limitations a single LUN was provisioned from this array.
The VM was configured with 4 vCPUs, 24GB RAM with 100% reservation. The data disk was a VMDK. Both the replication and iSCSI networks were on separate NICs.

The target was configured as a 3TB thin provisioned LSFS volume with deduplication enabled, and synchronous writes.

On a side note, after breaking up a replicated target, it constantly complains replication partners are not set. And the option to manually defrag the target volume is not there.

Wed Aug 16, 2017 2:59 pm

Looks like you have to convert the virtual disk to Thick Provision Eager Zeroed. This is required to enable clustering features. I hope this should help:
https://kb.vmware.com/selfservice/micro ... Id=1035823