VMs keep getting corrupted

Sajmon
Posts: 11
Joined: Fri Mar 22, 2019 3:39 pm

Thu Jun 20, 2019 10:59 am

Hello all,
I am writing to get my setup checked, as the virtual machines I run in the cluster keep getting corrupted for some reason. I have a 2-node hyper-converged Hyper-V cluster with 2 CSV disks, each with L1 and L2 cache. The cluster works well and the failover tests all pass, but I run into weird issues from time to time, and they only happen to machines residing on the cluster storage.

Sometimes a VM runs OK for a week, then freezes and is unable to boot. Sometimes I issue a shutdown command and the machine locks up and does not respond, leaving me with no option but to kill the VM process, which corrupts the VHDX. Sometimes a VM gets corrupted during failover to the other host: all looks fine, but the machine stops responding and never boots again. This only happens to clustered VMs with network connectivity (that is, VMs that actually have some IO during the day), and never to VMs stored locally on the system drive or to clustered VMs without network connectivity.

I went through the settings a zillion times and found nothing, except that I have a RAID 5 spindle array where RAID 0, 1 or 10 is recommended. It is non-standard, but I'd expect it to cause slow response, not data corruption. The second thing is that I set up the Round Robin MPIO policy instead of Least Queue Depth, as LQD had caused lots of problems, such as CSVs turning into RAW devices, and the RR policy got rid of them.

Other than that, here are my config script and swdsk file to check. I must be missing something simple but important, and it would be a shame to have to destroy this cluster, as I really like the technology and want to use it as a POC for possible future customers. Any help appreciated. Thanks.

SWDSK file:

<device active="true" plugin="imagefile" name="imagefile">
<storages>
<storage id="1" type="device" name="imagefile" lun="0x0">
<interval size="1370" units="GB"/>
<inquiry>
<serial_id>4FCC9E87A57E297D</serial_id>
<vendor id="STARWIND"/>
<product id="STARWIND " revision="0001"/>
<eui_64>4FCC9E87A57E297D</eui_64>
</inquiry>
<geometry>
<sector size="4096" psize="4096"/>
<track sectors="16"/>
<cylinder tracks="32" count="65535"/>
</geometry>
<caching>
<cache type="write-back" size="4" units="GB" level="1">
<storages>
<storage_ref id="1"/>
</storages>
</cache>
<cache type="write-through" size="110" units="GB" level="2">
<storages>
<storage_ref id="4"/>
</storages>
</cache>
</caching>
</storage>
</storages>
</device>
<system>
<resources>
<storages>
<storage id="1" name="RAM" type="RAM">
<interval size="4" units="GB"/>
</storage>
<storage id="2" name="My computer\E\CSV2\MasterCSV2.img" type="file">
<interval size="1370" units="GB"/>
</storage>
<storage id="4" name="My Computer\F\L2cache\L2cacheCSV2.swdsk" type="device">
<interval size="110" units="GB"/>
</storage>
</storages>
<network/>
</resources>
</system>


CreateCSV script:

$firstNode = new-Object Node

$firstNode.HostName = "xx.xx.xx.xx"
$firstNode.ImagePath = "My computer\E\CSV2"
$firstNode.ImageName = "MasterCSV2"
$firstNode.Size = 1402880
$firstNode.CreateImage = $true
$firstNode.TargetAlias = "MasterCSV2"
$firstNode.AutoSynch = $true
$firstNode.SyncInterface = "#p2=10.10.12.13:3260,10.10.12.14:3260"
$firstNode.HBInterface = "#p2=10.10.11.11:3260,10.10.13.11:3260,172.23.99.11:3260"
$firstNode.CacheSize = 4096
$firstNode.CacheMode = "wb"
$firstNode.PoolName = "CSVpool2"
$firstNode.SyncSessionCount = 1
$firstNode.ALUAOptimized = $true

#
# device sector size. Possible values: 512 or 4096(May be incompatible with some clients!) bytes.
#
$firstNode.SectorSize = 4096

$secondNode = new-Object Node

$secondNode.HostName = "yy.yy.yy.yy"
$secondNode.HostPort = "3261"
$secondNode.Login = "root"
$secondNode.Password = "starwind"
$secondNode.ImagePath = "My computer\E\CSV2"
$secondNode.ImageName = "PartnerCSV2"
$secondNode.Size = 1402880
$secondNode.CreateImage = $true
$secondNode.TargetAlias = "PartnerCSV2"
$secondNode.AutoSynch = $true
$secondNode.SyncInterface = "#p1=10.10.12.10:3260,10.10.12.11:3260"
$secondNode.HBInterface = "#p1=10.10.11.10:3260,10.10.13.10:3260,172.23.99.10:3260"
$secondNode.CacheSize = 4096
$secondNode.CacheMode = "wb"
$secondNode.SyncSessionCount = 1
$secondNode.ALUAOptimized = $true
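
As a quick sanity check on the two listings above, the sizes can be compared across the script and the SWDSK file. The sketch below assumes that Size and CacheSize in the script are in MB and that the SWDSK interval sizes are in binary GB (1 GB = 1024 MB). The thread does not state these units, so treat that reading as an assumption. The helper name is made up for illustration.

```python
# Assumption: Node.Size and Node.CacheSize in the script are in MB, while the
# SWDSK <interval> sizes are in GB, with 1 GB = 1024 MB. The thread does not
# state the units explicitly.
MB_PER_GB = 1024

# Values copied from the CreateCSV script (assumed MB).
script_values_mb = {"image": 1402880, "l1_cache": 4096}
# Values copied from the SWDSK file (GB).
swdsk_values_gb = {"image": 1370, "l1_cache": 4}


def mismatches(script_mb, swdsk_gb):
    """Return the keys whose script size (MB) differs from the SWDSK size (GB)."""
    return [k for k in script_mb if script_mb[k] != swdsk_gb[k] * MB_PER_GB]


# An empty list means the two listings agree under the unit assumption above.
print(mismatches(script_values_mb, swdsk_values_gb))
```

Under that assumption the image size and the L1 cache size match between the two files, so a size mismatch between the script and the resulting device is unlikely to be the issue here.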
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Mon Jun 24, 2019 10:07 am

Could you provide logs from your servers? You can use https://knowledgebase.starwindsoftware. ... collector/ to get them bundled.
Sajmon
Posts: 11
Joined: Fri Mar 22, 2019 3:39 pm

Mon Jun 24, 2019 12:26 pm

Hello Boris,
thanks for getting back to me. I have sent the logs to support@starwindsoftware.com with the subject "VMs keep getting corrupted", as we should not post logs here.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Mon Jun 24, 2019 1:10 pm

The ticket has been received.
I will keep the community updated on the process.
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Thu Jun 27, 2019 1:25 pm

Here is an update.
The configuration initially used had major flaws. At the moment, the environment is being reconfigured by Sajmon and I believe the results will be reported later.
Sajmon
Posts: 11
Joined: Fri Mar 22, 2019 3:39 pm

Tue Jul 09, 2019 8:03 am

Hello all,
so I had placed the iSCSI and Synchronization links under NIC teaming, which is not desirable, and that is why the data corruption occurred. The NTFS events vanished after I reconfigured the networking and dedicated separate links to iSCSI and Sync, and I was able to run test VMs for a couple of days with no problems.
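
For anyone wanting to audit a similar setup, the rule described above (no iSCSI or Sync traffic on teamed adapters) can be expressed as a simple check. This is only a minimal sketch: the adapter names, team names and role labels are hypothetical and are not taken from this cluster, and the real inventory would have to be collected from the hosts themselves.

```python
# Hypothetical adapter inventory: names, team membership and roles are made up
# for illustration and are NOT taken from the cluster in this thread.
ADAPTERS = [
    {"name": "NIC1", "team": "Team1", "roles": {"management"}},
    {"name": "NIC2", "team": "Team1", "roles": {"management", "iscsi"}},
    {"name": "NIC3", "team": None, "roles": {"sync"}},
    {"name": "NIC4", "team": None, "roles": {"iscsi"}},
]

# Traffic types that should run over dedicated, non-teamed links.
STORAGE_ROLES = {"iscsi", "sync"}


def teamed_storage_links(adapters):
    """Return names of adapters that carry iSCSI or Sync traffic while teamed."""
    return [
        a["name"]
        for a in adapters
        if a["team"] is not None and a["roles"] & STORAGE_ROLES
    ]


# Any name printed here is a link that would need to be moved out of the team.
print(teamed_storage_links(ADAPTERS))
```

With the sample inventory, only the adapter that is both a team member and carries iSCSI traffic is flagged. The non-teamed iSCSI and Sync adapters pass.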
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Tue Jul 09, 2019 8:12 am

Hello Sajmon,
Thank you for the confirmation.