Datastore is being "corrupted" each and every time!

Software-based VM-centric and flash-friendly VM storage + free version


MISDestek
Posts: 2
Joined: Tue Jun 23, 2020 3:54 pm

Tue Jun 23, 2020 4:04 pm

Hi,

When we try to add a datastore from our vSphere 7.0 ESXi hosts in our compute-and-storage-separated architecture:
1. For Storage1, it works without corruption, but it is slow.
2. For Storage2, the datastore gets corrupted every single time we add a new datastore in VMware.

However, when we unplug or disconnect the cables of the second node of the StarWind HA cluster, there is no corruption. It doesn't matter which node it is: even if we disconnect the cables of the first node of the StarWind HA cluster instead, the result is the same and there is no corruption.

Whenever we use the active-active paths to the targets, the datastore gets corrupted every single time.
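
For reference, this is roughly how we look at the multipath setup on the ESXi side with PowerCLI; just a sketch, and the host name is a placeholder:

# Connect to one of the ESXi hosts (placeholder name; you will be prompted for credentials)
Connect-VIServer -Server esxi01.example.local

# Show the multipath policy and size of every iSCSI disk device the host sees
Get-VMHost | Get-ScsiLun -LunType disk |
    Select-Object CanonicalName, MultipathPolicy, CapacityGB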


Here is our architecture summary:

We’ve got two HA StarWind Virtual SAN Free storage clusters on Windows 2019 Standard Servers (Storage1, Storage2)


Each storage cluster has two nodes with HA clustered devices (four nodes in total).


Storage1 has a hardware RAID 10 array of 16 x 2 TB SSDs on each of its two nodes.

Storage2 has a hardware RAID 10 array of 16 x 8 TB HDDs on each of its two nodes.



RAID controller settings for both nodes of Storage1:

RAID Type : RAID 1+0
Number of Disks in RAID : 16
Disk Cache Policy : Default
Write Policy : Write-Through
Read Policy : No Read Ahead
Stripe Size : 64k




RAID controller settings for both nodes of Storage2:
RAID Type : RAID 1+0
Number of Disks in RAID : 16
Disk Cache Policy : Default
Write Policy : Write-Back
Read Policy : Read Ahead
Stripe Size : 64k
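
For completeness, the policies above can be double-checked per virtual drive with the StorCLI utility for these AVAGO/LSI 3108 controllers; the storcli64.exe path below is only an example, and the controller index may differ in your setup:

# Show all virtual drive properties (write policy, read policy, disk cache) on controller 0
& "C:\StorCLI\storcli64.exe" /c0 /vall show all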



Underlying disks on both nodes of Storage1:
16 x Micron_1300_MTFD 2 TB (H/W RAID1+0)

Underlying disks on both nodes of Storage2:
16 x Seagate 8TB ST8000NM0055 (H/W RAID1+0)




RAID controller card details for all four nodes:

AVAGO 3108 MegaRAID RAID Controller

Package: 24.21.0-0100
FW Version: 4.680.00-8465
BIOS Version: 6.36.00.3_4.19.08.00_0x06180203
Boot Block Version: 3.07.00.00-0003



Server hardware details for all four nodes:
Supermicro SYS-6029P-TRT
Firmware Revision: 01.71.11
Firmware Build Time: 10/25/2019
BIOS Version: 3.3
BIOS Build Time: 02/24/2020
Redfish Version: 1.0.1
CPLD Version: 02.b1.0E
2 x Intel Xeon Silver 4214 CPU
4 x Micron 32GB 36ASF4G72PZ-2G9E2 RAM


NICs for the iSCSI and synchronization channels on all four nodes (jumbo frames enabled at 9014 bytes):
Intel X722 DUAL 10GBASE-T
Supermicro DUAL 10GBASE-T
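
If anyone wants to verify this setting on the Windows nodes, here is a quick sketch using the in-box NetAdapter cmdlets (for our drivers 9014 means jumbo frames enabled and 1514 means disabled, but the exact values depend on the driver):

# List the Jumbo Packet setting for every adapter
Get-NetAdapterAdvancedProperty -RegistryKeyword "*JumboPacket" |
    Select-Object Name, DisplayName, RegistryValue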





PowerShell source code used to create the target and LUN device:

Import-Module StarWindX

try
{
    Enable-SWXLog

    # Connect to the local StarWind service
    $server = New-SWServer -host 127.0.0.1 -port 3261 -user root -password xxxxx
    $server.Connect()

    # First (local) node of the HA device
    $firstNode = New-Object Node
    $firstNode.ImagePath = "My computer\F\Data1\StarWindDevices1"
    $firstNode.ImageName = "dstdskhdd01_n1"
    $firstNode.Size = 1024
    $firstNode.CreateImage = $true
    #$firstNode.StorageName = "dststrsan02"
    $firstNode.TargetAlias = "dstdskhdd01_nd1"
    $firstNode.AutoSynch = $true
    # Synchronization and heartbeat interfaces pointing to the partner (second) node
    $firstNode.SyncInterface = "#p2=172.18.21.20:3260,172.18.22.20:3260,172.18.23.20:3260,172.18.24.20:3260"
    $firstNode.HBInterface = "#p2=172.18.85.23:3260,172.18.84.22:3260,172.18.94.22:3260"
    $firstNode.PoolName = "pool1"
    $firstNode.SyncSessionCount = 1
    $firstNode.ALUAOptimized = $true
    $firstNode.CacheMode = "wb"
    $firstNode.CacheSize = 5120
    $firstNode.FailoverStrategy = 0
    # $firstNode.CreateTarget = $createTarget

    #
    # Device sector size. Possible values: 512 or 4096 (may be incompatible with some clients!) bytes.
    #
    $firstNode.SectorSize = 512

    # Second (partner) node of the HA device
    $secondNode = New-Object Node
    $secondNode.HostName = "172.18.85.23"
    $secondNode.HostPort = "3261"
    $secondNode.Login = "root"
    $secondNode.Password = "Srv951"
    $secondNode.ImagePath = "My computer\F\Data1\StarWindDevices1"
    $secondNode.ImageName = "dstdskhdd01_n2"
    $secondNode.Size = 1024
    $secondNode.CreateImage = $true
    #$secondNode.StorageName = "dststrsan02"
    $secondNode.TargetAlias = "dstdskhdd01_nd2"
    $secondNode.AutoSynch = $true
    # Synchronization and heartbeat interfaces pointing back to the first node
    $secondNode.SyncInterface = "#p1=172.18.21.10:3260,172.18.22.10:3260,172.18.23.10:3260,172.18.24.10:3260"
    $secondNode.HBInterface = "#p1=172.18.85.21:3260,172.18.84.21:3260,172.18.94.21:3260"
    $secondNode.SyncSessionCount = 1
    $secondNode.ALUAOptimized = $true
    $secondNode.CacheMode = "wb"
    $secondNode.CacheSize = 5120
    $secondNode.FailoverStrategy = 0
    # $secondNode.CreateTarget = $createTarget2
    $secondNode.SectorSize = 512

    # Create the HA device and wait for the initial synchronization to finish
    $device = Add-HADevice -server $server -firstNode $firstNode -secondNode $secondNode -initMethod "Clear"

    $syncState = $device.GetPropertyValue("ha_synch_status")

    while ($device.SyncStatus -ne [SwHaSyncStatus]::SW_HA_SYNC_STATUS_SYNC)
    {
        $device.Refresh()

        $syncState = $device.GetPropertyValue("ha_synch_status")
        $syncPercent = $device.GetPropertyValue("ha_synch_percent")

        Start-Sleep -Milliseconds 2000

        Write-Host "Synchronizing: $($syncPercent)%" -ForegroundColor Yellow
    }
}
catch
{
    Write-Host "Exception $($_.Exception.Message)" -ForegroundColor Red
}
finally
{
    $server.Disconnect()
}
MISDestek
Posts: 2
Joined: Tue Jun 23, 2020 3:54 pm

Thu Jun 25, 2020 3:37 pm

Hi everyone,

I want to share an update on our case in case someone else needs it.

I think we found the root cause.

It turns out both storage systems were installed with 4 x 10 Gbps (40 Gbps total) sync ports. Two of those ports are on Intel X722 cards, and the Intel X722 cards are a bit problematic.

We disabled the Jumbo Packet option on both ports of the Intel X722 cards on both servers and renamed the iSER_DM.dll file to iSER_DM.dll.bak in the StarWind software folder. Then we created a new HA LUN in StarWind and a corresponding new datastore on vSphere ESXi 7.0. This time there was no corruption.
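
For anyone who wants to reproduce the workaround, this is roughly what we ran in PowerShell; the adapter filter and the StarWind installation path are assumptions for our setup, so adjust them to yours:

# Disable jumbo packets (1514 = standard frames) on the Intel X722 ports
Get-NetAdapter -InterfaceDescription "Intel(R) Ethernet Connection X722*" | ForEach-Object {
    Set-NetAdapterAdvancedProperty -Name $_.Name -RegistryKeyword "*JumboPacket" -RegistryValue 1514
}

# Rename the iSER DLL so StarWind stops loading it (path is an assumption; adjust to your install folder)
Rename-Item "C:\Program Files\StarWind Software\StarWind\iSER_DM.dll" "iSER_DM.dll.bak"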
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Thu Jun 25, 2020 3:46 pm

Hi there, thank you for the update. It can be a matter of drivers, but if Jumbos do not work, just disable them please. Were jumbos enabled everywhere (i.e., host level and VM)?
Sorry for the delayed response.
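
By the way, a quick end-to-end check is a non-fragmented ping with an 8972-byte payload (8972 bytes of data plus 28 bytes of IP/ICMP headers equals a 9000-byte MTU); the address below is just taken from your script as an example:

# From a Windows node (PowerShell or cmd)
ping -f -l 8972 172.18.85.23

# From an ESXi host the equivalent is run in the ESXi shell, not PowerShell:
# vmkping -d -s 8972 <StarWind iSCSI IP>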

Please note that iSER in StarWind VSAN is an experimental feature that is not intended for production yet. If you want to bring this cluster to production, I'd rather use good old iSCSI.

Let me know if the problem is still there.
Foxbat_25
Posts: 3
Joined: Tue Jun 30, 2020 2:40 pm

Tue Jun 30, 2020 8:07 pm

Thanks for the input, that's awesome!!! I had my first bad experience with StarWind a couple of months ago, when we almost lost an entire investment simulation in Athens at work due to corruption (fortunately, we had several backup saves), and the offending computer runs on, you guessed it, an X722 card.

I have wanted our company to swap them for something else for a while, and now I have the evidence I needed to convince our manager! (Or at least I hope so.)
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Tue Jun 30, 2020 8:30 pm

Hope you'll finally replace the NICs!