HA device timeout error



yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Fri Mar 01, 2024 1:31 am

Sorry, did not have a chance to look into them. I will review them today/Sat and let you know.
yaroslav (staff)

Sun Mar 03, 2024 10:21 pm

Sorry for the delay.
I checked the logs. The first thing I noticed was the build. Please update to 15260.
Also, both systems' service logs end on 12/23.
Download the latest build at https://starwind.com/tmplink/starwind-v8.exe
Guide: https://knowledgebase.starwindsoftware. ... d-version/
Please also make sure that you do not use teaming (https://www.starwindsoftware.com/best-p ... practices/).

Please update StarWind VSAN first, as the update might change certain thresholds on its own.
If the issue persists, check the storage. The steps below assume that you monitor storage health and have double-checked it.
1. Make sure HA devices are synchronized.
2. Make sure HA devices are connected over iSCSI.
3. On one node:
3.1. Stop StarWindService.
3.2. Back up StarWind.cfg (can be found at C:\Program Files\StarWind Software\StarWind).
3.3. Edit StarWind.cfg
3.4. Locate the following lines and change the values:
<StorPerfDegTimeLimitMs value="7000"/> -> set to 15000
<iScsiGenCmdSendCmdTimeoutInSec value="10"/> -> set to 18
<iScsiPingCmdSendCmdTimeoutInSec value="5"/> -> set to 14
3.5. Save.
3.6. Exit.
3.7. Start the service.
3.8. Let fast synchronization happen.
4. Repeat steps 1-3 on the other node.
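
For anyone who prefers to script steps 3.2-3.5, here is a minimal Python sketch. It is not an official StarWind tool. It assumes each of the three entries appears exactly once in StarWind.cfg in the `<Name value="..."/>` form shown above, and that the file can be read as UTF-8 (an assumption, check your file). The function names are made up for illustration. Stop StarWindService before running it and start it again afterwards, as in the steps above.

```python
import re
import shutil
from pathlib import Path

# Target values taken from step 3.4 above.
NEW_VALUES = {
    "StorPerfDegTimeLimitMs": "15000",
    "iScsiGenCmdSendCmdTimeoutInSec": "18",
    "iScsiPingCmdSendCmdTimeoutInSec": "14",
}


def apply_thresholds(cfg_text, new_values=NEW_VALUES):
    """Return cfg_text with the value attribute of each listed entry replaced."""
    for name, value in new_values.items():
        pattern = re.compile(r'(<' + re.escape(name) + r'\s+value=")[^"]*(")')
        cfg_text, count = pattern.subn(
            lambda m: m.group(1) + value + m.group(2), cfg_text
        )
        # Refuse to guess if the entry is missing or duplicated.
        if count != 1:
            raise ValueError(f"expected exactly one <{name}> entry, found {count}")
    return cfg_text


def update_config(cfg_path):
    """Back up the config next to itself (.bak), then rewrite it in place."""
    cfg_path = Path(cfg_path)
    shutil.copy2(cfg_path, cfg_path.with_name(cfg_path.name + ".bak"))
    # UTF-8 is an assumption about the file encoding.
    text = cfg_path.read_text(encoding="utf-8")
    cfg_path.write_text(apply_thresholds(text), encoding="utf-8")
```

Usage would look like `update_config(r"C:\Program Files\StarWind Software\StarWind\StarWind.cfg")`, run from an elevated prompt. The edit is a plain text substitution rather than an XML parse and rewrite, so the rest of the file stays byte-for-byte as the service wrote it.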
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Tue Mar 05, 2024 7:01 am

Thanks for the heads-up. You are amazing!

First, we will work on the update over the weekend, and after that we will apply the fine-tuning.
yaroslav (staff)

Tue Mar 05, 2024 8:34 am

You are welcome. Please note that updating to 15260 overrides some of the threshold settings. That said, make the threshold changes after the update.
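
Since the update can reset these values, a quick check after updating can show which entries need to be changed again. The following is only a sketch: it assumes the `<Name value="..."/>` form quoted earlier in this thread, and the function names are hypothetical.

```python
import re

# Values recommended earlier in this thread.
EXPECTED = {
    "StorPerfDegTimeLimitMs": "15000",
    "iScsiGenCmdSendCmdTimeoutInSec": "18",
    "iScsiPingCmdSendCmdTimeoutInSec": "14",
}


def read_thresholds(cfg_text, names=EXPECTED):
    """Return {name: current value or None} for each entry of interest."""
    found = {}
    for name in names:
        m = re.search(r'<' + re.escape(name) + r'\s+value="([^"]*)"', cfg_text)
        found[name] = m.group(1) if m else None
    return found


def find_drift(cfg_text, expected=EXPECTED):
    """Return {name: (current, wanted)} for entries that differ from expected."""
    current = read_thresholds(cfg_text, expected)
    return {
        name: (current[name], wanted)
        for name, wanted in expected.items()
        if current[name] != wanted
    }
```

Pass it the contents of StarWind.cfg as a string. An empty result means all three entries already hold the recommended values.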
Cheers :)
FlashMe
Posts: 16
Joined: Wed Aug 24, 2022 8:23 pm

Mon Oct 14, 2024 9:53 am

Hey guys,

Unfortunately, we see those errors on all StarWind VSAN installations.

It occurs on all installations, independent of OS and installation type: on CVM as well as on Microsoft bare-metal installations.

The hardware is HPE, Dell, and Lenovo, with different RAID controllers and types.

It seems that the root cause is StarWind VSAN itself, not the hardware. We have many customers who complain about the problem and ask us what is going on.
yaroslav (staff)

Mon Oct 14, 2024 5:48 pm

StarWind VSAN is a latency-sensitive service as it is responsible for active-active replication.
True, the event can be observed on ANY hardware, because the cause is processes happening at the hardware level, not StarWind VSAN itself. Think of it this way: VSAN simply experiences latencies it cannot tolerate, and that can happen on any hardware.
So, the cause can be requests, workload, underlying storage processes, or configuration. The service here is just logging those events.

p.s. HA device sync status almost always heals itself in 30 minutes.
FlashMe

Mon Oct 14, 2024 6:41 pm

Sorry, but we also saw this event when the CVM hung. We had 4 different customers where the CVM and other services hung for no apparent reason. All were on the latest version. No events or anomalies at the hardware level. No backups or high IOPS. No consistency check, no patrol read. Nothing.

There must be something in StarWind VSAN itself. You can speak with 4 different people, and they will all tell you the same.

It's easy to say it's the hardware, but that is not the case. You should check the application and the kernel.
yaroslav (staff)

Mon Oct 14, 2024 7:31 pm

You are right, but it is not happening on all systems. I am telling you this from the experience of reviewing numerous logs and multiple escalations.
The I/O path is definitely longer when moving from a Windows-native installation to CVM. By interrupting synchronization, VSAN is responding to conditions that could otherwise harm performance and stability.