HA device timeout error

logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Sun Feb 04, 2024 9:55 am

Hi all.

All of a sudden, we started receiving the warning/error below:

command WRITE underlying device response time is longer than expected 11 sec

After a restart, everything works fine.

What could be the issue?
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 04, 2024 10:17 am

Hi,

This is not an error; it is a normal event reported when underlying storage delays start reaching the threshold for a synchronization drop, and it has to be analyzed in context.
The event itself indicates that there is a risk of a synchronization drop.
It can be caused by backups, storage problems or processes on a node or its partner (e.g., patrol read, consistency check, MDADM sync), heavy I/O spikes, resynchronization while production is running, etc.
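If you want to sanity-check the array outside of StarWind, here is a minimal Python sketch that times small synchronous writes and flags slow ones. The path, block size, and threshold are assumptions for illustration; point it at a scratch file on the same volume that holds the StarWind .img files.

    # Rough latency probe for the underlying array: time small synchronous
    # writes and flag anything slower than the threshold. The path, block
    # size, and threshold below are assumptions for illustration only.
    import os, time

    TEST_FILE = r"D:\StarWind\latency_probe.bin"   # scratch file on the RAID volume (hypothetical path)
    BLOCK = os.urandom(64 * 1024)                  # 64 KiB per write
    THRESHOLD_S = 1.0                              # flag writes slower than 1 second

    with open(TEST_FILE, "wb", buffering=0) as f:
        for i in range(100):
            t0 = time.time()
            f.write(BLOCK)
            os.fsync(f.fileno())                   # push the write down to the device
            elapsed = time.time() - t0
            if elapsed > THRESHOLD_S:
                print(f"write {i}: {elapsed:.2f} s - slower than expected")
            time.sleep(0.5)

    os.remove(TEST_FILE)
    print("probe finished")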
Could you please tell me more about the underlying storage configuration (e.g., RAID settings and disks)?
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Sun Feb 04, 2024 10:31 am

Hi again,

It is a RAID5 configuration with Kingston mixed-use SSDs. The VMs stop responding, and this is affecting production.

Are there any settings you would recommend changing or updating?
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 04, 2024 11:26 am

Hi,

StarWind VSAN provides active-active storage, meaning that storage should remain available from the functioning partner when a synchronization drop happens. There is a chance of a misconfiguration here; most likely, the VMs go down the moment synchronization is lost or the service is stopped.
Make sure that:
1. All iSCSI paths are available and the HA devices are connected from both nodes (a quick check is sketched after this list).
2. StarWind VSAN 8.0.15260 is installed.
3. No backup of the underlying storage or of the *.img files is running.
4. The hardware is healthy and no underlying storage processes are running on either node.
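For the first item, a minimal Python sketch is below, assuming the Windows in-box iSCSI initiator and its Get-IscsiSession cmdlet; run it on each node to list the sessions and flag any that are not connected.

    # List iSCSI sessions on this node and flag any that are not connected.
    # Assumes the Windows in-box Get-IscsiSession cmdlet (Windows Server 2012+).
    import subprocess

    PS = ["powershell", "-NoProfile", "-Command"]
    QUERY = ("Get-IscsiSession | "
             "Select-Object TargetNodeAddress, IsConnected | "
             "ConvertTo-Csv -NoTypeInformation")

    output = subprocess.run(PS + [QUERY], capture_output=True, text=True, check=True).stdout
    for line in output.splitlines()[1:]:                 # skip the CSV header
        if not line.strip():
            continue
        target, connected = line.strip().strip('"').split('","')
        status = "OK" if connected.lower() == "true" else "NOT CONNECTED"
        print(f"{status:14} {target}")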
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Mon Feb 05, 2024 11:12 am

1. All iSCSI paths are available and the HA devices are connected from both nodes.
All sessions for the HA devices are connected, but I need to check again during the timeout error. Is this what should be checked?

2. StarWind VSAN 8.0.15260 is installed.
An old version is installed; we will plan for the upgrade soon.

3. No backup of the underlying storage or of the *.img files is running.
A scheduled backup job runs hourly for the VMs using a third-party application. Does this matter?

4. The hardware is healthy and no underlying storage processes are running on either node.
The hardware is healthy, but it is an old HPE DL380e Gen8 with a P420i RAID controller in a RAID5 configuration.
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Mon Feb 05, 2024 11:40 am

Greetings,
All sessions for the HA devices are connected, but I need to check again during the timeout error. Is this what should be checked?
When the timeout happens, please check whether storage is still connected from the synchronized server (a small logging sketch is at the end of this post). There might be a misconfiguration, as a synchronization drop on one node should not affect production uptime.
A scheduled backup job runs hourly for the VMs using a third-party application. Does this matter?
Yes, backups induce storage latency, and if the storage is RAID5 built from HDDs, synchronization drops can be frequent.
The hardware is healthy, but it is an old HPE DL380e Gen8 with a P420i RAID controller in a RAID5 configuration.
Are the disks HDDs? If so, this is not a recommended configuration, and synchronization drops can be common.
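Since the state has to be captured at the moment the warning fires, a small Python logger like the sketch below (same assumption as the earlier sketch: the Windows in-box Get-IscsiSession cmdlet) can record the session state with timestamps, so the log can be lined up against the warning times afterwards.

    # Log the connection state of all iSCSI sessions every 10 seconds so it
    # can be correlated with the "WRITE ... longer than expected" warnings.
    # Stop the script with Ctrl+C once the event has been captured.
    import datetime, subprocess, time

    CMD = ["powershell", "-NoProfile", "-Command",
           "Get-IscsiSession | Select-Object -ExpandProperty IsConnected"]

    with open("iscsi_session_log.txt", "a") as log:
        while True:
            states = subprocess.run(CMD, capture_output=True, text=True).stdout.split()
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} connected={states}\n")
            log.flush()
            time.sleep(10)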
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Mon Feb 05, 2024 1:08 pm

When the timeout happens, please check whether storage is still connected from the synchronized server. There might be a misconfiguration, as a synchronization drop on one node should not affect production uptime.
How can this be checked?
Yes, backups induce storage latency, and if the storage is RAID5 built from HDDs, synchronization drops can be frequent.
Noted.
Are the disks HDDs? If so, this is not a recommended configuration, and synchronization drops can be common.
Mixed-use SSDs are installed in this setup.
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Mon Feb 05, 2024 1:50 pm

How can this be checked?
You can check it in the iSCSI Initiator. What hypervisor do you use?

The events themselves usually come from the underlying storage, and synchronization drops are self-healed within 30 minutes, provided that everything is OK with the storage. Downtime normally does not happen when half of the paths disappear, as the other mirror keeps acting as "active".
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Mon Feb 05, 2024 2:20 pm

What hypervisor do you use?
Microsoft Hyper-V.
I got your point. I will check in the iSCSI Initiator whether the sessions are active or disconnected, right?

By the way, I was reading the release notes of the latest StarWind version and I see the fix below. Does this relate to the sync and HA error?
Synchronous Replication
Fixed an issue with synchronization/heartbeat channels unable to restore the connection when performance degradation occurs on the underline storage and VAAI is enabled for StarWind HA Devices.
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Mon Feb 05, 2024 3:10 pm

DATA targets should be connected from both nodes.
The witness is to be connected only locally.
By the way, I was reading the release notes of the latest StarWind version and I see the fix below. Does this relate to the sync and HA error?
VAAI/CAW is relevant to vSphere only.
Again, this is not an error; it is a warning showing that the storage is having a hard time writing/reading data and that there might be a synchronization drop.
To make it easier, you can share the logs with me. Use the StarWind Log Collector to collect the logs from both nodes: https://knowledgebase.starwindsoftware. ... collector/.
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Thu Feb 08, 2024 7:05 am

The witness is to be connected only locally.
I have 2 compute nodes and 2 storage nodes. How do I connect the witness locally?
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Thu Feb 08, 2024 7:27 am

Thank you for your update.
Are the storage nodes storage-only, or are they also part of the cluster?
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Thu Feb 08, 2024 10:05 am

They are dedicated storage nodes, separate from the cluster compute nodes.
yaroslav (staff)
Staff
Posts: 2360
Joined: Mon Nov 18, 2019 11:11 am

Thu Feb 08, 2024 10:17 am

Sorry for the confusion. Please connect the witness from each storage box to both compute servers.
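For illustration, here is a minimal Python sketch of doing that from one compute node with the Windows in-box iSCSI cmdlets. The portal addresses and the witness IQN filter are assumptions; replace them with your storage-node addresses and the actual witness target name shown in the StarWind console.

    # Connect the witness target from both storage boxes on this compute node.
    # The IP addresses and the "witness" IQN hint are placeholders.
    import subprocess

    PORTALS = ["172.16.10.1", "172.16.10.2"]   # hypothetical storage-node iSCSI IPs
    WITNESS_HINT = "witness"                    # substring of the witness target IQN

    def ps(command):
        """Run a PowerShell command and return its text output."""
        result = subprocess.run(["powershell", "-NoProfile", "-Command", command],
                                capture_output=True, text=True)
        return result.stdout

    for portal in PORTALS:
        # Register the storage node's portal so its targets become discoverable.
        ps("New-IscsiTargetPortal -TargetPortalAddress " + portal)

        # Connect every discovered-but-not-connected target whose IQN contains
        # the witness hint, persistently and with MPIO enabled.
        connect = (
            "Get-IscsiTarget"
            " | Where-Object NodeAddress -Like '*" + WITNESS_HINT + "*'"
            " | Where-Object IsConnected -EQ $false"
            " | ForEach-Object { Connect-IscsiTarget -NodeAddress $_.NodeAddress"
            " -TargetPortalAddress " + portal +
            " -IsPersistent $true -IsMultipathEnabled $true }"
        )
        print(ps(connect))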
logitech
Posts: 24
Joined: Sun Feb 04, 2024 9:50 am

Thu Feb 08, 2024 11:00 am

I already have the witness quorum disk attached to the compute nodes and verified as the witness in the Hyper-V cluster configuration, and the MPIO policy is set to Failover Only.
Is there any other witness configuration missing on the node side?