Underlying device response time is longer than expected

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Anatoly (staff), Max (staff)

Post Reply
mkhc
Posts: 5
Joined: Sat Jun 27, 2015 12:48 pm

Wed Dec 02, 2020 1:49 am

I installed VSAN on two-node Hyper-V HyperConverged setting back in 2015. It has been working well until recently (2 weeks) one of the nodes shows the warning message "Underlying device response time is longer than expected". It has caused one of the Hyper-V cluster node failed, some VM's failed and some files corrupted on the file server.

I've upgraded StarWind software on both nodes but the warnings still show, frequently, several piles a day.

I ran diskspd on the CSV (C:\ClusterStorage\...) and found the read and write performance dropped to one-third on the problem node, when compared with benchmark obtained 5 years ago. The underlying device is HP Smart Array P440ar with 18 spinning SAS disks. I tested diskspd on the underlying device (E:\..) and found no performance difference.

The diskspd test on the other node is normal.

Any clue on this?
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Wed Dec 02, 2020 2:33 pm

Underlying storage response events often point to the problems with underlying storage (e.g., drive failure or underlying storage processes). I would recommend checking if all the disks are doing fine first.
Also, may I learn the storage configuration you have?
Also note that underlying storage response events cannot cause the entire server to go down. If delays are huge, they lead to synchronization loss. If all is configured right, sync loss does not cause data corruption. Do you have any logs from that time?
If the issues started happening recently, there might be something with the underlying storage.
mkhc
Posts: 5
Joined: Sat Jun 27, 2015 12:48 pm

Tue Dec 29, 2020 9:46 am

The server showed RAID card error, so I called server support to replace it and the cache battery. After replace, done extensive diagnostics on the hardware level including hard disk surface scan; and sfc and chkdsk (on the volume where Starwind image files reside) on OS level. All tests returns no error.

However, Starwind on this node still shows underlying device response time warning. Some were more than 10 seconds and the sync paused for 30 minutes.

I followed this guideline (dated January 2015) when first installed:
Installation and Configuration of Hyper-Converged 2 Nodes with Hyper-V Cluster (https://www.starwindsoftware.com/Starwi ... luster.pdf)

It is a two-node setup, each in a HP DL380 Gen 9 server with Smart Array P440ar and 18 SAS 10K 300GB disks in RAID 5. Starwind serves 3 CSV's totaling 4TB in size. The only change I made is that the MPIO mode was set to failover.

I'm not sure what I should do to troubleshoot the problem now. Would reinstalling Starwind or even Windows (on this node) help?
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Tue Dec 29, 2020 10:10 am

First, you can try updating StarWind VSAN as discussed here https://knowledgebase.starwindsoftware. ... d-version/.
RAID 5 out of HDDs is not what we generally recommend due to potential performance issues; see more about recommended RAID configurations here https://knowledgebase.starwindsoftware. ... ssd-disks/. Please share the logs with me to let me examine the delays. Use this tool to collect logs https://knowledgebase.starwindsoftware. ... collector/. You can transfer them via GoogleDrive.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Tue Jan 05, 2021 3:52 pm

Hi,

I have checked the log. The service on node 1 is not running since DEC 28. Please start it and let synchronization to complete.
mkhc
Posts: 5
Joined: Sat Jun 27, 2015 12:48 pm

Fri Jan 29, 2021 9:17 am

I disabled the service on Dec 28 and let the other node run for this production cluster.

I started it again on Jan 14 and monitored the log. No more complains about underlying storage were observed, so I resumed Failover Cluster Node (Hyper-V) on this node.

Kept monitoring the log since then, and the following warnings about underlying storage have been logged until now.

Code: Select all

1/20 22:06:20.084156 c2c Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000B82170FAB0, opcode 0x88) execution time is 3484 ms.
1/20 22:06:20.084331 c2c HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000B82170FA80) storageRequest(0x000000B82170FAB0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3484, ms DiffTimeCompleteINIT = 3484 ms, DiffTimeCompleteEXEC = 3484 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
1/23 2:56:08.134101 1394 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000B82170F1B0, opcode 0x88) execution time is 3625 ms.
1/23 2:56:08.134282 1394 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000B82170F180) storageRequest(0x000000B82170F1B0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3625, ms DiffTimeCompleteINIT = 3625 ms, DiffTimeCompleteEXEC = 3625 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
1/23 5:45:03.800312 c2c Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000B78F22E8F0, opcode 0x8A) execution time is 3609 ms.
1/23 5:45:03.800503 c2c HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000B78F22E8C0) storageRequest(0x000000B78F22E8F0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x2A) ex(0x0), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 3609, ms DiffTimeCompleteINIT = 3609 ms, DiffTimeCompleteEXEC = 3609 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
1/24 13:23:34.378752 16c4 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000B81A3AE090, opcode 0x88) execution time is 3438 ms.
1/24 13:23:34.379216 16c4 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000B81A3AE060) storageRequest(0x000000B81A3AE090) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3438, ms DiffTimeCompleteINIT = 3438 ms, DiffTimeCompleteEXEC = 3438 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv3'
1/27 9:04:08.423325 193c Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000B81D640700, opcode 0x88) execution time is 3688 ms.
1/27 9:04:08.423491 193c HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000B81D6406D0) storageRequest(0x000000B81D640700) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3688, ms DiffTimeCompleteINIT = 3688 ms, DiffTimeCompleteEXEC = 3688 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
1/27 9:04:08.424372 193c Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000B81D63E600, opcode 0x88) execution time is 3688 ms.
1/27 9:04:08.424417 193c HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000B81D63E5D0) storageRequest(0x000000B81D63E600) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3688, ms DiffTimeCompleteINIT = 3688 ms, DiffTimeCompleteEXEC = 3688 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'

---------- Windows update, rebooted node ----------

1/28 3:49:34.952325 1e1c Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000929AA47B70, opcode 0x88) execution time is 3500 ms.
1/28 3:49:34.952485 1e1c HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000929AA47B40) storageRequest(0x000000929AA47B70) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3500, ms DiffTimeCompleteINIT = 3500 ms, DiffTimeCompleteEXEC = 3500 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
1/28 3:49:34.959652 1e1c Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000929EDF3440, opcode 0x88) execution time is 3453 ms.
1/28 3:49:34.959740 1e1c HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000929EDF3410) storageRequest(0x000000929EDF3440) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3453, ms DiffTimeCompleteINIT = 3453 ms, DiffTimeCompleteEXEC = 3453 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
1/29 9:59:30.674623 1e24 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x000000929AB40310, opcode 0x88) execution time is 3391 ms.
1/29 9:59:30.674784 1e24 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x000000929AB402E0) storageRequest(0x000000929AB40310) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x28) ex(0x0), StorageOpCode(0x88), StorageDiffTimeCompleteEXEC = 3391, ms DiffTimeCompleteINIT = 3391 ms, DiffTimeCompleteEXEC = 3391 ms. Current target: 'iqn.2008-08.com.starwindsoftware:engg-cloudhost1.enggoffice.hku.hk-csv2'
Although there are still occasional warnings of around 3 seconds each, I accepted this node is usable for production. Yet, I turned on the email notification on the Starwind so that if there are more than 10 seconds delay, I got notified.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Fri Jan 29, 2021 9:50 am

Greetings,

Please note that we do not recommend RAID 5 out of HDDs. If those warnings are scattered, just make sure to check storage health if they become too frequent or StarWind will start getting out of sync.
Let me know if there is still anything I can assist you with.
Post Reply