HA device timeout error

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Sun Feb 25, 2024 10:33 am

Hi,

I read some recommendations about MPIO settings for compute and storage-separated configurations (NOT HCI) that round robin is better than Least queue depth. Also, I need to learn about the configuration of witness connections from the compute nodes.
Our scenario:
3 compute nodes
2 vsan storage nodes
We have 2 CSV's and 1 Witness CSV
We are still facing the iSCSI reconnecting on one CSV and we are losing access to the VMs on this CSV, while the other CSV is still connected without any issues. we need to reboot the storage to reconnect.
P.S: in our scenario, witness targets must use only failover only policy?
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 25, 2024 10:58 am

Out of my experience, Least Queue Depth gives better performance as it is network-workload-aware. Round-robin is not.
Also, for compute and storage separated systems, connect witness as least queue depth with one path per storage node.
For your scenario, use least queue depth.
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Sun Feb 25, 2024 12:08 pm

ok, how to solve the timed-out HA device issue? the compute nodes lose connection to CSV1 out of a sudden(reconnecting status on ISCSI initiator) while connected to other CSVs. why is it not switching traffic to CSV1 to the other storage device since it is active-active?
we are using two sessions for each connection (CSV1 and CSV2) and the witness CSV single connection.
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 25, 2024 2:29 pm

CSVs should be available from both storage nodes.
The storage is availabe and it does switch path if iSCSI initiator is "active". Do you see the CSV going offline?
StarWind VSAN features active-active replication. If one storage node goes dark, other is available. If it is not, but a node is synchronized, there is a network misconfiguration, most probably or MPIO is not working well.
Synchronization drop alone should not cause the otage, just same as StarWindService going off.
What I am trying to say there might be something on the network side.
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Sun Feb 25, 2024 3:47 pm

thank you so much for the clarification.

Code: Select all

The storage is availabe and it does switch path if iSCSI initiator is "active". Do you see the CSV going offline?
NOP, I see only reconnecting message to one CSV out of 3 and this CSV hosts most of the VMS.
starwind reconnecting.jpg
starwind reconnecting.jpg (253.33 KiB) Viewed 19872 times
After rebooting the storage, everything will be healthy. it is happening once per week.

Code: Select all

there is a network misconfiguration, most probably or MPIO is not working well.
what to check exactly? MPIO configured as per your recommendation.
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 25, 2024 6:48 pm

Thanks for more info!
This whole thing does not relate to MPIO (i focused on MPIO as a part of best practices). Reconnecting state means that the storage is not being published for a given target. It is normal behavior if synchronization drop happens for instance.
This is not meant to be dangerous as StarWind HA storage is available from the other node and the production continues to run happily.
Please share the logs from both storage nodes with me and provide roughly the time when this happens.
Also, just to make sure we are on the same page. Are you trying to troubleshoot the reconnecting state, issues returning back to syncheonization, or storage outage (downtime)?
Also, did you try to observe the reconnecting state for 30 mins? Last but not least, is it a storage node that is restarted?
Let us do the following
1. Note the next reconnection event
2. Check the StarWind management console to see if HA devices are synchronized.
3. Collect the logs and share them here with the timestamp of when you noticed the issue.

How to collect the log https://knowledgebase.starwindsoftware. ... collector/
Will be waiting for you replies!
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Sun Feb 25, 2024 8:40 pm

quick update!!

I found some logs that are related to this timeout that is causing this issue
StarWind Virtual SAN v8.0.0 (Build 15159, [SwSAN], Win64)
Built Jun 2 2023 14:29:14
Windows Server 2012 R2 Standard Edition (v6.3 Build 9600)
12/3 16:01:45.293107 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB4F8AE50, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9719 msec) more than timeout (7000 msec)!
12/3 16:01:45.294283 9c4 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01BE640) ssc(0x0000007DB61F48E0) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 9761 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.294809 9d0 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01AC190) ssc(0x0000007DB6261DD0) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 8803 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.295442 9c8 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01AF7E0) ssc(0x0000007DB314E730) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 9456 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.295509 9cc IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01AEAE0) ssc(0x0000007DAC7C6E00) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 8228 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.295921 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB4F8C660, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9703 msec) more than timeout (7000 msec)!
12/3 16:01:45.296119 9cc IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01AEAE0) ssc(0x0000007DAC7C6E00) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 8229 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296061 9d0 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01AC190) ssc(0x0000007DB6261DD0) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 8805 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296089 9c8 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01AF7E0) ssc(0x0000007DB314E730) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 9457 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296034 9c4 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01BE640) ssc(0x0000007DB61F48E0) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 9763 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296139 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB4F8DB00, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 4096, execution time (8797 msec) more than timeout (7000 msec)!
12/3 16:01:45.296161 9cc Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DAD546060, opcode 0x8A) execution time is 8234 ms.
12/3 16:01:45.296185 9d0 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DB4F8DB00, opcode 0x8A) execution time is 8812 ms.
12/3 16:01:45.296209 9c8 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DAD59E1F0, opcode 0x8A) execution time is 9453 ms.
12/3 16:01:45.296240 9c4 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DB4F8A770, opcode 0x8A) execution time is 9765 ms.
12/3 16:01:45.296256 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB3827050, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9000 msec) more than timeout (7000 msec)!
12/3 16:01:45.296278 9cc HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DAD546030) storageRequest(0x0000007DAD546060) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(8192), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 8234 ms, DiffTimeCompleteINIT = 8234 ms, DiffTimeCompleteEXEC = 8234 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296296 9d0 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DB4F8DAD0) storageRequest(0x0000007DB4F8DB00) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(4096), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 8812 ms, DiffTimeCompleteINIT = 8812 ms, DiffTimeCompleteEXEC = 8812 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296315 9c8 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DAD59E1C0) storageRequest(0x0000007DAD59E1F0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0xF2) ex(0x0) buf_size(262144), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 9453 ms, DiffTimeCompleteINIT = 9453 ms, DiffTimeCompleteEXEC = 9453 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296459 9d0 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01B2A20) ssc(0x0000007DAD476D30) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 8229 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296390 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB382A3E0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9563 msec) more than timeout (7000 msec)!
12/3 16:01:45.296484 9d0 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01B2A20) ssc(0x0000007DAD476D30) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 8229 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296333 9c4 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DB4F8A740) storageRequest(0x0000007DB4F8A770) GeneralFunction(Execute SCSI Command) GeneralOpCode(0xF2) ex(0x0) buf_size(262144), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 9765 ms, DiffTimeCompleteINIT = 9765 ms, DiffTimeCompleteEXEC = 9765 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296477 9c8 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01AF570) ssc(0x0000007DB61FB870) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 7717 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296491 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB382B1A0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9719 msec) more than timeout (7000 msec)!
12/3 16:01:45.296430 9cc IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01B4350) ssc(0x0000007DB59B4250) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 8229 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296517 9d0 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DB3CE2780, opcode 0x8A) execution time is 8234 ms.
12/3 16:01:45.296551 9c4 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01A37A0) ssc(0x0000007DA5F05130) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 7717 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296568 9c8 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01AF570) ssc(0x0000007DB61FB870) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 7717 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296722 9c8 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DAD544BC0, opcode 0x8A) execution time is 7718 ms.
12/3 16:01:45.296606 9cc IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01B4350) ssc(0x0000007DB59B4250) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 8229 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296625 9d0 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DB3CE2750) storageRequest(0x0000007DB3CE2780) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(8192), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 8234 ms, DiffTimeCompleteINIT = 8234 ms, DiffTimeCompleteEXEC = 8234 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296641 9c4 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01A37A0) ssc(0x0000007DA5F05130) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 7717 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.296959 9c4 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DB2AA0E10, opcode 0x8A) execution time is 7718 ms.
12/3 16:01:45.296979 9c4 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DB2AA0DE0) storageRequest(0x0000007DB2AA0E10) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(4096), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 7718 ms, DiffTimeCompleteINIT = 7718 ms, DiffTimeCompleteEXEC = 7718 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296896 9cc Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DACF43830, opcode 0x8A) execution time is 8234 ms.
12/3 16:01:45.297036 9cc HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DACF43800) storageRequest(0x0000007DACF43830) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(8192), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 8234 ms, DiffTimeCompleteINIT = 8234 ms, DiffTimeCompleteEXEC = 8234 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296745 9c8 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DAD544B90) storageRequest(0x0000007DAD544BC0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(4096), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 7718 ms, DiffTimeCompleteINIT = 7718 ms, DiffTimeCompleteEXEC = 7718 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.296586 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD53E4A0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 4096, execution time (7703 msec) more than timeout (7000 msec)!
12/3 16:01:45.297191 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD53EB80, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 4096, execution time (7703 msec) more than timeout (7000 msec)!
12/3 16:01:45.297206 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD544BC0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 4096, execution time (7703 msec) more than timeout (7000 msec)!
12/3 16:01:45.297211 9c8 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01B3720) ssc(0x0000007DB35AD640) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 7718 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.297244 9c8 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01B3720) ssc(0x0000007DB35AD640) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 7718 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/3 16:01:45.297218 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD546060, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 8192, execution time (8219 msec) more than timeout (7000 msec)!
12/3 16:01:45.297270 9c8 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DACC6F1D0, opcode 0x8A) execution time is 7718 ms.
12/3 16:01:45.297299 9c8 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DACC6F1A0) storageRequest(0x0000007DACC6F1D0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(4096), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 7718 ms, DiffTimeCompleteINIT = 7718 ms, DiffTimeCompleteEXEC = 7718 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/3 16:01:45.297284 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD54A890, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9563 msec) more than timeout (7000 msec)!
12/3 16:01:45.297339 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD59EFB0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (7422 msec) more than timeout (7000 msec)!
12/3 16:01:45.297351 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5A0450, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9719 msec) more than timeout (7000 msec)!
12/3 16:01:45.297362 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD59E1F0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9438 msec) more than timeout (7000 msec)!
12/3 16:01:45.297373 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5A9F00, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9563 msec) more than timeout (7000 msec)!
12/3 16:01:45.297384 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5AD290, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9078 msec) more than timeout (7000 msec)!
12/3 16:01:45.297394 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5AC840, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9563 msec) more than timeout (7000 msec)!
12/3 16:01:45.297405 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5ACF20, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9563 msec) more than timeout (7000 msec)!
12/3 16:01:45.297416 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5B1750, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 8192, execution time (8219 msec) more than timeout (7000 msec)!
12/3 16:01:45.297427 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5B99F0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 36864, execution time (9766 msec) more than timeout (7000 msec)!
12/3 16:01:45.297501 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DAD5BAE90, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9563 msec) more than timeout (7000 msec)!
12/3 16:01:45.297516 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DB37911A0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (9703 msec) more than timeout (7000 msec)!
12/3 16:01:45.297426 9c8 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01A32C0) ssc(0x0000007DAC8C3B60) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 7718 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
my question is, can we fine tune the timeout in the starwind.cfg?

Code: Select all

 <!--Storage performance degradation time limit in milliseconds. 
          By default, this value = 7000 milliseconds.-->
      <StorPerfDegTimeLimitMs value="7000"/>
      <!--Period of time to keep information about storage performance degradation events in minutes. 
          By default, this value = 3 minutes.-->
      <StorPerfDegDetectPeriodMin value="3"/>
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Sun Feb 25, 2024 10:52 pm

Great!
Thanks for sharing!! You are thinking in right direction, but let us focus on other area.
It is not the timeouts though who causes the reconnecting state, but probably synchronization drop that is induced by those timeouts. Please seek for events like device changed its status to 3 from (I simply scan for to 3 from entries) and see if timeouts precede those. Furthermore, the right value needs to be applied. The timeouts not necessarily lead to synchronization drops too.
These are right timeouts to tweak, but I think this change will not solve the problem but mask it. It is important to approach the issue right.
This being said, can you please let me know
1. Does reconnection and sync drops happen during backups?
2. Does reconnecting state go in 30 mins after being discovered?
3. Do delays precede synchronization drops?
4. Do you have write-back cache on your devices?
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Mon Feb 26, 2024 10:20 am

12/21 21:15:36.116065 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DCD0FE2B0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (13203 msec) more than timeout (7000 msec)!
12/21 21:15:36.116092 b84 Common: CStarWindStorageDevice::getLastPerfDegEventElapsedTimeMs: perfDegEventElapsedTime = 0
12/21 21:15:36.116113 b84 func: <<< HANode::onScsiOp_CustomControl_HeartbeatPing
12/21 21:15:36.116140 b84 func: >>> HANode::onScsiOp_CustomControl_HeartbeatPingShutdown
12/21 21:15:36.116156 b84 HA: HANode::onScsiOp_CustomControl_HeartbeatPingShutdown: Event: 'not restorable' flag is 0, shutdown reason 'timeout on local underlying storage device', async update header flag is 1. Command from iqn.2008-08.com.starwindsoftware:vsan1-csv3-ssd.
12/21 21:15:36.116176 b84 func: >>> HANode::reportEvent
12/21 21:15:36.116195 b84 func: >>> iScsiServer::sendNotification
12/21 21:15:36.116213 b84 func: >>> CEventDataBase::AddRecord
12/21 21:15:36.116230 b84 EventDB: CEventDataBase::AddRecord: Code 798, severity 3, additional strings: 2
12/21 21:15:36.116282 b84 func: <<< CEventDataBase::AddRecord
12/21 21:15:36.116300 b84 Srv: iScsiServer::sendNotification: Send reaction start (reaction type 2, msg code 798).
12/21 21:15:36.116741 b84 Srv: iScsiServer::sendNotification: Send reaction finish.
12/21 21:15:36.116775 b84 func: <<< iScsiServer::sendNotification
12/21 21:15:36.116794 b84 func: <<< HANode::reportEvent
12/21 21:15:36.116813 b84 func: >>> HANode::shutdownCurrentNode_DegStor
12/21 21:15:36.116832 b84 func: >>> HANode::shutdownCurrentNode
12/21 21:15:36.116850 b84 HA: HANode::shutdownCurrentNode: shutdown iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd
12/21 21:15:36.116868 b84 func: >>> HANode::shutdownCurrentNode_internal
12/21 21:15:36.116985 b84 HA: HANode::setSyncStatus: Event: Device changed own sync status to 3 from 1. (target 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd')
12/21 21:15:36.117011 b84 func: >>> HANode::reportEvent
12/21 21:15:36.117030 b84 func: >>> iScsiServer::sendNotification
12/21 21:15:36.117048 b84 func: >>> CEventDataBase::AddRecord
12/21 21:15:36.117066 b84 EventDB: CEventDataBase::AddRecord: Code 775, severity 3, additional strings: 1
12/21 21:15:36.117115 b84 func: <<< CEventDataBase::AddRecord
12/21 21:15:36.117134 b84 Srv: iScsiServer::sendNotification: Send reaction start (reaction type 2, msg code 775).
12/21 21:15:36.117522 b84 Srv: iScsiServer::sendNotification: Send reaction finish.
12/21 21:15:36.117555 b84 func: <<< iScsiServer::sendNotification
12/21 21:15:36.117575 b84 func: <<< HANode::reportEvent
12/21 21:15:36.117593 b84 HA: HANode::setIsNodeActive: Params(active = 0) ENTERed for target 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'.
12/21 21:15:36.117625 b84 Plugin: iScsiPlugin::pluginCallback: Target iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: Stopping session 0xf1f7 from iqn.1991-05.com.microsoft:vh02.vhc.local,400001370000...
12/21 21:15:36.117694 f74 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10004 (0x2714)!
12/21 21:15:36.117734 f74 C[f1f7], LIN: iScsiConnection::recvData: Recv returned 10004 (0x2714)!
12/21 21:15:36.117695 b84 Plugin: iScsiPlugin::pluginCallback: Target iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: Stopping session 0xf1f8 from iqn.1991-05.com.microsoft:vh02.vhc.local,400001370004...
12/21 21:15:36.117751 f74 C[f1f7], LIN: iScsiConnection::receive: recvData returned error 10004 (0x2714)!
12/21 21:15:36.117815 f74 C[f1f7], LIN: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
12/21 21:15:36.117823 b84 Plugin: iScsiPlugin::pluginCallback: Target iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: Stopping session 0xf1f9 from iqn.1991-05.com.microsoft:vh03.vhc.local,400001370004...
12/21 21:15:36.117829 126c Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10004 (0x2714)!
12/21 21:15:36.118008 1460 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10004 (0x2714)!
12/21 21:15:36.118003 b84 Plugin: iScsiPlugin::pluginCallback: Target iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: Stopping session 0xf1fa from iqn.1991-05.com.microsoft:vh03.vhc.local,400001370000...
12/21 21:15:36.118021 126c C[f1f8], LIN: iScsiConnection::recvData: Recv returned 10004 (0x2714)!
12/21 21:15:36.118044 1460 C[f1f9], LIN: iScsiConnection::recvData: Recv returned 10004 (0x2714)!
12/21 21:15:36.118096 126c C[f1f8], LIN: iScsiConnection::receive: recvData returned error 10004 (0x2714)!
12/21 21:15:36.118103 b84 Plugin: iScsiPlugin::pluginCallback: Target iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: Stopping session 0xf1fb from iqn.1991-05.com.microsoft:vh01.vhc.local,400001370000...
12/21 21:15:36.118116 344 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10004 (0x2714)!
12/21 21:15:36.118130 1460 C[f1f9], LIN: iScsiConnection::receive: recvData returned error 10004 (0x2714)!
12/21 21:15:36.118158 126c C[f1f8], LIN: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
12/21 21:15:36.118215 17f0 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10004 (0x2714)!
12/21 21:15:36.118217 b84 Plugin: iScsiPlugin::pluginCallback: Target iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: Stopping session 0xf1fc from iqn.1991-05.com.microsoft:vh01.vhc.local,400001370007...
12/21 21:15:36.118311 344 C[f1fa], LIN: iScsiConnection::recvData: Recv returned 10004 (0x2714)!
12/21 21:15:36.118445 344 C[f1fa], LIN: iScsiConnection::receive: recvData returned error 10004 (0x2714)!
12/21 21:15:36.118550 344 C[f1fa], LIN: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
12/21 21:15:36.118398 17f0 C[f1fb], LIN: iScsiConnection::recvData: Recv returned 10004 (0x2714)!
12/21 21:15:36.118619 17f0 C[f1fb], LIN: iScsiConnection::receive: recvData returned error 10004 (0x2714)!
12/21 21:15:36.118726 17f0 C[f1fb], LIN: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
12/21 21:15:36.118754 1208 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10004 (0x2714)!
12/21 21:15:36.118818 1208 C[f1fc], LIN: iScsiConnection::recvData: Recv returned 10004 (0x2714)!
12/21 21:15:36.118848 1208 C[f1fc], LIN: iScsiConnection::receive: recvData returned error 10004 (0x2714)!
12/21 21:15:36.118865 1208 C[f1fc], LIN: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
12/21 21:15:36.118941 b84 func: <<< HANode::shutdownCurrentNode_internal
12/21 21:15:36.118970 b84 func: <<< HANode::shutdownCurrentNode
12/21 21:15:36.118337 1460 C[f1f9], LIN: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
12/21 21:15:36.119369 438 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 10.40.40.10:64725 to 10.40.40.20:3260. (Id = 0xf1fd)
12/21 21:15:36.119405 438 S[f1fd]: iScsiSession::iScsiSession: Session (0000007DA5F2E3C0)
12/21 21:15:36.119429 438 C[f1fd], FREE: iScsiConnection::doTransition: Event - CONNECTED.
12/21 21:15:36.119569 b84 func: <<< HANode::shutdownCurrentNode_DegStor
12/21 21:15:36.119607 b84 func: <<< HANode::onScsiOp_CustomControl_HeartbeatPingShutdown
12/21 21:15:36.119800 438 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 10.40.40.10:64724 to 10.40.40.20:3260. (Id = 0xf1fe)
12/21 21:15:36.119832 438 S[f1fe]: iScsiSession::iScsiSession: Session (0000007DA5F2EA00)
12/21 21:15:36.119854 438 C[f1fe], FREE: iScsiConnection::doTransition: Event - CONNECTED.
12/21 21:15:36.120173 438 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 10.40.40.15:52206 to 10.40.40.20:3260. (Id = 0xf1ff)
12/21 21:15:36.120207 438 S[f1ff]: iScsiSession::iScsiSession: Session (0000007DA5F2CB80)
12/21 21:15:36.120230 438 C[f1ff], FREE: iScsiConnection::doTransition: Event - CONNECTED.
12/21 21:15:36.120294 128c C[f1fe], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x400001370000, TSIH 0x0000.
12/21 21:15:36.120344 128c C[f1fe], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
12/21 21:15:36.120413 128c Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:vh02.vhc.local', accepted 'iqn.1991-05.com.microsoft:vh02.vhc.local'
12/21 21:15:36.120444 128c Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
12/21 21:15:36.120466 128c Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd', accepted 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.120482 128c Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
12/21 21:15:36.120541 438 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 10.40.40.5:61265 to 10.40.40.20:3260. (Id = 0xf200)
12/21 21:15:36.120603 d8c C[f1ff], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x400001370004, TSIH 0x0000.
12/21 21:15:36.120644 d8c C[f1ff], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
12/21 21:15:36.120700 d8c Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:vh03.vhc.local', accepted 'iqn.1991-05.com.microsoft:vh03.vhc.local'
12/21 21:15:36.120727 d8c Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
12/21 21:15:36.120744 d8c Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd', accepted 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.120767 d8c Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
12/21 21:15:36.121147 128c HA: HASyncNode::registerSession: Client initiator iqn.1991-05.com.microsoft:vh02.vhc.local is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd' target... (sessId = 0xf1fe, initiatorNameIsid = iqn.1991-05.com.microsoft:vh02.vhc.local,400001370000)
12/21 21:15:36.121193 128c HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
12/21 21:15:36.121208 128c HA: HASyncNode::registerSession: Return code 21.
12/21 21:15:36.121228 128c Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: can't register session. The device 'HAImage1' is not ready.
12/21 21:15:36.121242 128c T[f1fe,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
12/21 21:15:36.121320 438 S[f200]: iScsiSession::iScsiSession: Session (0000007DA5F2F000)
12/21 21:15:36.121366 15f4 C[f1fe], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
12/21 21:15:36.121712 114c C[f1fd], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x400001370004, TSIH 0x0000.
12/21 21:15:36.121723 128c Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10054 (0x2746)!
12/21 21:15:36.121757 114c C[f1fd], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
12/21 21:15:36.121785 128c C[f1fe], IN_LOGIN: iScsiConnection::recvData: Recv returned 10054 (0x2746)!
12/21 21:15:36.121804 438 C[f200], FREE: iScsiConnection::doTransition: Event - CONNECTED.
12/21 21:15:36.121811 d8c HA: HASyncNode::registerSession: Client initiator iqn.1991-05.com.microsoft:vh03.vhc.local is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd' target... (sessId = 0xf1ff, initiatorNameIsid = iqn.1991-05.com.microsoft:vh03.vhc.local,400001370004)
12/21 21:15:36.121835 128c C[f1fe], IN_LOGIN: iScsiConnection::receive: recvData returned error 10054 (0x2746)!
12/21 21:15:36.121851 114c Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:vh02.vhc.local', accepted 'iqn.1991-05.com.microsoft:vh02.vhc.local'
12/21 21:15:36.121894 d8c HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
12/21 21:15:36.121951 114c Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
12/21 21:15:36.121984 d8c HA: HASyncNode::registerSession: Return code 21.
12/21 21:15:36.122018 114c Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd', accepted 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.122032 d8c Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: can't register session. The device 'HAImage1' is not ready.
12/21 21:15:36.122048 114c Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
12/21 21:15:36.122066 d8c T[f1ff,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
12/21 21:15:36.122251 1710 C[f1ff], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
12/21 21:15:36.122321 438 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 10.40.40.15:52207 to 10.40.40.20:3260. (Id = 0xf201)
12/21 21:15:36.122496 eac C[f200], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x400001370007, TSIH 0x0000.
12/21 21:15:36.122533 eac C[f200], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
12/21 21:15:36.122544 d8c C[f1ff], IN_LOGIN: iScsiConnection::recvData: Recv - peer shutdown
12/21 21:15:36.122579 d8c C[f1ff], IN_LOGIN: iScsiConnection::receive: recvData returned error 10058 (0x274a)!
12/21 21:15:36.122589 eac Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:vh01.vhc.local', accepted 'iqn.1991-05.com.microsoft:vh01.vhc.local'
12/21 21:15:36.122617 eac Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
12/21 21:15:36.122633 eac Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd', accepted 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.122659 eac Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
12/21 21:15:36.122726 114c HA: HASyncNode::registerSession: Client initiator iqn.1991-05.com.microsoft:vh02.vhc.local is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd' target... (sessId = 0xf1fd, initiatorNameIsid = iqn.1991-05.com.microsoft:vh02.vhc.local,400001370004)
12/21 21:15:36.122769 114c HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
12/21 21:15:36.122797 114c HA: HASyncNode::registerSession: Return code 21.
12/21 21:15:36.122817 114c Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: can't register session. The device 'HAImage1' is not ready.
12/21 21:15:36.122831 114c T[f1fd,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
12/21 21:15:36.122854 700 S[f1fe]: iScsiSession::~iScsiSession: ~Session
12/21 21:15:36.122861 17b0 S[f1ff]: iScsiSession::~iScsiSession: ~Session
12/21 21:15:36.122942 15d4 C[f1fd], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
12/21 21:15:36.122868 438 S[f201]: iScsiSession::iScsiSession: Session (0000007DA5F2F600)
12/21 21:15:36.123164 114c C[f1fd], IN_LOGIN: iScsiConnection::recvData: Recv - peer shutdown
12/21 21:15:36.123190 114c C[f1fd], IN_LOGIN: iScsiConnection::receive: recvData returned error 10058 (0x274a)!
12/21 21:15:36.123507 438 C[f201], FREE: iScsiConnection::doTransition: Event - CONNECTED.
12/21 21:15:36.123510 5f8 S[f1fd]: iScsiSession::~iScsiSession: ~Session
12/21 21:15:36.123512 eac HA: HASyncNode::registerSession: Client initiator iqn.1991-05.com.microsoft:vh01.vhc.local is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd' target... (sessId = 0xf200, initiatorNameIsid = iqn.1991-05.com.microsoft:vh01.vhc.local,400001370007)
12/21 21:15:36.123614 eac HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
12/21 21:15:36.123636 eac HA: HASyncNode::registerSession: Return code 21.
12/21 21:15:36.123649 eac Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: can't register session. The device 'HAImage1' is not ready.
12/21 21:15:36.123669 eac T[f200,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
12/21 21:15:36.123835 650 C[f200], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
12/21 21:15:36.123967 438 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 10.40.40.5:61264 to 10.40.40.20:3260. (Id = 0xf202)
12/21 21:15:36.123975 680 C[f201], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x400001370000, TSIH 0x0000.
12/21 21:15:36.124009 438 S[f202]: iScsiSession::iScsiSession: Session (0000007DA5F2CB80)
12/21 21:15:36.124075 680 C[f201], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
12/21 21:15:36.124098 438 C[f202], FREE: iScsiConnection::doTransition: Event - CONNECTED.
12/21 21:15:36.124165 680 Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:vh03.vhc.local', accepted 'iqn.1991-05.com.microsoft:vh03.vhc.local'
12/21 21:15:36.124190 680 Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
12/21 21:15:36.124198 eac Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10054 (0x2746)!
12/21 21:15:36.124207 680 Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd', accepted 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.124238 eac C[f200], IN_LOGIN: iScsiConnection::recvData: Recv returned 10054 (0x2746)!
12/21 21:15:36.124271 680 Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
12/21 21:15:36.124293 eac C[f200], IN_LOGIN: iScsiConnection::receive: recvData returned error 10054 (0x2746)!
12/21 21:15:36.124478 13b0 C[f202], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x400001370000, TSIH 0x0000.
12/21 21:15:36.124527 13b0 C[f202], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
12/21 21:15:36.124826 13b0 Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:vh01.vhc.local', accepted 'iqn.1991-05.com.microsoft:vh01.vhc.local'
12/21 21:15:36.124856 13b0 Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
12/21 21:15:36.124878 13b0 Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd', accepted 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.124895 13b0 Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
12/21 21:15:36.125089 1554 S[f200]: iScsiSession::~iScsiSession: ~Session
12/21 21:15:36.125094 680 HA: HASyncNode::registerSession: Client initiator iqn.1991-05.com.microsoft:vh03.vhc.local is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd' target... (sessId = 0xf201, initiatorNameIsid = iqn.1991-05.com.microsoft:vh03.vhc.local,400001370000)
12/21 21:15:36.125166 680 HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
12/21 21:15:36.125186 680 HA: HASyncNode::registerSession: Return code 21.
12/21 21:15:36.125199 680 Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: can't register session. The device 'HAImage1' is not ready.
12/21 21:15:36.125218 680 T[f201,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
12/21 21:15:36.125333 c58 C[f201], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
12/21 21:15:36.125607 680 C[f201], IN_LOGIN: iScsiConnection::recvData: Recv - peer shutdown
12/21 21:15:36.125631 680 C[f201], IN_LOGIN: iScsiConnection::receive: recvData returned error 10058 (0x274a)!
12/21 21:15:36.125838 1758 S[f201]: iScsiSession::~iScsiSession: ~Session
12/21 21:15:36.125851 13b0 HA: HASyncNode::registerSession: Client initiator iqn.1991-05.com.microsoft:vh01.vhc.local is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd' target... (sessId = 0xf202, initiatorNameIsid = iqn.1991-05.com.microsoft:vh01.vhc.local,400001370000)
12/21 21:15:36.125916 13b0 HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
12/21 21:15:36.125932 13b0 HA: HASyncNode::registerSession: Return code 21.
12/21 21:15:36.125950 13b0 Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd: can't register session. The device 'HAImage1' is not ready.
12/21 21:15:36.125964 13b0 T[f202,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
12/21 21:15:36.126090 13a0 C[f202], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
12/21 21:15:36.126343 13b0 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10054 (0x2746)!
12/21 21:15:36.126372 13b0 C[f202], IN_LOGIN: iScsiConnection::recvData: Recv returned 10054 (0x2746)!
12/21 21:15:36.126387 13b0 C[f202], IN_LOGIN: iScsiConnection::receive: recvData returned error 10054 (0x2746)!
12/21 21:15:36.126471 1c8 S[f202]: iScsiSession::~iScsiSession: ~Session
12/21 21:15:36.150488 9c8 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01A00D0) ssc(0x0000007DB59A5750) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 13458 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/21 21:15:36.150519 9c8 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01A00D0) ssc(0x0000007DB59A5750) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 13458 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/21 21:15:36.150544 9c8 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DAD3C1F60, opcode 0x8A) execution time is 13437 ms.
12/21 21:15:36.150570 9c8 func: >>> HANode::reportEvent
12/21 21:15:36.150596 9c8 func: >>> iScsiServer::sendNotification
12/21 21:15:36.150615 9c8 func: >>> CEventDataBase::AddRecord
12/21 21:15:36.150634 9c8 EventDB: CEventDataBase::AddRecord: Code 809, severity 2, additional strings: 3
12/21 21:15:36.150691 9c8 func: <<< CEventDataBase::AddRecord
12/21 21:15:36.150711 9c8 Srv: iScsiServer::sendNotification: Send reaction start (reaction type 2, msg code 809).
12/21 21:15:36.151188 9c8 Srv: iScsiServer::sendNotification: Send reaction finish.
12/21 21:15:36.151223 9c8 func: <<< iScsiServer::sendNotification
12/21 21:15:36.151248 9c8 func: <<< HANode::reportEvent
12/21 21:15:36.151267 9c8 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DAD3C1F30) storageRequest(0x0000007DAD3C1F60) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x8A) ex(0x0) buf_size(262144), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 13437 ms, DiffTimeCompleteINIT = 13437 ms, DiffTimeCompleteEXEC = 13437 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.235590 9c8 IMG: ImageFile_IoCompleted: Warning(Time FileIO): request(0x0000007DA01AF980) ssc(0x0000007DB0C92660) function(Execute SCSI Command) opCode(0x8A), timeFileIO = 13537 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/21 21:15:36.235623 9c8 IMG: ImageFile_ScsiCompleteRequest: Warning(Time Request EXEC): request(0x0000007DA01AF980) ssc(0x0000007DB0C92660) function(Execute SCSI Command) opCode(0x8A), timeExecRequest = 13537 ms, g_cmdExecTimeWarningLimitInSec = 3 s. Device: 'E:\SSD Storage\SSD Storage.img'
12/21 21:15:36.235648 9c8 Common: CStarWindStorageDevice::AsyncReadWriteCompleted: Underlying storage request(0x0000007DAD5B0990, opcode 0x8A) execution time is 13531 ms.
12/21 21:15:36.235669 9c8 func: >>> HANode::reportEvent
12/21 21:15:36.235690 9c8 func: >>> iScsiServer::sendNotification
12/21 21:15:36.235709 9c8 func: >>> CEventDataBase::AddRecord
12/21 21:15:36.235728 9c8 EventDB: CEventDataBase::AddRecord: Code 809, severity 2, additional strings: 3
12/21 21:15:36.235782 9c8 func: <<< CEventDataBase::AddRecord
12/21 21:15:36.235801 9c8 Srv: iScsiServer::sendNotification: Send reaction start (reaction type 2, msg code 809).
12/21 21:15:36.236220 9c8 Srv: iScsiServer::sendNotification: Send reaction finish.
12/21 21:15:36.236254 9c8 func: <<< iScsiServer::sendNotification
12/21 21:15:36.236274 9c8 func: <<< HANode::reportEvent
12/21 21:15:36.236297 9c8 HA: Ssc_Request_Task::complete: Warning(Underlying storage): sscRequestTask(0x0000007DAD5B0960) storageRequest(0x0000007DAD5B0990) GeneralFunction(Execute SCSI Command) GeneralOpCode(0xF2) ex(0x0) buf_size(262144), StorageOpCode(0x8A), StorageDiffTimeCompleteEXEC = 13531 ms, DiffTimeCompleteINIT = 13531 ms, DiffTimeCompleteEXEC = 13531 ms. Current target: 'iqn.2008-08.com.starwindsoftware:vsan2-csv3-ssd'
12/21 21:15:36.251100 b84 func: >>> HANode::onScsiOp_CustomControl_HeartbeatPing
12/21 21:15:36.251967 b84 Common: CStarWindStorageDevice::hangedTaskExist: Storage device hanged operation: location = E:\SSD Storage\SSD Storage.img, request = 0x0000007DA60DEAC0, cdb[0] = 0x8A, cdb[1] = 0x00, buf_size = 262144, execution time (13234 msec) more than timeout (9000 msec)!
1. Does reconnection and sync drops happen during backups?
there are random hourly backup jobs throughout the day/every day. and the issue is not happening daily.
2. Does reconnecting state go in 30 mins after being discovered?
I have no answer
3. Do delays precede synchronization drops?
I have no answer for that, because we used to run after user complaints that they lost access to some applications hosted on the VMs. we quickly start restarting the services etc...
4. Do you have write-back cache on your devices?
We have write-back cache enabled on the raid controller for this SSD logical volume / RAID5
Also, the strip size/full stripe size for this volume is 64KiB/256KiB. is this an optimal value for SSD logical volume?
Also, we have HPE SSD Smartpath enabled for this logical volume.
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Mon Feb 26, 2024 10:43 am

Thank you for your update.
Can I please have the full set of logs?
I have no answer for that, because we used to run after user complaints that they lost access to some applications hosted on the VMs. we quickly start restarting the services etc...
It does not look like an issue related to StarWind VSAN as there would be CSV-scale outage otherwise.
Thanks for your update.
Try disabling write-back cache (https://knowledgebase.starwindsoftware. ... -l1-cache/) and updating StarWind VSAN (https://knowledgebase.starwindsoftware. ... d-version/). Get StarWind VSAN latest build for Windows https://starwind.com/tmplink/starwind-v8.exe.

P.S. Check if stopping a service on one node leads to VM downtime.
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Tue Feb 27, 2024 7:40 pm

i'm confused about the strip size / full stripe size configuration.
as per SW recommendations stripe size for SSD / RAID5 configuration is 64kb
are these the right settings?
strip size full stripe size.jpg
strip size full stripe size.jpg (116.74 KiB) Viewed 18599 times
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Tue Feb 27, 2024 8:33 pm

There could just too much I/O arriving to the VSAN. And as a result the threshold could have been exceeded. Can I please have the full set of logs from both nodes collected with log collector (https://knowledgebase.starwindsoftware. ... collector/) so I can look into the issue?
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Wed Feb 28, 2024 2:15 pm

here you go. the logs are reachable on wetransfer link below. I couldn't attach it because the files are bigger than 30MB.

https://we.tl/t-E5so6iUmTA
yaroslav (staff)
Staff
Posts: 2960
Joined: Mon Nov 18, 2019 11:11 am

Wed Feb 28, 2024 2:46 pm

I will check those in good order.
logitech
Posts: 32
Joined: Sun Feb 04, 2024 9:50 am

Thu Feb 29, 2024 1:46 pm

yaroslav (staff) wrote:
Wed Feb 28, 2024 2:46 pm
I will check those in good order.
any update ?
Post Reply