Hi,
Thanks for the logs.
I was not able to find any events in the 01 log about the partner going out of sync.
NODE1
3/15 20:04:53.987656 182 HA: HASyncNode::fixTypeForSync: Synchronization method HA_SYNCH_TYPE_FAST will be changed to HA_SYNC_TYPE_FULL because storage IO errors were detected.
Here is 01 changing its status to Synchronizing:
3/15 20:04:53.994867 8b HA: HANode::setSyncStatus: Event: Device changed own sync status to 2 from 3. (target 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2')
Note that sync was running only for DS2.
Also, 01 has no logs older than Mar 12, while the other node contains older logs. The logs on 01 are flooded with:
3/12 9:53:56.127478 f3 Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2: can't register session. The device 'HAImage2' is not ready.
3/12 9:53:56.127489 f3 T[15369,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
3/12 9:53:56.127608 18b C[15369], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
3/12 9:53:56.378476 f3 C[15369], IN_LOGIN: iScsiConnection::recvData: Recv - peer shutdown
3/12 9:53:56.378529 f3 C[15369], IN_LOGIN: iScsiConnection::receive: recvData returned error 10058 (0x274a)!
3/12 9:53:56.379032 a8 S[15369]: iScsiSession::~iScsiSession: ~Session
3/12 9:53:58.068781 9 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 172.16.10.20:14920 to 172.16.10.22:3260. (Id = 0x1536a)
3/12 9:53:58.068839 9 S[1536a]: iScsiSession::iScsiSession: Session (00007FE23B1D1700)
3/12 9:53:58.068851 9 C[1536a], FREE: iScsiConnection::doTransition: Event - CONNECTED.
3/12 9:53:58.319782 1f2 C[1536a], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x00023d000001, TSIH 0x0000.
3/12 9:53:58.319868 1f2 C[1536a], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
3/12 9:53:58.319926 1f2 Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3', accepted 'iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3'
3/12 9:53:58.319953 1f2 Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2', accepted 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2'
3/12 9:53:58.319967 1f2 Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
3/12 9:53:58.320001 1f2 Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'CHAP,None', accepted 'CHAP'
3/12 9:53:58.324800 1f2 HA: HASyncNode::registerSession: Client initiator iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3 is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2' target... (sessId = 0x1536a, initiatorNameIsid = iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3,00023D000001)
3/12 9:53:58.324821 1f2 HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
3/12 9:53:58.324832 1f2 HA: HASyncNode::registerSession: Return code 21.
This means that DS2 on 10.20.10.22 had already been out of sync for quite a while. DS1 was in sync, though.
Based on the logs from Node2, DS2 last went out of sync on Mar 10.
Line 3483: 3/10 21:33:23.193383 bd HA: CHAPartnerNode::SetSyncStatus: Event: Partner device changed sync status to 3 from 1. (partner: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1')
Line 3536: 3/10 21:46:33.881135 78 HA: CHAPartnerNode::SetSyncStatus: Event: Partner device changed sync status to 3 from 1. (partner: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2')
DS1 went back to sync, while DS2 did not.
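If you want to pull these sync status changes out of the logs yourself, here is a minimal Python sketch. It assumes the log lines look exactly like the ones quoted above; the regex and the function name `sync_transitions` are my own and not part of any StarWind tooling. Going by the events quoted above, 2 seems to correspond to Synchronizing and a change from 1 to 3 to a device dropping out of sync; treat that mapping as my reading of these logs rather than a documented table.

```python
import re

# Matches both event flavours seen in the excerpts above:
#   "Device changed own sync status to 2 from 3. (target '...')"
#   "Partner device changed sync status to 3 from 1. (partner: '...')"
# ASSUMPTION: the layout is inferred from the quoted lines only.
SYNC_EVENT = re.compile(
    r"(?P<date>\d{1,2}/\d{1,2})\s+(?P<time>\d{1,2}:\d{2}:\d{2}\.\d+)\s+\S+\s+HA:\s+"
    r"\S+:\s+Event:\s+(?P<who>Device changed own|Partner device changed)\s+sync status "
    r"to (?P<new>\d+) from (?P<old>\d+)\.\s+\((?:target|partner:)\s+'(?P<iqn>[^']+)'\)"
)

def sync_transitions(lines):
    """Yield (date, time, who, old, new, iqn) for every sync status change found."""
    for line in lines:
        m = SYNC_EVENT.search(line)
        if m:
            who = "partner" if m["who"].startswith("Partner") else "own"
            yield (m["date"], m["time"], who, int(m["old"]), int(m["new"]), m["iqn"])
```

Feed it the lines of a log file (for example `sync_transitions(open(path, errors="replace"))`, where `path` is wherever your logs live) and compare the last event per target on each node.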
HA devices went out of sync due to timeouts.
3/10 21:30:07.622371 40 HA: Ssc_Request_Task::complete: Warning(Partner send): sscRequestTask(0x0000000043851270) partnerRequest(0x0000000043C72010) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x2A) ex(0x0), PartnerOpCode(0xF2), PartnerDiffTimeCompleteEXEC = 6452, ms DiffTimeCompleteINIT = 6452 ms, DiffTimeCompleteEXEC = 6452 ms. Current target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1', Partner traget: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1'!
3/10 21:33:23.167077 40 Common: *** iscsi_tcp_dispatch: request execution timed out (9999 ms) for target iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1.
and
3/10 21:33:23.205687 a3 HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x0000000043C7A370) Funciton(Execute SCSI Command) OpCode(0x89) ex(0x0) DiffTimeCompleteINIT = 10049 ms, DiffTimeCompleteEXEC = 0 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1'
3/10 21:33:23.207897 b8 HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x000000004486A7E0) Funciton(Execute SCSI Command) OpCode(0x89) ex(0x0) DiffTimeCompleteINIT = 9966 ms, DiffTimeCompleteEXEC = 0 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1'
3/10 21:33:23.208585 b9 T[1c,df50af]: iScsiTask::handleTaskMgmtCmd: Management command: abort task (CmdSN 14635137, ITT 0x9950df00) not found.
3/10 21:33:23.209503 40 HA: Ssc_Request_Task::complete: Warning(Partner send): sscRequestTask(0x000000004370E6E0) partnerRequest(0x0000000043C64FE0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x2A) ex(0x0), PartnerOpCode(0xF2), PartnerDiffTimeCompleteEXEC = 9995, ms DiffTimeCompleteINIT = 9996 ms, DiffTimeCompleteEXEC = 9996 ms. Current target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1', Partner traget: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1'!
Please check the underlying storage health.
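To see how often and when the storage was slow, you can scan the logs for the same two patterns quoted above. This is only a sketch under the assumption that the messages keep the format shown here; `slow_requests` and the 5000 ms default threshold are arbitrary choices of mine, not StarWind settings.

```python
import re

# ASSUMPTION: patterns are derived from the log excerpts above only.
TIMESTAMP = re.compile(r"\d{1,2}/\d{1,2}\s+\d{1,2}:\d{2}:\d{2}\.\d+")
DIFF_INIT = re.compile(r"DiffTimeCompleteINIT = (\d+) ms")
TIMED_OUT = re.compile(r"request execution timed out \((\d+) ms\) for target (\S+?)\.?\s*$")

def slow_requests(lines, threshold_ms=5000):
    """Return (timestamp, kind, milliseconds) for slow or timed-out requests."""
    hits = []
    for line in lines:
        ts = TIMESTAMP.search(line)
        stamp = ts.group(0) if ts else "?"
        timed_out = TIMED_OUT.search(line)
        if timed_out:
            # Dispatcher gave up on the request entirely.
            hits.append((stamp, "timeout", int(timed_out.group(1))))
            continue
        diff = DIFF_INIT.search(line)
        if diff and int(diff.group(1)) >= threshold_ms:
            # Request completed, but took longer than the chosen threshold.
            hits.append((stamp, "slow", int(diff.group(1))))
    return hits
```

If the hits cluster around particular times (backup windows, RAID rebuilds, and so on), that is a good place to start when checking the underlying storage.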
I noticed that you have 3 GB of Write-Back cache. The service could have stopped unexpectedly, and that could have triggered a full sync. See more on what can start a full sync:
https://knowledgebase.starwindsoftware. ... may-start/.
Also, I noticed a priority shift (i.e., one device has 1st priority while the other has 2nd priority). See more about device priorities:
https://forums.starwindsoftware.com/vie ... ity#p30321.
How to change HA priority:
https://forums.starwindsoftware.com/vie ... ity#p31795.
How to tweak timeouts:
1. Make sure that HA devices are synchronized.
2. Make sure that HA devices have iSCSI connections to them (this step is needed if you have client VMs there).
3. Stop the StarWind VSAN service on one node.
4. Go to the config file and locate these parameters:
StorPerfDegTimeLimitMs (set value to 10000)
iScsiGenCmdSendCmdTimeoutInSec (set value to 15)
iScsiPingCmdSendCmdTimeoutInSec (set value to 10)
5. Start the service.
6. Wait for the fast sync to be over.
7. Repeat on the other server.
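If you prefer to script step 4 instead of editing by hand, something along these lines could work. It is a sketch only: it assumes each parameter is stored in the config file as an element of the form `<Name value="..."/>`, which you should verify against your own file first, and `apply_timeouts` is a helper name I made up. It works on the file's text and reports any parameter it could not find rather than adding it.

```python
import re

# Values from step 4 above.
TIMEOUTS = {
    "StorPerfDegTimeLimitMs": "10000",
    "iScsiGenCmdSendCmdTimeoutInSec": "15",
    "iScsiPingCmdSendCmdTimeoutInSec": "10",
}

def apply_timeouts(cfg_text, values=TIMEOUTS):
    """Return (new_text, missing); missing lists parameters not found in cfg_text."""
    missing = []
    for name, value in values.items():
        # ASSUMPTION: parameters look like <Name value="..."/> in the config file.
        pattern = re.compile(r'(<%s\s+value=")[^"]*(")' % re.escape(name))
        cfg_text, count = pattern.subn(
            lambda m, v=value: m.group(1) + v + m.group(2), cfg_text
        )
        if count == 0:
            missing.append(name)
    return cfg_text, missing
```

Run it only while the service is stopped (step 3), keep a backup copy of the original file, and check the `missing` list before starting the service again.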