HA image falling out of sync

primo · Mon Mar 15, 2021 8:30 pm

Hi,

I've noticed that one HA image in our 2-node hyperconverged cluster fell out of sync (which is interesting, because the other remaining HA image was ok) and stayed unsynced until I've manually triggered a sync. After that, a full synhronization started.
Which logs should I investigate if we use Linux VSAN appliances? Any idea about what could be going on?

: sw2.PNG (43.32 KiB) Viewed 10559 times

: sw61.jpg (198.5 KiB) Viewed 10559 times

: sw71.jpg (255.43 KiB) Viewed 10559 times

Thanks!

Tue Mar 16, 2021 4:13 am

Hi,

Welcome to StarWind Forum. Yes, please collect the logs as described here and share the collection with me https://knowledgebase.starwindsoftware. ... collector/.

primo · Tue Mar 16, 2021 8:53 am

Thank you! I get the following error while trying to collect the logs:

[root@SW1 CollectLogs]# ./collectswlogs.run ./SW1-Logs
Verifying archive integrity... All good.
Uncompressing Collect Logs script 100%
./collectlogs.sh: line 7: [: missing `]'

Tue Mar 16, 2021 9:09 am

But does the script collect the logs?

primo · Fri Mar 19, 2021 7:44 am

Yes it does. Great, I'm already looking forward to identifying the root cause. Thank you!

Best Regards,
primo

Sat Mar 20, 2021 11:38 am

Hi,

Thanks for logs.
I was not able to find any events on 01 log about the partner going out of sync.
NODE1
3/15 20:04:53.987656 182 HA: HASyncNode::fixTypeForSync: Synchronization method HA_SYNCH_TYPE_FAST will be changed to HA_SYNC_TYPE_FULL because storage IO errors were detected.
here is 01 changing its status to Synchronizing
3/15 20:04:53.994867 8b HA: HANode::setSyncStatus: Event: Device changed own sync status to 2 from 3. (target 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2')
Note that sync was running only for DS2.
Also, there are no logs older than Mar 12, while another node contains older logs. Logs on 01 are flooded with 3/12 9:53:56.127478 f3 Tgt: *** iScsiTarget::openSession: iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2: can't register session. The device 'HAImage2' is not ready.
3/12 9:53:56.127489 f3 T[15369,1]: ***iScsiTask::startLoginPhase: *ERROR* Login request: device open failed.
3/12 9:53:56.127608 18b C[15369], IN_LOGIN: iScsiConnection::doTransition: Event - LOGIN_REJECT.
3/12 9:53:56.378476 f3 C[15369], IN_LOGIN: iScsiConnection::recvData: Recv - peer shutdown
3/12 9:53:56.378529 f3 C[15369], IN_LOGIN: iScsiConnection::receive: recvData returned error 10058 (0x274a)!
3/12 9:53:56.379032 a8 S[15369]: iScsiSession::~iScsiSession: ~Session
3/12 9:53:58.068781 9 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from 172.16.10.20:14920 to 172.16.10.22:3260. (Id = 0x1536a)
3/12 9:53:58.068839 9 S[1536a]: iScsiSession::iScsiSession: Session (00007FE23B1D1700)
3/12 9:53:58.068851 9 C[1536a], FREE: iScsiConnection::doTransition: Event - CONNECTED.
3/12 9:53:58.319782 1f2 C[1536a], XPT_UP: iScsiConnection::handleFirstLogin: Login request: ISID 0x00023d000001, TSIH 0x0000.
3/12 9:53:58.319868 1f2 C[1536a], XPT_UP: iScsiConnection::doTransition: Event - LOGIN.
3/12 9:53:58.319926 1f2 Params: iScsiParameter::update: <<< String param 'InitiatorName': received 'iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3', accepted 'iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3'
3/12 9:53:58.319953 1f2 Params: iScsiParameter::update: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2', accepted 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2'
3/12 9:53:58.319967 1f2 Params: iScsiParameter::update: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
3/12 9:53:58.320001 1f2 Params: iScsiParameter::update: <<< Enum param 'AuthMethod': received 'CHAP,None', accepted 'CHAP'
3/12 9:53:58.324800 1f2 HA: HASyncNode::registerSession: Client initiator iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3 is trying to register a session within the 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2' target... (sessId = 0x1536a, initiatorNameIsid = iqn.1998-01.com.vmware:wit-a-esx1-4a77acb3,00023D000001)
3/12 9:53:58.324821 1f2 HA: HASyncNode::registerSession: Unable to register the new client session. The node is not active!
3/12 9:53:58.324832 1f2 HA: HASyncNode::registerSession: Return code 21.

This means that DS2 on 10.20.10.22 was not synchronized quite a while ago. DS1 was in sync though.
Based on the logs from Node2, DS2 went out of sync last time on MAR 10.
Line 3483: 3/10 21:33:23.193383 bd HA: CHAPartnerNode::SetSyncStatus: Event: Partner device changed sync status to 3 from 1. (partner: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1')
Line 3536: 3/10 21:46:33.881135 78 HA: CHAPartnerNode::SetSyncStatus: Event: Partner device changed sync status to 3 from 1. (partner: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds2')
DS1 went back to sync, while DS2 did not.

HA devices went out of sync due to timeouts.
3/10 21:30:07.622371 40 HA: Ssc_Request_Task::complete: Warning(Partner send): sscRequestTask(0x0000000043851270) partnerRequest(0x0000000043C72010) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x2A) ex(0x0), PartnerOpCode(0xF2), PartnerDiffTimeCompleteEXEC = 6452, ms DiffTimeCompleteINIT = 6452 ms, DiffTimeCompleteEXEC = 6452 ms. Current target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1', Partner traget: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1'!
3/10 21:33:23.167077 40 Common: *** iscsi_tcp_dispatch: request execution timed out (9999 ms) for target iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1.
and
3/10 21:33:23.205687 a3 HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x0000000043C7A370) Funciton(Execute SCSI Command) OpCode(0x89) ex(0x0) DiffTimeCompleteINIT = 10049 ms, DiffTimeCompleteEXEC = 0 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1'
3/10 21:33:23.207897 b8 HA: Ssc_Request_Task::complete: Warning: sscRequestTask(0x000000004486A7E0) Funciton(Execute SCSI Command) OpCode(0x89) ex(0x0) DiffTimeCompleteINIT = 9966 ms, DiffTimeCompleteEXEC = 0 ms. Target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1'
3/10 21:33:23.208585 b9 T[1c,df50af]: iScsiTask::handleTaskMgmtCmd: Management command: abort task (CmdSN 14635137, ITT 0x9950df00) not found.
3/10 21:33:23.209503 40 HA: Ssc_Request_Task::complete: Warning(Partner send): sscRequestTask(0x000000004370E6E0) partnerRequest(0x0000000043C64FE0) GeneralFunction(Execute SCSI Command) GeneralOpCode(0x2A) ex(0x0), PartnerOpCode(0xF2), PartnerDiffTimeCompleteEXEC = 9995, ms DiffTimeCompleteINIT = 9996 ms, DiffTimeCompleteEXEC = 9996 ms. Current target: 'iqn.2008-08.com.starwindsoftware:10.20.10.23-ds1', Partner traget: 'iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1'!

Please check the underlying storage health.

I noticed that you have 3 GB of Write-Back cache. Service could stop unexpectedly and that could have triggered full sync. See more on what can start full sync https://knowledgebase.starwindsoftware. ... may-start/.
Also, I noticed the priority shift (i.e., one device is of the 1st priority while another is of 2nd priority). See more about devices' priorities https://forums.starwindsoftware.com/vie ... ity#p30321. How to change HA priority https://forums.starwindsoftware.com/vie ... ity#p31795.

How to tweak timeouts:

1. Make sure that HA devices are synchronized.
2. Make sure that HA devices have iSCSI connections to them (this step is needed if you have client VMs there).
3. Stop StarWind VSAN service on one node.
4. Go To the config file and locate these parameters

StorPerfDegTimeLimitMs (set value to 10000)
iScsiGenCmdSendCmdTimeoutInSec (set value to 15)
iScsiPingCmdSendCmdTimeoutInSec (set value to 10)

5. Start the service.
6. Wait for the fast sync to be over.
7. Repeat on another serer.

primo · Mon Mar 22, 2021 5:51 pm

Hello,

thank you for the analysis. Is the following procedure correct?

1. Enable maintenance mode for all devices
2. Stop services -> which services should be stopped?
2. Change the priority for the DS2 device only (change <priority> parameter from 1 to 0 and vice versa for owner and partner nodes for both nodes in .swdsk files)
3. Restart VSAN appliances
4. Exit maintenance mode

Regarding the size of the Write-Back cache. Did you mean that the size is not optimal?

Thanks!

Tue Mar 23, 2021 4:26 am

Primo,

No. By enabling maintenance mode you will literally remove the shared storage from your nodes. See more at https://www.starwindsoftware.com/help/M ... eMode.html
See this (improved) procedure
1. Make sure that HA devices are synchronized.
2. Make sure that HA devices have iSCSI connections to them (this step is needed if you have client VMs there).
3. Stop StarWind VSAN service on one node. Run systemctl stop StarWindVSA only on one StarWind VM.
4. Copy the StarWind.cfg file at /opt/StarWind/StarWindVSA/drive_c/StarWind/
5. Go To the config file and locate these parameters.
StorPerfDegTimeLimitMs (set value to 10000)
iScsiGenCmdSendCmdTimeoutInSec (set value to 15)
iScsiPingCmdSendCmdTimeoutInSec (set value to 10)
6. Save the changes.
7. Go to DS2 HA header.
8. Copy the HA header.
9. Apply the priority fix to one header.
10. Start the service.
11. Wait for the fast sync to be over.
12 . Repeat on another serer.

primo · Wed Mar 24, 2021 7:45 am

Thank you, I've performed the changes. Should I also change the size of the Write-Back cache?

Wed Mar 24, 2021 8:50 am

We recommend using 1 GB of RAM cache per 1 TB of HA device.
But please note that the larger cache you have the longer it takes for the StarWindVSA to stop; therefore, the risk of getting full sync instead of fast increases (full sync is triggered if RAM cache does not flush properly; find out more about reasons of full sync https://knowledgebase.starwindsoftware. ... may-start/).

Let me know if you have more questions.

primo · Wed Mar 24, 2021 6:18 pm

Do you know what might be causing this events:
2021-03-23 22:58:42: ID_825 HA Device iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1: command "0x89". Request execution time is longer than expected. Response time is 11 sec
2021-03-24 19:14:14: ID_825 HA Device iqn.2008-08.com.starwindsoftware:10.20.10.22-ds1: command "0xF3". Request execution time is longer than expected. Response time is 10 sec
...and many more similar events...

Do you think that the poor performance of the storage could be related to the fact that we use a Linux and not a Windows VSA? Some guys on the forum were reporting that.
The underlying storage is not that bad, both nodes have the following configuration:

HDD: 4x 2.4TB 10K SAS
Hardware RAID (1 GB BBWC)
- RAID type: RAID 10
- Disk cache policy: Disabled
- Write policy: Write-back
- Read policy: Read-ahead
- Stripe Size: 64K
The vHDDs inside VSA appliances are thick provision eager zeroed (VMFS6)

Thank you!

Thu Mar 25, 2021 10:32 am

Hello,

Those are delays that could happen during backups. Did sync drop? Those can be latency spikes (i.e., latencies get back to the norm before storage is marked as degraded/not synchronized).