How to get maintenance status


yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Tue Oct 27, 2020 6:21 am

Hi,

Please check whether both hosts were restarted at the same time. Sorry, I do not have the logs at hand right now; could you remind me whether you have RAM cache configured?
Did you back up the underlying storage?
See if there is anything about "Underlying storage response time" in the application log before the sync drop.

You can also share a new package of logs. Does that happen every time you do a backup?
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Oct 27, 2020 11:14 pm

Here's an update. I'm still digging through event logs and such.

It happened again this afternoon, very similar if not the exact same thing.

On one host (HV4), the VSAN stayed up and synced and the Failover Cluster stayed up, but the other host (HV3) lost virtually all iSCSI connectivity, and both its Witness and CSV devices went unsynchronized. The StarWind Management Console stayed up on both hosts and reported the same status on both.

Looking at the event logs on HV3, one second before the iSCSI connections were lost, I see two Error 10 entries from source iScsiPrt with the text "Login request failed. The logon response packet is given in the dump data."

I'd call that a smoking gun. There's nothing else in that host's event logs for an hour before that.

On the other server (HV4, the one that stayed up) I see related activity at the same time. There's an informational iScsiPrt Event ID 32, "Initiator received an asynchronous logout message. The Target name is given in the dump data.", immediately followed by an iScsiPrt Event ID 10 "Login request failed", just like on the other server. Technically, the logout shows up maybe a hundredth of a second _after_ the other server starts to list login errors, but that might just be the servers' clocks being a hundredth of a second out of sync.
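For reference, events like these can be pulled out of the System log with PowerShell rather than scrolling Event Viewer. This is just a sketch; the provider name iScsiPrt and event IDs 10/32 are taken from the entries described above:

```powershell
# Sketch: list recent iScsiPrt login/logout events (IDs 10 and 32) from the System log,
# oldest first, so the logout/login-failure ordering across hosts is easy to compare.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'iScsiPrt'; Id = 10, 32 } -MaxEvents 50 |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, Message -AutoSize
```

Running this on both hosts and comparing the TimeCreated columns should show whether the Event ID 32 logout really precedes the Event ID 10 login failures.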

In HV4's event logs I see some activity related to Microsoft Updates, but it didn't appear to actually install any, and there's a time gap before the problem occurs.

Looking online, I see an interesting reference to problems with Microsoft KB4499177. I have been doing some MS updates lately, after applying the StarWind-recommended CAU script. But I don't see that one listed in Installed Updates.

I stopped and restarted the StarWind VSAN service on the host that was showing as unsynced and had lost its iSCSI connections, and it resynced immediately.

And this doesn't look backup-related -- I stopped backups on both hosts. Yes, I do run RAM and SSD cache.

That iScsiPrt logon stuff looks very suspicious, wouldn't you say?

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Wed Oct 28, 2020 4:56 am

Please disable SSD and RAM cache as discussed here https://knowledgebase.starwindsoftware. ... -l1-cache/ and https://knowledgebase.starwindsoftware. ... dance/661/.
Can I have a new set of logs, please?
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Wed Oct 28, 2020 11:12 pm

OK, I've removed SSD and RAM cache for both CSV and Witness devices on both hosts. That's a total of eight swdsk files that needed to have cache and storage sections removed.

For what it's worth, those documents didn't specifically mention the *_HA files at all, but I figured they had to be done as well. One document recommended leaving the empty cache/...cache tag pairs in, but the other didn't mention that. I left them in anyway.

It's very interesting to note that, after removing the cache, the resync is actually twice as fast as it was before. I know that because I accidentally triggered a full resync (my fault, don't ask), and it took about half the time. I guess cache is mostly helpful for small/random I/O. Anyway, interesting.

I'll send you a couple of sets of logs, one from a couple of days ago immediately after the last "collapse", and one from today after the cache-ectomy and resync.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Fri Oct 30, 2020 9:57 am

Yes, caching can affect sync speed.
Good to know that you managed to remove it.
Yes, please share the logs. Please collect them as soon as possible after the problem occurs again.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Fri Oct 30, 2020 4:37 pm

The logs are there, as described, one pair immediately after crash, one later.

Yesterday things were looking stable so I got brave and tried manually triggering another Cluster Aware Update using the script from https://www.starwindsoftware.com/blog/w ... e-updating

But it crashed the cluster. One side synced, the other did not, with iSCSI connections being rejected as usual. I tried stopping and restarting the VSAN services; although that corrected the iSCSI connection problem, I had to force sync status on one side before sync would run (a full sync, of course).

I didn't bother collecting another log set, as you already have two. It doesn't look like the changes helped. I double-checked the logs from the CAU script; it looks like it worked and everything was synced. I'm going to try another CAU run today to ensure both hosts are patched the same.

I spent some time looking at the server event logs, and it seems clear that the problem is closely related to the iScsiPrt logout/login failures I described in https://forums.starwindsoftware.com/pos ... 64#pr31808 The only question is why that is occurring. Is it load balancing? But why all 5 channels at once? And why does it sometimes take down the cluster, but not always?

---

OK, CAU ran to completion without StarWind problems. I guess we'll leave it alone for a bit before I re-enable backups. I think I'm going to put StorageCraft ShadowProtect on both hosts, and just exclude *.img files from backup.

---kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Sun Nov 01, 2020 9:11 pm

There's been another sync failure: HV4 is reporting all channels down to HV3. That's starting to look like a pattern.

The Microsoft Failover Cluster stayed up and one StarWind host stayed up, but it's reporting all connections to the other host lost -- again. The "failed" VSAN service is talking to the Management Console, but it apparently did stop at one point, because I had to manually reconnect on one host.

Nothing has changed since the cache removal and Windows updates, although both hosts are now getting alerts that more updates are needed -- it's unclear which.

Before taking any action, I've run logs on both servers and taken screenshots of the StarWind manager screens and uploaded them. I'm going to look at the event logs and see what I can see.

Before that, I tried restarting the VSAN service on HV3, since that is the one that appears to have failed sync and connection.
I got an error 1053 (did not respond to start or control requests) trying to stop the VSAN service, and it sat in the "stopping" state. I restarted that server, and StarWind came up cleanly and in sync. The Failover Cluster was undisturbed. So there's that good thing. Although I have to wonder how the cluster was able to tolerate having one side of the CSV down -- or was it? Hmm. Not sure.

Looking at event logs, it appears that the connection losses started almost exactly 24 hours after the previous shutdown, although I'm still confirming details.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Mon Nov 02, 2020 4:46 am

Hi,

This looks like a connection loss. I will examine the logs to confirm.
The Failover Cluster could tolerate the storage going out of sync because its targets were connected properly and all I/O was redirected to the healthy node.
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Mon Nov 02, 2020 5:42 am

I have checked the log.
First, are you using one quad-port physical network card? If so, please consider adding one more physical card to meet the StarWind VSAN system requirements https://www.starwindsoftware.com/system-requirements and make your setup less vulnerable to split-brain issues.
What I noticed does indeed seem to be a network-related problem. To rule it out, please try disabling Jumbo frames on all ports:
1. Make sure that HA devices are synchronized.
2. See if all iSCSI sessions are in place.
3. Migrate VMs to server 4.
4. Stop StarWind VSAN service on server 3.
5. Remove Jumbo frames for all NICs.
6. Start StarWind VSAN service on server 3.
7. Wait for the fast sync to finish.
8. Repeat the procedure for host 4.
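Step 5 above can be scripted if there are many ports. This is only a sketch: the "*JumboPacket" registry keyword and the value 1514 (Jumbo frames disabled) are typical for Intel drivers, but keyword names and valid values vary by NIC driver, so verify them on your adapters first.

```powershell
# Sketch: disable Jumbo frames on every adapter that exposes the '*JumboPacket' keyword.
# The keyword name and the value 1514 (= standard frames) are driver-dependent assumptions.
Get-NetAdapterAdvancedProperty -RegistryKeyword '*JumboPacket' |
    ForEach-Object {
        Set-NetAdapterAdvancedProperty -Name $_.Name `
            -RegistryKeyword '*JumboPacket' -RegistryValue 1514
    }
```

Note that changing this property typically resets the NIC, so it should only be run on the node whose StarWind service is already stopped, as in the procedure above.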


Also, I noticed that cache files are still in the Management Console (those gray devices at the bottom). Here is how you can remove them.
1. Make sure that HA devices are synchronized.
2. See if all iSCSI sessions are in place.
3. Migrate VMs to server 4.
4. Stop StarWind VSAN service on server 3.
5. Make one more copy of StarWind.cfg.
6. Find these lines at the bottom of StarWind.cfg

Code: Select all

    <device file="My Computer\G\Flash-Witness\Flash-Witness.swdsk" name="imagefile1"/>
    <device file="My Computer\F\Witness\Witness.swdsk" name="imagefile2"/>
    <device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:kmhv4-witness" file="My Computer\F\Witness\Witness_HA.swdsk" serialId="23D4D44A2017F8D0" asyncmode="yes" readonly="no" highavailability="yes" buffering="no" header="65536" reservation="no" CacheMode="wb" CacheSizeMB="128" CacheBlockExpiryPeriodMS="5000" AluaNodeGroupStates="0,0" Storage="imagefile2"/>
    <device file="My Computer\G\Flash-CSV1\Flash-CSV1.swdsk" name="imagefile3"/>
    <device file="My Computer\F\CSV1\CSV1.swdsk" name="imagefile4"/>
    <device name="HAImage2" OwnTargetName="iqn.2008-08.com.starwindsoftware:kmhv4.kmsi.net-csv1" file="My Computer\F\CSV1\CSV1_HA.swdsk" serialId="F6DEE586A3D97BF0" asyncmode="yes" readonly="no" highavailability="yes" buffering="no" header="65536" reservation="no" CacheMode="wb" CacheSizeMB="1200" CacheBlockExpiryPeriodMS="5000" AluaNodeGroupStates="0,0" Storage="imagefile4"/>
7. Remove these lines
<device file="My Computer\G\Flash-CSV1\Flash-CSV1.swdsk" name="imagefile3"/>
and
<device file="My Computer\G\Flash-Witness\Flash-Witness.swdsk" name="imagefile1"/>
8. Start StarWind VSAN service on server 3.
9. Wait for the fast sync to finish.
10. Repeat the procedure for host 4.

Judging from the problem, it can be either the physical card or just Jumbo frames. Let us reduce the MTU to 1514 first and see if that rectifies the issue.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Nov 03, 2020 6:07 pm

Going through the event logs, I see an Event ID 108 from MSiSCSI saying "Status 0x00001069 determining that the device interface <…> does not support iSCSI WMI interfaces. If this device is not an iSCSI HBA then this error can be ignored."

Is this event relevant for StarWind iSCSI connections? Do I need to address it? I do not use iSCSI for any other purpose.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Nov 03, 2020 7:57 pm

Thanks, Yaroslav.

It's interesting that you should mention the jumbo frames. I did initially have a problem related to them months ago, when I replaced the network switch between the two servers, because I forgot to enable jumbo frames on the switch. Being an HP Procurve switch, it listed a lot of errors on the StarWind ports until I fixed that, but there have been zero errors logged since then. I have also run Nirsoft NetworkCountersWatch, and it shows zero errors as well, and it shows the MTU as already at 1500. If you have indications of a problem, can you please tell me how you found them?

But it's worth noting that this cluster ran without problems for a couple of years, with jumbo packets, L1 and L2 cache, etc. The problems only started recently.

What makes me question whether this is a network configuration issue is that, when the problem occurs, it occurs suddenly -- within a couple of seconds -- across all iSCSI connections, whether or not they go through the switch or sit on the quad-port NIC. And on at least one occasion it was immediately preceded by an iScsiPrt logout. What the heck could cause such a thing? The event log messages seem to point to some sort of authentication issue.

To review, here is the current network configuration. These servers each have 5 Intel Gigabit Ethernet interfaces, with four of them on a quad port NIC and one on the motherboard. All the ports on the quad NICs are dedicated to StarWind; the onboard NICs (one on each host server) are used for host management, Internet, etc.

The quad NICs are arranged in four pairs across the two hosts, with each NIC on one host in a separate subnet corresponding to the same subnet on the other host. So:

Host 1 (KMHV3) <--> Host 2 (KMHV4)
10.4.27.81 <-via switch-> 10.4.27.82 -- management (Onboard port)
10.4.33.81 <-via switch-> 10.4.33.82 -- StarWind (Quad port)
10.4.34.81 <-via switch-> 10.4.34.82 -- StarWind (Quad port)
10.4.35.81 <-via switch-> 10.4.35.82 -- StarWind (Quad port)
10.4.36.81 <-direct-> 10.4.36.82 -- StarWind (Quad port)

All of the Ethernet connections go through an HP Procurve managed switch except for the last, which is patched directly between the servers. I do that because the switch gives me more diagnostic information than modern Intel NIC drivers provide, like packet errors and such, and it also avoids some bogus error statuses.

In any case, I have no problem disabling jumbo packets -- even though this cluster ran fine with them for a couple of years. Done.

Before I remove the cache "stubs" showing in the console, let me ask: once this stability issue is resolved, I was thinking of re-implementing at least a bit of Level 1 cache. Does that make sense to you at all? I do run a server-grade UPS.

Oh, and as for determining whether the iSCSI settings are correct: I've already looked them over carefully, and don't see anything wrong, but I'm no iSCSI expert. Is there any simple way to confirm? I'm going to send you some screenshots, as that's all I can find for config logging.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Nov 03, 2020 8:14 pm

Just by the way, I ran the Microsoft Failover Cluster Manager validation wizard on my network configuration, and it came up 100% fine. Is there any sort of validation we can run for the StarWind side, particularly the iSCSI aspect?

OH! And it's worth noting: if you occasionally see an error in the logs for adapter port #4 -- that's the port that's patched directly between the servers. You'll get errors like that on one server when the other is shut down; it sees the port as disconnected. Another reason I like keeping a switch between them.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Wed Nov 04, 2020 5:16 am

I often see networking behave like that. Also, I can see that Jumbo frames are enabled on the iSCSI links, and there are entries pointing to network disconnections in the Application logs (logged on behalf of the StarWind Service).
This implies that there is something wrong with the networking. Problems with MTU can rear their head randomly: I often see setups that work well with Jumbo frames for years, and then problems start after some update or without any good reason. Whatever is happening to that NIC affects all connections on that NIC. What you can do is run ping -t over all channels into files. After the issue occurs, just check whether there are any delays or packet losses.
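The ping suggestion can be set up along these lines. A sketch only: the partner-node addresses are the ones from the network layout posted earlier in the thread, and the log directory C:\PingLogs is an arbitrary choice.

```powershell
# Sketch: run a continuous ping to each StarWind channel, one log file per link.
# Target IPs are the partner-node addresses listed earlier in the thread (run from HV3).
$targets = '10.4.33.82', '10.4.34.82', '10.4.35.82', '10.4.36.82'
New-Item -ItemType Directory -Path 'C:\PingLogs' -Force | Out-Null
foreach ($ip in $targets) {
    Start-Process -FilePath 'ping.exe' -ArgumentList "-t $ip" `
        -RedirectStandardOutput "C:\PingLogs\$ip.log" -WindowStyle Hidden
}
# After the issue recurs, scan the logs for "Request timed out" or unusually high times.
```

One caveat: ping -t output is not timestamped, so correlating a dropout with the event-log entries may require noting when the capture was started.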

Please note that it is strongly recommended to add at least one more NIC for StarWind Heartbeat connection; one network card can malfunction leading to split-brain and is against our system requirements https://www.starwindsoftware.com/system-requirements.

L1 cache is good, but when used in write-back mode, a BSOD or hard reboot of the server can cause problems (i.e., a full sync). Furthermore, it tends to make full syncs longer. So I'd recommend leaving it disabled unless there are performance issues. And if you have some, I will be happy to assist you.

In the StarWind logs, there is an iscsi-report file. I have reviewed it, and it seems that the iSCSI targets are connected correctly.

In the folder, I noticed there are screenshots showing the connectivity issue. Are they recent?
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Wed Nov 04, 2020 4:48 pm

Thank you, Yaroslav, for that most interesting answer.

Yes, jumbo frames are now fully disabled, and no further issues since -- so far, anyway.

This isn't a heavily loaded system. It's fascinating that jumbos can cause such problems but, I have to say, not really surprising. I've never been a big fan of iSCSI either, but it doesn't seem to be optional in this case. I may implement some sort of network monitoring, as you suggest. I have some options.

Thank you for clarifying that the iSCSI initiators are correctly configured!

Please note, I DO have more than one network card: there is the one-port onboard NIC and the quad NIC. And I'm pretty sure the onboard NIC and one of the quad-port NICs are configured correctly as redundant heartbeat NICs, are they not? Does that change your comments at all?

One thing that really puzzled me was that every time the VSAN went down, ALL channels -- on both NICs -- were affected. I could never figure out why that would be, unless it was an authentication issue. All ports and NICs did have jumbo packets enabled, though they no longer do.

Yes, the screenshots I sent were all the most recent, after the last issue. I removed jumbo packets immediately after that, haven't had another issue yet.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2279
Joined: Mon Nov 18, 2019 11:11 am

Wed Nov 04, 2020 5:43 pm

Hi,

Great news!!! Please let us monitor this setup for a while. Unfortunately, I cannot recommend any monitoring solutions off the top of my head. We have a ProActive Support (https://www.starwindsoftware.com/starwi ... ve-support) offering, but you'll need to purchase support from us and, I guess, a paid license.
Yes, Jumbo frames can sometimes do such things.
"There is the one-port onboard NIC, and the quad NIC."
Thanks for the additional info; that's what can be called redundant networking.

Feel free to contact us if additional assistance is required.