How to get maintenance status

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Wed Nov 04, 2020 6:45 pm

just FYI, for monitoring I am a big fan of Paessler's PRTG.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Wed Nov 04, 2020 7:04 pm

I have heard about this solution, but I have no hands-on experience with PRTG, so I am not sure how to monitor StarWind VSAN with it.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Wed Nov 04, 2020 7:38 pm

It has multiple ways of monitoring, including pings, SNMP, WMI, and so on. The user interface is pretty good and easy to get started with.

There is a free download for up to 10 nodes.

https://www.paessler.com/

https://www.paessler.com/san-monitoring
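As a rough illustration of the kind of probe a monitor like PRTG automates, here is a minimal Python sketch (the node addresses in the comment are hypothetical placeholders, not real configuration) that checks whether a node's iSCSI listener answers on TCP 3260:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage (hypothetical node addresses): probe the iSCSI listener on each node.
# for node in ("172.16.10.1", "172.16.10.2"):
#     print(node, "up" if port_is_open(node, 3260) else "DOWN")
```

A real monitoring tool layers scheduling, alerting, and history on top of probes like this; the sketch only shows the basic reachability check.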

FYI
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Thu Nov 05, 2020 7:17 pm

Thanks for sharing the info!
I saw it in some setups, but I have no details on whether they monitor anything StarWind-related there.
I guess you can apply PRTG to HA devices (they are disks, after all), the underlying storage, networking, etc. Everything that seems critical.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Fri Nov 06, 2020 11:34 pm

As I mentioned in the other thread on caching, things did not go well here, but it's not clear why. I thought things were looking stable, and was mucking around trying to do updates on a non-clustered secondary DC VM which is not on the StarWind VSAN either, so it shouldn't have been a problem. But the timing of the problem seems suspiciously related to when I was doing that.

Anyway, I stopped that backup schedule again, and I'll run a couple of logs and send them to you, but maybe I should just leave things to stabilize before digging deeper.

For what it's worth, this again was a case where all iSCSI connections, in both directions, went down (except apparently for one of the five connections, in one direction only -- the only one that did not go through the switch! Hmm...). And restarting the VSAN services on both hosts brought them all back up. Doing full resync now.

I'm tempted to bring the old Ethernet switch back (it's just a slightly older 24-port HP Procurve), or jumper directly between all of the quad port NICs. Does StarWind recommend patching Ethernet directly between hosts?

--kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Sat Nov 07, 2020 10:50 am

We recommend using direct links for iSCSI and sync as you currently have.
I will check the logs to see what could have caused this issue.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Sat Nov 07, 2020 6:41 pm

OK, thanks! Sync is now healthy. I'm not going to touch a thing for a few days at least, then will switch to direct links: one cable at a time, watching to ensure health before doing the next.

Would you recommend stopping VSAN services while I do this?

Oh, by the way, I found a great little free tool for monitoring network interface statistics. It's called "NetworkCountersWatch" from NirSoft. They have a lot of good stuff!

-- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Sun Nov 08, 2020 7:30 pm

Thanks for sharing a tool with us!
You'd better stop the StarWind VSAN service on one side, just in case.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Mon Nov 09, 2020 12:21 am

Oops, too late. It crashed again yesterday afternoon for no apparent reason, starting with event log entries about iSCSI logon failures (error 10), as has happened a number of times lately. I cannot figure out why.

Anyway, I shut both nodes down and re-cabled everything. Which was probably just as well, because when I did that, I discovered that a couple of ports on the quad NIC weren't what I thought they were. The driver naming was inconsistent, and patching straight across the same ports on both NICs meant their IPs didn't line up. I fixed that, swapping IPs on one side, and now StarWind is fully synced up, and both it and the FCM are apparently happy with the network.

EXCEPT: now I can't get the CSV disk to come online in the FCM. I get error code 0x80070046, "The remote server has been paused or is in the process of being started." There is also an error 5142 saying that the volume is no longer accessible from the cluster node because of error '(1460)'.

I'm sending you diagnostic files, see your PM.

It's weird: the iSCSI targets all show connected, both in the iSCSI Initiator Properties and in the StarWind console.

-- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Mon Nov 09, 2020 4:43 am

Actually, these problems with the NIC could be the cause of network flapping.
Is the issue happening on both nodes? Please restart the cluster service or the servers one by one. Here is a user with the same problem: https://social.technet.microsoft.com/Fo ... rverhyperv.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Mon Nov 09, 2020 11:30 pm

I've double-checked the NICs, and they look rock-solid. Continuous pings on all ports run under 1 ms, with zero timeouts. All diagnostics, including both StarWind and MS Cluster, say the networks are good, zero complaints.

The MS Cluster Validation Report says the same on both hosts: the Validate Disk Failover test fails. That is the ONLY fatal error. The exact error message is "Node kmHV3.kmsi.net holds the SCSI PR on Test Disk 0 and brought the disk online, but failed in its attempt to write file data to partition table entry 1. The group or resource is not in the correct state to perform the requested operation."

BTW, I sent you the validation report output, but it's far easier to read if you open it with a web browser! I sent you another copy renamed.

One thing I think may be helpful is that the Windows disk manager shows the disk statuses differently on the two hosts, particularly on the Witness1 volume. I'm going to send you screenshots for both hosts so you can see what I mean. Ignore the SSD Cache volume; it's not currently used.

What concerns me is that the Witness disk on HV4 (but not HV3) is offline and does not show a drive letter. I don't recall that behaviour before. It appears StarWind creates and mounts that disk device.

I've already stopped and rebooted everything (no help), but doing it again...
Restarting the VSAN service on HV4 -- no help, but it does resync immediately.
Restarting the VSAN service on HV3 -- no help, but it does resync immediately. (i.e., same as HV4)
Restarting the Cluster service on HV4 -- no change
Restarting the Cluster service on HV3 -- no change

That's about all I have for now. I'll PM you on the uploads.

Oh, BTW, searching online for that message in the validation report, I see mention of the Automount feature being turned off. Could swear I saw a setting for that somewhere...

--- kenw
Last edited by wallewek on Tue Nov 10, 2020 8:10 pm, edited 1 time in total.
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Tue Nov 10, 2020 4:23 am

The Witness disk actually should not have a drive letter. I will review the logs.
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Tue Nov 10, 2020 8:28 pm

Just for info, I ran continuous pings on all 5 Ethernet ports for about a day. The result was about 400,000 total pings with 0 drops; average round-trip time was 0 ms on all ports, and the maximum round-trip time ranged from 4 ms to 9 ms.
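For what it's worth, tallying stats like these from a ping log boils down to a drop count plus average and maximum over the replies. A minimal sketch (the RTT samples here are made-up illustrations, with None marking a dropped ping):

```python
def summarize_rtts(samples):
    """Summarize ping results: samples is a list of RTTs in ms,
    with None marking a dropped (timed-out) ping."""
    replies = [s for s in samples if s is not None]
    drops = len(samples) - len(replies)
    return {
        "sent": len(samples),
        "drops": drops,
        "avg_ms": sum(replies) / len(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }

# Hypothetical samples, not real measurements:
print(summarize_rtts([0, 0, 1, None, 4, 0]))
# {'sent': 6, 'drops': 1, 'avg_ms': 1.0, 'max_ms': 4}
```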

NetworkCountersWatch shows one input error on one port on one host (HV4). Might be from when I was stopping/restarting services.

I'd say that's pretty solid.

--- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra
yaroslav (staff)
Staff
Posts: 2333
Joined: Mon Nov 18, 2019 11:11 am

Wed Nov 11, 2020 8:27 am

Yeah indeed! Is StarWind VSAN working well? I mean no heartbeat drops to date?
wallewek
Posts: 114
Joined: Wed Sep 20, 2017 9:13 pm

Wed Nov 11, 2020 10:34 pm

Unfortunately, the StarWind VSAN dropped all connections again. Based on what I'm seeing, it won't stay up more than 24 hours, and the symptoms are looking pretty consistent.

I'm becoming convinced this is NOT a network issue, at least not at the Ethernet level. It's an iSCSI authentication issue. This last crash occurred at 4:21:33 PM, almost exactly 24 hours after the last "synchronization complete" logged in the StarWind event log, and there is an iScsiPrt error 10 (login request failed) at the exact same time in the Administrative Events, apparently originating in the System event log.

Interestingly, immediately before that, in the HV4 System log (but not on HV3), there is an iScsiPrt informational event, status 32, showing "The initiator received an asynchronous logout message". It starts a whole cascade of iScsiPrt- and MPIO-related messages.

What I want to know is, why is StarWind on one host trying to logout its iSCSI connections exactly 24 hours after a synchronization complete?
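One way to test that 24-hour hunch is to pull the timestamps of the iScsiPrt error-10 events out of an event-log export and look for near-24-hour gaps, which would point to a scheduled trigger or session timeout rather than flaky cabling. A sketch, using illustrative timestamps rather than the real logs:

```python
from datetime import datetime, timedelta

def intervals_near_24h(timestamps, tolerance=timedelta(minutes=5)):
    """Return the gaps between consecutive events that fall within
    tolerance of exactly 24 hours."""
    ts = sorted(timestamps)
    day = timedelta(hours=24)
    return [b - a for a, b in zip(ts, ts[1:]) if abs((b - a) - day) <= tolerance]

# Hypothetical event times (not the actual log entries):
events = [
    datetime(2020, 11, 10, 16, 20, 50),
    datetime(2020, 11, 11, 16, 21, 33),
]
print(intervals_near_24h(events))  # one near-24h gap suggests a periodic trigger
```

If the real error-10 timestamps cluster at almost exactly 24-hour spacing, that would support the "something scheduled" theory over a hardware one.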

Further details:

After a crash, all Image File devices on both hosts show in the StarWind Management Console with all synchronization channels down and zero iSCSI sessions. Both Witness and CSV on HV3 show as synchronized, but on HV4 both are not synced.

I restarted HV4, but it then showed as not synced either, and the iSCSI connections didn't come up.
I restarted HV3; the links came up and it started to synchronize (full sync).
So it looks like HV3 was the one that dropped the iSCSI connections and needed its VSAN service restarted to reconnect.

NOTE: So far, other than taking logs and implementing enhanced ping monitoring, all I've done is restart the two VSAN services.

Ping tests to all ports are still fine on both hosts, and I haven't touched anything else yet. However, I had stopped the continuous pings yesterday after providing the stats to you. I have now started cross-host pings on both hosts, all five ports (using Ping Checker for logging), and will leave them running continuously, just in case. Non-clustered VMs (the DCs) on both hosts are running just fine, with no apparent disruption, if that's worth something.

I pulled StarWind logs on both hosts before trying anything other than pings, will send them to you.

The Failover Cluster still shows the CSV disk offline, no change there. It's showing error code 0x80070046, "The remote server has been paused or is in the process of being started." The Witness disk is fine, though.

The saga continues...

-- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra