Wed Nov 11, 2020 10:34 pm
Unfortunately, the StarWind VSAN dropped all connections again. Based on what I'm seeing, it won't stay up more than 24 hours, and the symptoms are looking pretty consistent.
I'm becoming convinced this is NOT a network issue, at least not at the Ethernet level. It's an iSCSI authentication issue. This last crash occurred at 4:21:33 PM, almost exactly 24 hours after the last "synchronization complete" logged in the StarWind event log, and there is an iScsiPrt error 10 (login request failed) at the exact same time in the Administrative Events, apparently originating in the System event log.
Interestingly, immediately before that, in the HV4 System log (but not on HV3), there is an iScsiPrt informational event status 32, showing "The initiator received an asynchronous logout message". It starts a whole cascade of iScsiPrt and MPIO-related messages.
What I want to know is, why is StarWind on one host trying to log out of its iSCSI connections exactly 24 hours after a synchronization complete?
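For anyone who wants to test the 24-hour pattern against their own logs, here is a minimal Python sketch of the arithmetic. It only compares two timestamps copied by hand from the event logs; it does not read the StarWind or Windows logs itself. The function name, the 5-minute tolerance, and the "synchronization complete" time in the example are all assumptions for illustration (only the 4:21:33 PM drop time comes from the post above).

```python
from datetime import datetime, timedelta


def is_near_period(sync_complete, drop,
                   period=timedelta(hours=24),
                   tolerance=timedelta(minutes=5)):
    """Return True if `drop` happened roughly one `period` after `sync_complete`.

    Both arguments are datetime objects typed in from the event logs.
    The tolerance is an arbitrary choice, not a StarWind or iSCSI setting.
    """
    gap = drop - sync_complete
    return abs(gap - period) <= tolerance


if __name__ == "__main__":
    # Drop time as reported in the post (date assumed to be the posting date).
    drop = datetime(2020, 11, 11, 16, 21, 33)
    # HYPOTHETICAL placeholder: replace with the real "synchronization
    # complete" timestamp from the StarWind event log.
    sync_complete = datetime(2020, 11, 10, 16, 21, 0)

    print("gap:", drop - sync_complete)
    print("about 24 hours apart:", is_near_period(sync_complete, drop))
```

If several crashes line up this way, that points at a timer or session lifetime rather than random link loss; if they don't, the 24-hour spacing may be coincidence.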
Further details:
After a crash, all image files on both hosts show in the StarWind MC with all synchronization channels down and zero iSCSI sessions. Both Witness and CSV on HV3 show as synchronized, but on HV4 both are not synced.
I restarted HV4, but it then showed as not synced either, and the iSCSI connections didn't come up.
I restarted HV3, the links came up, and it started to synchronize (full sync).
So it looks like HV3 was the one that dropped the iSCSI connection, and was the one that needed to have its VSAN restarted to reconnect.
NOTE: So far, other than taking logs and implementing enhanced ping monitoring, all I've done is restart the two VSAN services.
Ping tests to all ports are still fine on both hosts, and I haven't touched anything yet. However, I had stopped the continuous pings yesterday after providing the stats to you. I have now started cross-host pings on both hosts, all five ports (using Ping Checker for logging), and will leave them running continuously. Just in case. Non-cluster VMs (DCs) on both hosts are running just fine, with no apparent disruption, if that's worth something.
I pulled StarWind logs on both hosts before trying anything other than pings, and will send them to you.
The Failover Cluster still shows the CSV disk offline, no change there. It's showing error code 0x80070046, "The remote server has been paused or is in the process of being started." The Witness disk is fine, though.
The saga continues...
-- kenw
------------------------
"In theory, theory and practice are the same, but in practice they're not." -- Yogi Berra