Node Disconnects and HA replication path lost
Posted: Tue Jul 17, 2018 1:07 am
I have the following setup in a DEV/LAB environment.
Two Dell R510s running Server 2016 with StarWind V8.0.12166
2x Xeon X5660
32GB RAM
PERC H700 with 12x 4TB SAS in RAID10
1x Intel 1.6TB NVMe
2x 1GbE ports for iSCSI Targets
2x 10GbE Mellanox ConnectX-3 for Sync
Last week I ran into an issue where the node I have named "SAN01" marked all of its sync channels as "offline" and then started logging the following in the event logs:
HA Device iqn.2008-08.com.starwindsoftware:###-san01-###-##-#####: partner node iqn.2008-08.com.starwindsoftware:###-san02-###-##-##### state has changed to "Not synchronized".
I tried to run the PowerShell script for performing a synchronization on all disks, but that did not work. As I'm still in the trial period, I fired up the GUI, only to find that it would not connect to the SAN01 node.
As a last-ditch effort I rebooted SAN01, but it hung for close to 12 hours. At that point another engineer rebooted the system and we had disk corruption, which I'm not blaming anyone but us for. (This is storage and it's not easy.)
However, today I'm encountering the same issue on one of my disks. I have collected my system logs and was wondering if someone could take a look and maybe let me know what I'm doing wrong?