Please help. Sync channels down

Davis
Posts: 24
Joined: Tue Jan 23, 2018 10:12 am

Thu Jan 17, 2019 2:33 pm

Hi all,
I have a 2-node hyper-converged cluster.
Each node has four 1Gb NICs:
1 - clients + HB
2 - cluster + HB
3 - iSCSI + Sync
4 - Sync

3 and 4 are on a single dual-port NIC, connected directly by cable without a switch.

Today I tried to replace those NICs with new Mellanox ConnectX-3 cards, also dual-port, so networks 3 and 4 should become 10Gb.
I paused node 1 (node 2 keeps running all the VMs), then stopped and disabled the StarWind service.
I turned off the system and replaced the NIC.
I temporarily connected the 2/3 NICs of both nodes to a 10Gb switch.
I turned the system on and configured the new NIC: IPs, jumbo frames, NetBIOS, Windows server/client.
Then I enabled and started the StarWind service.
And now I'm stuck:

Code:

Host: HVC1-1
JStarWind Virtual SAN v8.0.0 (Build 11456, [SwSAN], Win64)
Target:iqn.2008-08.com.starwindsoftware:hvc1-1-storage1
Initiators connected:
iqn.2008-08.com.starwindsoftware:hvc1-2-storage1
iqn.2008-08.com.starwindsoftware:hvc1-2-storage1
iqn.2008-08.com.starwindsoftware:hvc1-2-storage1
iqn.2008-08.com.starwindsoftware:hvc1-2-storage1
Devices:
Name                                : HAImage1
File                                : My Computer\D\StarWind\Storage1\Storage1_HA.swdsk
ha_synch_status                     : 3
ha_synch_status_desc                : Not synchronized
ha_wait_on_autosynch                : 0
ha_synch_percent                    : 0
ha_sync_estimated_time              : 0
ha_partner_node1_heartbeat_channels : 10.20.1.2:3260 UP
                                      192.168.0.182:3260 UP
ha_partner_node1_sync_channels      : 10.20.2.2:3260  DOWN
                                      10.20.3.2:3260  DOWN
Device not synchronized. Synchronize current node from partner manually!

Host: HVC1-2
JStarWind Virtual SAN v8.0.0 (Build 11456, [SwSAN], Win64)
Target:iqn.2008-08.com.starwindsoftware:hvc1-2-storage1
Initiators connected:
iqn.1991-05.com.microsoft:hvc1-1.ad.tv5.zp.ua
iqn.1991-05.com.microsoft:hvc1-2.ad.tv5.zp.ua
iqn.2008-08.com.starwindsoftware:hvc1-1-storage1
iqn.2008-08.com.starwindsoftware:hvc1-1-storage1
Devices:
Name                                : HAImage1
File                                : My Computer\D\StarWind\Storage1\Storage1_HA.swdsk
ha_synch_status                     : 1
ha_synch_status_desc                : Synchronyzed
ha_wait_on_autosynch                : 0
ha_synch_percent                    : 100
ha_sync_estimated_time              : 0
ha_partner_node1_heartbeat_channels : 192.168.0.181:3260 UP
                                      10.20.1.1:3260 UP
ha_partner_node1_sync_channels      : 10.20.2.1:3260 UP
                                      10.20.3.1:3260 UP
HVC1-2 is OK
While the second node seems OK, the first has both sync channels down.
When I tried to run manual synchronization via GetHASyncState.ps1, I got:

Code:

Device not synchronized. Synchronize current node from partner 'iqn.2008-08.com.starwindsoftware:hvc1-2-storage1'
Exception Exception calling "ExecuteCommand" with "3" argument(s): "Error:
200 Failed: connection with partner node is invalid.. "
I can ping both nodes on all interfaces and IPs.
I also tried enabling and then disabling jumbo frames on both nodes.

Please tell me - what is wrong here?
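
A note on the ping result: a successful ping only shows that ICMP gets through, while the sync channels in the output above are listed on TCP port 3260. A minimal sketch of a TCP-level check, assuming Python 3 is available on the node (the function name `tcp_port_open` is made up for illustration and is not part of any StarWind tooling):

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, unlike ping (ICMP).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `tcp_port_open("10.20.2.2", 3260)` run on HVC1-1 would show whether the partner's first sync address accepts TCP connections at all. A successful connect would still say nothing about RDMA-level problems.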
Davis

Fri Jan 18, 2019 4:49 pm

What is the best way to replace the NICs?
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Jan 18, 2019 4:53 pm

What is the output of the ping command with the -t switch on those connections between the 10Gbps and 1Gbps NICs? Can you run ping for some considerable amount of time like a couple of hours? Are any packets lost? What is the latency?
Davis

Fri Jan 18, 2019 5:11 pm

As I said, ping is fine. One node even connects to the other, but not vice versa:

Code:

1.
ha_partner_node1_sync_channels      : 10.20.2.2:3260  DOWN
                                      10.20.3.2:3260  DOWN
2.
ha_partner_node1_sync_channels      : 10.20.2.1:3260 UP
                                      10.20.3.1:3260 UP
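
The asymmetry shown above can also be pulled out of the script output automatically. A rough sketch in Python 3, assuming the output keeps the `key : address:port STATE` layout with indented continuation lines as pasted in this thread (the helper names `parse_channels` and `down_channels` are hypothetical):

```python
import re

# One channel entry, e.g. "10.20.2.2:3260  DOWN"
CHANNEL = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\s+(UP|DOWN)\b")
# A key such as "ha_partner_node1_sync_channels      :"
KEY = re.compile(r"^(ha_partner_\w+_channels)\s*:")


def parse_channels(text):
    """Map each *_channels key to a list of (address, port, state) tuples."""
    result = {}
    current = None
    for line in text.splitlines():
        key = KEY.match(line)
        if key:
            current = key.group(1)
            result[current] = []
        elif not line[:1].isspace():
            # Any other non-indented (or empty) line ends the continuation block.
            current = None
        if current is not None:
            m = CHANNEL.search(line)
            if m:
                result[current].append((m.group(1), int(m.group(2)), m.group(3)))
    return result


def down_channels(text):
    """Return (key, address, port) for every channel reported as DOWN."""
    return [
        (key, addr, port)
        for key, channels in parse_channels(text).items()
        for addr, port, state in channels
        if state == "DOWN"
    ]
```

Fed the HVC1-1 output, `down_channels` would list both sync addresses; fed the HVC1-2 output, it would return an empty list.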
Boris (staff)

Fri Jan 18, 2019 5:33 pm

Stop the StarWind service on the node with the 10Gbps NICs, locate the iSER_DM.dll file in the StarWind installation folder and rename it (e.g. to iSER_DM.dll.bak). Start the StarWind service again and report the result.
Davis

Sat Jan 19, 2019 7:28 am

Thank you, Boris.
Yesterday I also had this idea about RDMA/RoCE.
I have already reverted to the old NIC to restore cluster HA.
On Monday I will make a new attempt at the upgrade using your recommendation.
Last edited by Davis on Sat Jan 19, 2019 8:51 am, edited 1 time in total.
Davis

Sat Jan 19, 2019 7:40 am

Hm. Suppose the renaming helps: how can I then switch back to normal iSER mode after both NICs have been replaced, without stopping the cluster? I would need to restart both services...
P.S. Can iSER be turned on/off without restarting the StarWind service?
Maybe via
Set-MlnxDriverCoreSetting rocemode 0
or Enable-NetAdapterRdma
?
Oleg(staff)
Staff
Posts: 568
Joined: Fri Nov 24, 2017 7:52 am

Tue Jan 22, 2019 9:51 am

Hi Davis,
You should be able to switch to normal iSER mode.
To keep the production environment working in case this plan does not work, please start with the node that has the higher priority on the StarWind devices.
Davis

Tue Jan 22, 2019 2:09 pm

Fortunately I have two sync channels.
So I did the upgrade this way:
- Moved IP3 to NIC2 as an alternate address on both nodes. They survived this normally because I have a secondary sync channel on NIC4.
- Shut down and replaced the NICs (3+4) on both nodes, one node at a time.
- Set up the NIC4 addresses on both nodes, so sync2 became operational.
- Returned IP3 to NIC3.

How can I be sure that iSER is being used now?
Oleg(staff)

Wed Jan 23, 2019 2:17 pm

Actually, you can find this information in the StarWind log.
Davis

Sun Jan 27, 2019 1:20 pm

I grepped the logs for iSER:

node1:

Code:

Line 8: 1/27 13:26:00.502 580 Srv: Loading iSER_DM module...
Line 42: 1/27 13:26:00.705 580 conf: Variable 'iSerListen' is set to ''.
Line 327: 1/27 13:26:09.611 e0c iSERs: Created QueuePair (recv 264, init 1056, cq 1320, group 0, affinity 0xfff).
Line 328: 1/27 13:26:09.611 e08 iSERs: Created QueuePair (recv 264, init 1056, cq 1320, group 0, affinity 0xfff).
Line 329: 1/27 13:26:09.611 e08 iSERs: IND2Connector::Connect failed with c0000236
Line 330: 1/27 13:26:09.611 e08 iSERs: iSerDmSocket::Connect failed with c0000236!
Line 331: 1/27 13:26:09.611 e0c iSERs: IND2Connector::Connect failed with c0000236
Line 332: 1/27 13:26:09.611 e0c iSERs: iSerDmSocket::Connect failed with c0000236!
Line 432: 1/27 13:26:37.332 580 iSERs: iSER_DM: Setting logLevel 0x1, logMask 0xbfffffff
These lines repeat after each boot.

node2 shows only these lines:

Code:

1/22 14:15:56.617 7a0 iSERs: NetDirect providers are not found.
1/22 14:15:56.617 7a0 Srv: iSER_DM is not loaded.
1/22 14:15:56.883 7a0 conf: Variable 'iSerListen' is set to ''.
1/22 14:16:01.117 7a0 iSERs: NetDirect providers are not found.
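
To compare the two nodes more easily, the iSER-related log lines can be counted by type. A small sketch in Python 3, assuming the log line wording shown above. The function name `summarize_iser_log` is hypothetical, and the only status code mapped is the one from this log: `c0000236` is the Windows NTSTATUS value `STATUS_CONNECTION_REFUSED`.

```python
import re
from collections import Counter

# Only the NTSTATUS code that appears in this thread is mapped here.
NTSTATUS_NAMES = {0xC0000236: "STATUS_CONNECTION_REFUSED"}

# Message fragments copied from the log excerpts in this thread.
PATTERNS = {
    "module_loading": re.compile(r"Loading iSER_DM module"),
    "module_not_loaded": re.compile(r"iSER_DM is not loaded"),
    "no_providers": re.compile(r"NetDirect providers are not found"),
    "connect_failed": re.compile(r"Connect failed with ([0-9a-fA-F]{8})"),
}


def summarize_iser_log(lines):
    """Count iSER-related events and the status codes of failed connects."""
    events = Counter()
    codes = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            events[name] += 1
            if name == "connect_failed":
                code = int(match.group(1), 16)
                codes[NTSTATUS_NAMES.get(code, f"0x{code:08X}")] += 1
    return events, codes
```

Read this way, node1 loads the iSER_DM module but its connection attempts are refused, while node2 reports no NetDirect providers and does not load the module, which suggests iSER is not actually in use between the nodes.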
Oleg(staff)

Wed Jan 30, 2019 2:51 pm

Could you please clarify whether you are using Mellanox ConnectX-3 or Mellanox ConnectX-3 Pro network cards?
Mellanox ConnectX-3 has only partial support for iSER.
Davis

Wed Jan 30, 2019 2:54 pm

MCX312B-XCCT ConnectX®-3 Pro EN
Oleg(staff)

Wed Jan 30, 2019 4:25 pm

Please try the following:
import-module StarWindX
$s = New-SWServer 127.0.0.1 3261 root starwind
$s.Connect()
#get list of interfaces for iSER
$addrs = $s.GetServerParameter("iSerListen")

#set list of interfaces for iSER
$s.SetServerParameter("iSerListen",$addrs)

PS C:\> $sw = New-SWServer 3261 root starwind
Davis

Wed Jan 30, 2019 5:03 pm

$s.GetServerParameter("iSerListen")
returns an empty value.