Is there a way to force iSER connections? (iSER Troubleshooting)


SiegfriedB
Posts: 13
Joined: Thu Oct 12, 2023 6:20 pm

Fri Mar 01, 2024 9:25 am

I am currently building my first Hyper-V HCI cluster, and in my testing iSER sometimes works and sometimes doesn't. When it works (that is, both nodes show iSER as enabled in the management console), all is fine, but when it doesn't, it is very difficult to get it to work.

What I noticed is that rather than forcing iSER, vSAN seems to use iSER opportunistically. It will try to use iSER but if that fails, it will fall back to non-iSER connections (on the sync interfaces that is).

This opportunistic check for whether iSER should be used seems to happen only when the vSAN service starts. It does not seem to try to activate iSER later if it didn't work at service start.

Though iSER works sometimes in my test environment, I have encountered the following problems:
  1. iSER is not used at all. All sync interfaces show iSER as disabled in the management console.
  2. iSER only works in one direction. My setup has two nodes and sometimes one node (say: node 1 for example) does have an iSER connection to node 2 but node 2 only has a non-iSER connection to node 1.
One reason iSER does not work seems to be that a lingering "zombie" process stays behind when I restart the service. That process keeps listening on the iSCSI port (3260, bound via RDMA because iSER runs over RoCE v2), which prevents vSAN from listening on that port once I start the service back up.

Code: Select all

>netstat -xan | select-string 'PID|3260'

  Mode   IfIndex Type           Local Address          Foreign Address        PID
  User        18 Listener       192.168.51.1:3260      NA                     2964
  User         9 Listener       192.168.52.1:3260      NA                     2965
This is an example: a zombie process (starwindservice.exe) is listening on a sync interface (192.168.51.1) via RDMA (netstat -x), and yet another process is listening on the other sync interface (192.168.52.1), even though the StarwindService service is not running. When I start the service, a new process is spawned that cannot bind to the socket/port. In such a case, vSAN logs this error:

Code: Select all

2/28 21:04:53.451695 d90 iSER: *** iSerDmServerSocket::Listen: IND2Listener::Bind failed with c0000043
2/28 21:04:53.451715 d90 Srv: *** iScsiServer::listenConnections: iSER listen to 192.168.52.1:3260 failed with 3221225539 (0xc0000043).
By the way, in such a case, I cannot kill the zombie process (starwindservice.exe) without rebooting the computer. Task manager tells me that I do not have the permission to end the process. I ran task manager with administrator privileges.
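For reference, the two numbers in the second log line are the same status value: 3221225539 in decimal is 0xC0000043 in hex, which corresponds to the NTSTATUS code STATUS_SHARING_VIOLATION. That would be consistent with another process still holding the listener. A minimal sketch of the conversion in PowerShell (nothing here is vSAN-specific, and the variable names are only illustrative):

```powershell
# Decimal status copied from the vSAN log line above.
$status = 3221225539

# Format it as 8-digit hex so it can be compared with the value in parentheses.
$hex = '0x{0:X8}' -f $status
$hex
```

This prints 0xC0000043, matching the hex value that the log shows for the failed Bind call.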

I wonder,
  • Is there a way to force iSER connections? That is, to tell vSAN to not fall back to a regular iSCSI connection on the sync interface if an iSER connection cannot be established right now?
  • Will vSAN opportunistically try to establish an iSER connection continuously, if that failed during startup, and a regular iSCSI connection was used for syncing?
  • Is there a way to kill the zombie process without rebooting (if the vSAN service is already stopped)?
  • Is it on purpose that the starwindservice.exe process is protected from being killed?
  • What could be the reason why the zombie process is left behind anyway, if the service is stopped?
    I tried using both of these methods:

    Code: Select all

    get-service StarwindService | restart-service
    and

    Code: Select all

    & 'C:\Program Files\StarWind Software\StarWind\ServiceLaunchMgr.exe' /servicename:starwindservice /operation:restart
    In both cases, sometimes it doesn't seem to properly shut down and end the process.
yaroslav (staff)
Staff
Posts: 2359
Joined: Mon Nov 18, 2019 11:11 am

Fri Mar 01, 2024 9:50 am

Thanks for sharing your observations and questions.
Can you please tell me more about the networks and the setup?
Which NICs do you use and is it a bare-metal deployment or is it a nested cluster?
SiegfriedB
Posts: 13
Joined: Thu Oct 12, 2023 6:20 pm

Sat Mar 02, 2024 12:54 pm

Hi, thank you for your reply!
The setup is:
  • Windows vSAN v8.0.0.0 build 15260 on bare metal
  • (free) Hyper-V Server 2019 with all updates installed
  • AMD EPYC 7302P on Supermicro H12SSL-NT

Each Intel adapter is set up like this:
  • Network cabling: Direct connection via DAC copper cable (1m length) to the other node.
  • Latest driver (29.0, dated 2024-02-14) and firmware (NVM version 4.64)
  • Net Direct provider installed (required for vSAN iSER): Intel Ethernet User Mode RDMA Provider
  • Driver settings as follows (showing only the settings that seem relevant, as far as I can judge):

    Code: Select all

    >Get-NetAdapterAdvancedProperty -Name "Ethernet 4" | select-object DisplayName, DisplayValue
    
    DisplayName                            DisplayValue
    -----------                            ------------
    Flow Control                           Rx & Tx Enabled
    Jumbo Packet                           9014 Bytes
    NetworkDirect Functionality            Enabled
    NetworkDirect Technology               RoCEv2
    ROCEv2 Frame Size                      4096
Image

A few notes:
  • At this stage, there is no failover cluster yet. I am simply testing vSAN without Hyper-V, without Failover Cluster role.
  • Both nodes have the same hardware and software settings.
  • The Intel adapters also support iWARP. Should I try iSER over iWARP instead of RoCEv2? I don't know if that is even an option.
  • RoCE v2 frame size 4096 is the highest setting. The driver offers these options: 1024, 2048, 4096.
Lastly, to give an example:
  • vSAN service is running
  • I stop the service. The service seemingly shuts down properly.
  • starwindservice.exe (PID 2372) is still running in the background, binding to RDMA port 3260.
  • I start the service and a new starwindservice.exe process (PID 3972) is started.
  • The process with the new PID cannot bind to the sync interfaces because the old process is still listening on the sync interfaces (192.168.51.2:3260, 192.168.52.2:3260)
  • You can see in task manager that starwindservice.exe is running twice.
  • I cannot even kill the process.
Image
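The sequence above can also be checked without Task Manager. A minimal sketch, assuming the `netstat -xan` column layout shown earlier in the thread (Mode, IfIndex, Type, Local Address, Foreign Address, PID); `Get-RdmaListenerPid` is a hypothetical helper name, not part of vSAN or Windows:

```powershell
function Get-RdmaListenerPid {
    param(
        [string[]]$NetstatLines,   # raw output lines of: netstat -xan
        [int]$Port = 3260          # iSCSI/iSER port used on the sync interfaces
    )
    foreach ($line in $NetstatLines) {
        # Unary -split breaks the line on whitespace and drops empty fields.
        $cols = -split $line
        # Expected fields: Mode, IfIndex, Type, LocalAddress, ForeignAddress, PID
        if ($cols.Count -ge 6 -and $cols[2] -eq 'Listener' -and $cols[3] -like "*:$Port") {
            [int]$cols[-1]
        }
    }
}

# Sample lines copied from the netstat output earlier in the thread.
$sample = @(
    '  Mode   IfIndex Type           Local Address          Foreign Address        PID',
    '  User        18 Listener       192.168.51.1:3260      NA                     2964',
    '  User         9 Listener       192.168.52.1:3260      NA                     2965'
)
$listenerPids = @(Get-RdmaListenerPid -NetstatLines $sample)
$listenerPids
```

With the sample lines this returns 2964 and 2965. On a live node, pass `(netstat -xan)` instead of `$sample` and compare the result with the service's current PID, for example `(Get-CimInstance Win32_Service -Filter "Name='StarwindService'").ProcessId`; a listener PID that differs from it would point to a leftover process.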
yaroslav (staff)
Staff
Posts: 2359
Joined: Mon Nov 18, 2019 11:11 am

Sat Mar 02, 2024 4:41 pm

My colleagues reported several "odd" things with Intel x810s and RDMA while testing NVMe-oF. I'd assume it to be relevant for iSER too. Also, make sure to have the latest drivers installed.
StarWind VSAN does not work over iWARP; you need to stick with RoCE. Try the default MTU values.
Let me know how it works with RoCE and default MTU values.
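A minimal sketch of going back to the default MTU, assuming the adapter name ("Ethernet 4") and the "Jumbo Packet" display name from the settings posted earlier in the thread. Whether the RoCEv2 frame size should be reset as well is not stated, so it is left untouched here:

```powershell
# Reset Jumbo Packet on the sync adapter to the driver default.
Reset-NetAdapterAdvancedProperty -Name "Ethernet 4" -DisplayName "Jumbo Packet"

# Confirm the resulting value.
Get-NetAdapterAdvancedProperty -Name "Ethernet 4" -DisplayName "Jumbo Packet" |
    Select-Object DisplayName, DisplayValue
```

The same change would need to be applied to every sync adapter on both nodes so that the two ends of each link agree.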
flashme44
Posts: 1
Joined: Wed Apr 03, 2024 4:57 pm

Wed Apr 03, 2024 5:05 pm

I can confirm that this problem exists. We see it with all of our customers who run bare-metal or PCIe-passthrough installations of StarWind VSAN on Windows with iSER-enabled adapters, regardless of NIC vendor. It happens on Mellanox/NVIDIA and Intel cards. As the thread opener mentioned, iSER sometimes does not work after the StarWind VSAN services are restarted. It is an annoying problem because fast sync and especially full sync then take longer. Hopefully NVMe-oF will be implemented soon. On your Facebook site, your marketing said it would be available in February. It is April now.
yaroslav (staff)
Staff
Posts: 2359
Joined: Mon Nov 18, 2019 11:11 am

Wed Apr 03, 2024 6:03 pm

Thanks for your post.
Sadly, I have no clear ETA for the build. Originally, it was February (end of Q1), but now it seems June.
SiegfriedB
Posts: 13
Joined: Thu Oct 12, 2023 6:20 pm

Mon Apr 08, 2024 4:37 pm

Thank you for the updated ETA.

As far as the original issue goes (why I created this post), lingering processes after restarting the service: I have been unable to reproduce the issue and I have moved on from testing with the Windows app to testing with the CVM, so I am unlikely to revisit this.
yaroslav (staff)
Staff
Posts: 2359
Joined: Mon Nov 18, 2019 11:11 am

Mon Apr 08, 2024 9:42 pm

Thank you for your update!

p.s. I will be glad to read your feedback about CVM.