Windows iSCSI Initiator 'Target Error'

jamestaylor
Posts: 14
Joined: Thu Aug 31, 2023 2:42 pm

Tue Sep 24, 2024 9:54 am

Hi all,

We've managed to get our VSAN CVM (Hyper-V) setup working really well so far and are super pleased with the results!

So far our issues have come down to hardware faults or our own misconfiguration.

One of the final remaining issues we have is that our Windows iSCSI initiators often return 'Target Error' when trying to make a new connection/session to VSAN.

I've read several times that the Windows iSCSI initiator implementation is flaky, so this may just be par for the course, but I thought I'd check...

We have two VSAN nodes and two data/target networks, so we have 4x paths to each HA target from our Windows initiator nodes.

After several attempts, and sometimes after leaving a given Windows initiator node for a few hours, we can get at least one connection/session to each VSAN node for a given target, but getting a session on each of the 4x paths per target can take a lot of attempts (luckily we're using PowerShell - our retry loop is sketched below).
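
For reference, the loop we use looks roughly like this (addresses redacted, and the path list trimmed; in reality it has one entry per portal/initiator pair):

Code: Select all

# Assumes the target portals were already discovered on this node.
# Hypothetical path list - ours covers all 4x portal/initiator pairs.
$target = 'iqn.2008-08.com.starwindsoftware:10.X.X.X-03-04-csv-02-02'
$paths = @(
    @{ Portal = 'X.X.3.173'; Initiator = 'X.X.3.106' }
    # ...remaining portal/initiator pairs...
)
foreach ($p in $paths) {
    for ($try = 1; $try -le 10; $try++) {
        try {
            Connect-IscsiTarget -NodeAddress $target `
                -TargetPortalAddress $p.Portal `
                -InitiatorPortalAddress $p.Initiator `
                -IsMultipathEnabled $true -IsPersistent $true `
                -ErrorAction Stop | Out-Null
            break   # session established, move on to the next path
        } catch {
            Start-Sleep -Seconds 5   # 'Target Error' - wait and retry
        }
    }
}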

Yaroslav previously helpfully advised that we should use multiple sessions/connections to improve performance for SSD/NVMe (we have both), but given we have 4x paths I'm not sure how many sessions we should have per path (2x would mean 8x per target, and so on) - if you could clarify that, Yaroslav (or someone else), it would be fantastic.

Here is an example of a PowerShell command which resulted in 'Target Error':

Code: Select all

Connect-IscsiTarget -NodeAddress 'iqn.2008-08.com.starwindsoftware:10.X.X.X-03-04-csv-02-02' -TargetPortalAddress 'X.X.3.173' -InitiatorPortalAddress 'X.X.3.106' -InitiatorInstanceName 'ROOT\iScsiPrt\0000_0' -IsMultipathEnabled $true -IsPersistent $true
In the logs of the VSAN node, we see:

Code: Select all

9/19 5:37:12.452106 9 Srv: iScsiServer::listenConnections: Accepted iSCSI connection from X.X.3.106:59753 to X.X.3.173:3260. (Id = 0x10b8)
9/19 5:37:12.452218 9 S[10b8]: iScsiSession::iScsiSession: Session (00007F36791C4C80)
9/19 5:37:12.452235 9 C[10b8], FREE: iScsiConnection::doTransition: Event - CONNECTED.
9/19 5:37:17.584373 a4 Srv: *** SwSocket::Recv: Swn_SocketRecv() failed with error 10035 (0x2733)!
9/19 5:37:17.584532 a4 C[10b8], XPT_UP: iScsiConnection::recvData: Recv returned 10035 (0x2733)!
9/19 5:37:17.584641 a4 C[10b8], XPT_UP: iScsiConnection::receive: recvData returned error 10035 (0x2733)!
9/19 5:37:17.584684 a4 C[10b8], XPT_UP: iScsiConnection::recvWorker: *** 'recv' thread: recv failed 10058.
9/19 5:37:17.685115 12c S[10b8]: iScsiSession::~iScsiSession: ~Session
I noticed they seem to time out after 5 seconds, so I tried changing the 'TcpRecvTimeout' setting in StarWind.cfg to "-20", but that just seemed to lengthen the time it took to fail (I think).

I also read in the forum that setting "TcpKeepAlivePeriod" to "20" could help with the Windows initiator, but that didn't work either (and the forum post noted the underlying issue was fixed in a VSAN version some time ago).

One complication here may be that our target/data networks are InfiniBand (IPoIB), though everything is very fast once the initiator connections are established, so I'm not sure that's the problem.

We are running "StarWind Virtual SAN (VSAN) v8.0.0 (Build 15469)" - I know there's a newer version, but we've just got it all running sweetly (apart from this issue) so I've been reluctant to update just yet! :D However, I can do that this weekend if you think it could fix this.

Any hints or tips much appreciated!

Many thanks

James
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Tue Sep 24, 2024 11:44 am

Hi,

Welcome back :)
One of the final remaining issues we have is that our Windows iSCSI initiators often return 'Target Error' when trying to make a new connection/session to VSAN.
Make sure the storage is synchronized and reachable over iSCSI, that no firewall is blocking the connection, and that the iSCSI IPs are supplied correctly.
Also make sure that IscsiDiscoveryInterfaces is set to 1 in StarWind.cfg. To check and change it safely:
1. Make sure HA devices are synchronized.
2. Make sure they are connected over iSCSI (a quick reachability check from the initiator side is sketched after this list).
3. Stop the starwind-virtual-san.service
4. Edit StarWind.cfg accordingly.
5. Start starwind-virtual-san.service.
6. Wait for fast sync to complete.
7. Repeat for the remaining VM.
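
To rule out a firewall or reachability problem on point 2, a quick check from the Windows initiator is something like this (a sketch - the portal IPs are placeholders, substitute your own):

Code: Select all

# Each target portal must answer on TCP 3260 before an iSCSI login can start
$portals = 'X.X.3.173', 'X.X.3.174'
foreach ($ip in $portals) {
    Test-NetConnection -ComputerName $ip -Port 3260 |
        Select-Object ComputerName, RemotePort, TcpTestSucceeded
}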
After several attempts, and sometimes after leaving a given Windows initiator node for a few hours, we can get at least one connection/session to each VSAN node for a given target, but getting a session on each of the 4x paths per target can take a lot of attempts (luckily we're using PowerShell).
Yes, the storage was not synchronized during the first test.
I noticed they seem to time out after 5 seconds, so I tried changing the 'TcpRecvTimeout' setting in StarWind.cfg to "-20", but that just seemed to lengthen the time it took to fail (I think).
Please be careful with this specific timeout.

In general, please review the settings in the CVM's StarWind.cfg and make sure synchronization is over.
jamestaylor
Posts: 14
Joined: Thu Aug 31, 2023 2:42 pm

Tue Sep 24, 2024 12:46 pm

Hi Yaroslav!

Nice to hear from you again :D Thanks for your help on this.
Make sure the storage is synchronized and reachable over iSCSI, that no firewall is blocking the connection, and that the iSCSI IPs are supplied correctly.
I'm confident all of these things were in order. There doesn't seem to be a reason for it to fail: sometimes it works, other times it doesn't.
Also make sure that IscsiDiscoveryInterfaces is set to 1 in StarWind.cfg.
This sounds promising! I will do as you say tonight outside business hours (just in case) and report back.
Yes, the storage was not synchronized during the first test.
Actually it was, or it appeared to be in the Management Console. Do the logs indicate that it wasn't?

Much appreciated :D

James
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Tue Sep 24, 2024 8:42 pm

Hi,

The symptoms say that there's something funny going on with iSCSI settings or connectivity to the storage.
Please keep me posted.
jamestaylor
Posts: 14
Joined: Thu Aug 31, 2023 2:42 pm

Wed Sep 25, 2024 8:24 am

Hi Yaroslav!

Unfortunately I had to deal with another issue last night, and when I stopped the service on the first node early this morning it failed to stop (the service stop timed out but the log file kept running; in the end I had to reboot the CVM to clear it).

So the first node is done now, but after the trouble stopping the first node I want to wait until tonight in case the second node also has issues stopping.

Will report back later tonight hopefully!

Thanks for your help

James
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Wed Sep 25, 2024 10:48 am

Hi,

That's weird. Can you please share the timestamp and the log from the affected VM with me? I need only a standard CVM support bundle. Let me know if you need assistance.
jamestaylor
Posts: 14
Joined: Thu Aug 31, 2023 2:42 pm

Wed Sep 25, 2024 1:13 pm

Hi Yaroslav,

Thanks - I've emailed the bundle and some supporting info to support@starwind.com (ref 1220872).

Much appreciated

James
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Sun Sep 29, 2024 4:57 pm

Hi everyone,

This is a known issue when there are too many iSCSI connections (internal and client): it results in too many file descriptors being opened, and if their number exceeds the limit, issues happen. Here's a workaround for CVM versions 15260-15554. It shows the limit on descriptors and the number currently open. Proceed only if the "actual" number matches the limit.

1. Make sure HA devices are synchronized and connected.
If they are not synchronized, an outage can happen, so proceed with caution. If some targets are not connected to all clients, please connect those.
You can find more details on this in iSCSI Initiators.
2. Stop the service:
systemctl stop starwind-virtual-san
3. Get the PID of the StarWind service:
pgrep StarWind
4. Using the output from the previous command, identify the number of hard and soft file descriptors available to the service:
prlimit -n --pid <pid-comes-here>
5. Go to our service's process directory and check the number of descriptors it has open:
cd /proc/ID_OF_SERVICE/fd
ls -l | wc -l
6. If the number of used descriptors is close to the limit, increase the number of descriptors for the service:
prlimit --pid <pid-comes-here> --nofile=65536:65536
7. To make the change persistent, go to /etc/systemd/system/starwind-virtual-san.service and adjust the LimitNOFILE parameter. By default this setting is 65536.
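
Once the limit has been raised and the service is running again, you can confirm from the Windows side that all expected paths logged in, with something like:

Code: Select all

# Count iSCSI sessions per target; with 4x paths per HA target, expect 4 each
Get-IscsiSession |
    Group-Object TargetNodeAddress |
    Select-Object Name, Count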
bouleslunch
Posts: 1
Joined: Mon Oct 07, 2024 11:04 am

Mon Oct 07, 2024 11:12 am

The target error was resolved when I followed your instructions. Thanks for the answers.
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Mon Oct 07, 2024 4:57 pm

Thanks for your update.
jamestaylor
Posts: 14
Joined: Thu Aug 31, 2023 2:42 pm

Wed Oct 09, 2024 8:30 am

Hi everyone,

For anyone else coming across this issue and finding that the file-descriptor fix Yaroslav posted above doesn't resolve it: in our case we also had an issue with the StarWind iSCSI Accelerator being installed on our Windows initiator machines.

Removing this fixed the "Target Error" problem for us. For historical (and now budget) reasons we use an InfiniBand (IPoIB) network to connect our initiators to our CVM/VSAN targets, which has a max MTU of 4k, and I suspect the iSCSI Accelerator and such a setup do not work well together.
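
If you want to verify what MTU your initiator interfaces are actually using, something like this works (the 'IPoIB' alias filter is just how our adapters happen to be named):

Code: Select all

# Show the MTU each matching IPv4 interface is using
Get-NetIPInterface -AddressFamily IPv4 |
    Where-Object InterfaceAlias -like '*IPoIB*' |
    Select-Object InterfaceAlias, NlMtu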

Once we removed the iSCSI Accelerator, my PS script connected all paths across all HV nodes in a matter of seconds - what a relief!

Special thanks to Yaroslav for his help getting to the bottom of this.

Thanks all

James
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Wed Oct 09, 2024 9:05 am

Hi,

Many thanks for the write-up.
onenerdyguy
Posts: 3
Joined: Tue Mar 12, 2024 1:03 am

Thu Oct 10, 2024 4:04 pm

Did you also have any issues with Persistent Reservations?

We're having one now: when we drain a node, some of the CSVs hosted on StarWind will randomly fail to release their SCSI PRs.

We do have a ticket open (ref 1226064), but the tech assigned seemed to miss that we're running on CVMs rather than the older-style Windows-based installer, and wanted us to whitelist .img and .swdisk files, which aren't applicable.
yaroslav (staff)
Staff
Posts: 2904
Joined: Mon Nov 18, 2019 11:11 am

Thu Oct 10, 2024 4:34 pm

Many thanks for your request. Please wait for the logs to be investigated.
jamestaylor
Posts: 14
Joined: Thu Aug 31, 2023 2:42 pm

Thu Oct 10, 2024 4:38 pm

Hi,

No, we didn't have any issues with PRs. Are you sure it's a PR issue though? I thought CSVs no longer use PRs (since 2008 R2):

https://ramprasadtech.com/cluster-shared-volume/

I think the VSAN logs show SCSI commands such as PRs; maybe check them to see if anything looks off.

James