Starwind 3.1.4 Crashing Service

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

Post Reply
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Wed Nov 01, 2006 3:07 pm

Hi,
I am currently using Starwind to do something quite adventurous but every so often I keep running into problems with Starwind Service Crashing.

The setup:-
Windows Server 2003 R2, Starwind 3.1.4. SQL 2005 SP1 & SQL2000 SP4.
We using Veritas Storage Foundation HA 4.3 to Cluster 4 SQL servers together (2 seperate clusters). 2 physical to 2 Virtual Servers. The Guests sit on a Single Host. Both the Physical servers, and the Virtual Host are SAN attached and ALL the LUNs are presented to the Host, as well as the necessary LUNS presented to the Physicals. Starwind is running on the Virtual Host with all the Data LUNS configured for SPTI. MS ISCSI Initiator in running on the 2 guests. Failover with Veritas does work, but is a bit of a nightmare to set up. The SQL2005 Cluster has the data volumes configured as mountpoints. The SQL2000 Clustes has the data volumes configured as Drive letters. THe issues:-

1. What maybe causing the Starwind Service to Crash?
2. When the Starwind Service crashes, because the Luns maybe in use on the Physical Nodes, Starwind cannot grab these so the SPTI setting are lost. I have to shut down the cluster so that the Luns go "deported" on the physical nodes and re share them manually, or dont re-start the Starwind Service until the Luns are "deported". I also has to do this when setting up the disks originally.

I am surprised to get this error because the prior configuration was Windows 2003 SP1 (Pre R2), Starwind 2.6.4 and SQL2000 and I had none of these issues. If the Virtual host was rebooted, Starwind started up fine, and I was able to SPTI Luns even if they were on use on the Physical Node. Once it was setpu and running, it was fine and needed no manual intervention.

I upgraded to 3.1.4 as 2.6.4 seemed to have issues on Windows 2003 R2, but 3.1.4 seemed to cure them initially, but has brought this connection loss problem into the equation.

Can anyone help?
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Wed Nov 01, 2006 3:23 pm

Here is a snippet of the tail end of a 350meg log file

10/26 20:12:01 a84 Params: <<< String param 'TargetName': received 'virlp5001.prertdom.com.spti:disk.physicaldrive30', accepted 'virlp5001.prertdom.com.spti:disk.physicaldrive30'
10/26 20:12:01 238 Srv: Accepted connection from 10.160.218.77:2805. id assigned - 186674
10/26 20:12:01 a84 Params: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
10/26 20:12:01 a84 S: virlp5001.prertdom.com.spti:disk.physicaldrive30: 1 session(s) are opened, 0 more allowed.
10/26 20:12:01 a84 SPTI: '\\.\PhysicalDrive30': sessions: 1.
10/26 20:12:01 a84 SPTI: '\\.\PhysicalDrive30': Test failed, the device cannot execute SCSI command. (SCSI status 0x0).
10/26 20:12:01 a84 SPTI: '\\.\PhysicalDrive30': openDevice() failed!
10/26 20:12:01 a84 S[0x2ea41]: *ERROR* can't register Session. The device '\\.\PhysicalDrive30' may be owned by a local process!
10/26 20:12:01 a84 S: virlp5001.prertdom.com.spti:disk.physicaldrive30: 0 session(s) are opened, 1 more allowed.
10/26 20:12:01 a84 T[0x4f83]: *ERROR* Login request: device open failed.
10/26 20:12:01 238 C[0x2d932], FREE: Event - CONNECTED.
10/26 20:12:01 238 C[0x2d932], XPT_UP: T3.
10/26 20:12:01 5f0 C[0x2d931], IN_LOGIN: Event - LOGIN_REJECT.
10/26 20:12:01 5f0 C[0x2d931], FREE: T7.
10/26 20:12:01 17d4 C[0x2d932], XPT_UP: Login request: ISID 0x400001370000, TSIH 0x0000.
10/26 20:12:01 17d4 C[0x2d932], XPT_UP: Event - LOGIN.
10/26 20:12:01 17d4 C[0x2d932], IN_LOGIN: T4.
10/26 20:12:01 17d4 Params: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:dbslp5501.prertdom.com', accepted 'iqn.1991-05.com.microsoft:dbslp5501.prertdom.com'
10/26 20:12:01 17d4 Params: <<< Enum param 'SessionType': received 'Discovery', accepted 'Discovery'
10/26 20:12:01 17d4 Params: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
10/26 20:12:01 17d4 Params: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
10/26 20:12:01 17d4 T[0x1]: Anonymous connection allowed.
10/26 20:12:01 17d4 Params: >>> AuthMethod=None.
10/26 20:12:01 13c8 T[0x1]: The end of stage 0, next stage is 1.
10/26 20:12:01 a84 C[0x2d931], FREE: recvData() error (10058)
10/26 20:12:01 17d4 Params: <<< Enum param 'HeaderDigest': received 'None,CRC32C', accepted 'None'
10/26 20:12:01 17d4 Params: <<< Enum param 'DataDigest': received 'None,CRC32C', accepted 'None'
10/26 20:12:01 17d4 Params: <<< Numeric param 'MaxRecvDataSegmentLength': received 65536, accepted 65536
10/26 20:12:01 17d4 Params: <<< Numeric param 'DefaultTime2Wait': received 0, accepted 2
10/26 20:12:01 17d4 Params: <<< Numeric param 'DefaultTime2Retain': received 60, accepted 20
10/26 20:12:01 17d4 Params: >>> HeaderDigest=None.
10/26 20:12:01 17d4 Params: >>> DataDigest=None.
10/26 20:12:01 17d4 Params: >>> InitialR2T=No.
10/26 20:12:01 17d4 Params: >>> ImmediateData=Yes.
10/26 20:12:01 17d4 Params: >>> MaxRecvDataSegmentLength=65536.
10/26 20:12:01 17d4 Params: >>> MaxBurstLength=262144.
10/26 20:12:01 17d4 Params: >>> FirstBurstLength=65536.
10/26 20:12:01 17d4 Params: >>> DefaultTime2Wait=2.
10/26 20:12:01 17d4 Params: >>> DefaultTime2Retain=20.
10/26 20:12:01 17d4 Params: >>> MaxOutstandingR2T=1.
10/26 20:12:01 17d4 Params: >>> DataPDUInOrder=Yes.
10/26 20:12:01 17d4 Params: >>> DataSequenceInOrder=Yes.
10/26 20:12:01 13c8 T[0x1]: The end of stage 1, next stage is 3.
10/26 20:12:01 13c8 C[0x2d932], IN_LOGIN: Event - LOGIN_ACCEPT.
10/26 20:12:01 13c8 C[0x2d932], LOGGED_IN: T5.
10/26 20:12:01 13c8 C[0x2d932], LOGGED_IN: Event - LOGOUT.
10/26 20:12:01 13c8 C[0x2d932], IN_LOGOUT: T9/10.
10/26 20:12:01 bf4 C[0x2d932], IN_LOGOUT: Event - LOGOUT_ACCEPT.
10/26 20:12:01 bf4 C[0x2d932], FREE: T13.
10/26 20:12:01 238 Srv: accept() failed: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (code: 10055).
10/26 20:12:01 14dc conf: Stop listening to control connections.
10/26 20:12:01 238 Srv: Timeout 10 second(s)...
10/26 20:12:11 238 Srv: Wait for shutdown failed. Terminating....
User avatar
anton (staff)
Site Admin
Posts: 4010
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Wed Nov 01, 2006 8:50 pm

We'll check this out, thank you for pointing... :(
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Image
Val (staff)
Posts: 496
Joined: Tue Jun 29, 2004 8:38 pm

Wed Nov 01, 2006 11:57 pm

Hi,

1) Try to use ImageFile targets instead of direct mapping of physical drives with SPTI.

2) I've sent a link to the recent StarWind RC build to you via PM.
Please try it and let us know whether it hepls to solve the issues.

Thank you.
Best regards,
Valeriy
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Thu Nov 02, 2006 12:51 am

I'll give the RC a go as soon as I get chance. I'm on holiday for a couple of days now.

Valery - Image files are not a viable option. Because we are using Veritas Clustering on a SAN based system, we have got Windows volumes Mirrored up. Starwind is installed on our Virtual Host and is acting as a "bridge" between the Physical SAN attached servers, and our Virtual Guests which sit on the Host and use MS ISCSI Initator to see the required LUNS. THe Virtaul Host can see all the allocated SAN Luns. We just have them split between the 2 clusters. It sounds more complicated than it actually is.
Val (staff)
Posts: 496
Joined: Tue Jun 29, 2004 8:38 pm

Thu Nov 02, 2006 10:25 pm

> *ERROR* can't register Session. The device '\\.\PhysicalDrive30' may be owned by a local process!

This means that the drive is in use in the physical server.
It's not possible to access the disk via SPTI mapping in the case.
Best regards,
Valeriy
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Mon Nov 06, 2006 1:18 pm

valery (staff) wrote:> *ERROR* can't register Session. The device '\\.\PhysicalDrive30' may be owned by a local process!

This means that the drive is in use in the physical server.
It's not possible to access the disk via SPTI mapping in the case.
Version 2.6.x used to do this. Version 3.x does not.
Val (staff)
Posts: 496
Joined: Tue Jun 29, 2004 8:38 pm

Mon Nov 06, 2006 1:37 pm

primey wrote:
valery (staff) wrote:> *ERROR* can't register Session. The device '\\.\PhysicalDrive30' may be owned by a local process!

This means that the drive is in use in the physical server.
It's not possible to access the disk via SPTI mapping in the case.
Version 2.6.x used to do this. Version 3.x does not.
The implementation of the feature in StarWind 2.6 was unsafe.
This feature has been corrected so the server must lock all the volumes from the shared SPTI drive before it allows the remote client to access it.
Otherwise the drive's logical structure can be damaged when the local system and the remote one will write to it.

In your case the drive was being used by something, so it failed to be locked.
Last edited by Val (staff) on Mon Nov 06, 2006 1:58 pm, edited 1 time in total.
Best regards,
Valeriy
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Mon Nov 06, 2006 1:45 pm

Thanks for the quick reply

Is it possible to have this feature re-enabled or identify a workaround as it worked perfectly for us under 2.6.x. This locking issue is causing more of a headache since upgrading. We never had an issue with it causing corruption.

Is it also possible to identify why the Starwind Service is stopping in the first instance. If we could trap this, then it buys me some time in one o our environments
Val (staff)
Posts: 496
Joined: Tue Jun 29, 2004 8:38 pm

Mon Nov 06, 2006 2:11 pm

primey wrote:Thanks for the quick reply

Is it possible to have this feature re-enabled or identify a workaround as it worked perfectly for us under 2.6.x. This locking issue is causing more of a headache since upgrading. We never had an issue with it causing corruption.

Is it also possible to identify why the Starwind Service is stopping in the first instance. If we could trap this, then it buys me some time in one o our environments
We could re-enable the mode in the next version, but anyway it's unsafe and does able to lead to data losses.

According to the problem with unstable StarWind 3.1.4 under 2003 R2. -- Have you tried this version with 2003 SP1 to be sure that no OS relaited issues with your storage are there?
This could shed some light on the problem...
Best regards,
Valeriy
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Mon Nov 06, 2006 2:16 pm

Hi,
"I am surprised to get this error because the prior configuration was Windows 2003 SP1 (Pre R2), Starwind 2.6.4 and SQL2000"

This was in my original post so we have used SP1.

We have been getting 300meg log files generated. Is there anything I could scan within them?

Cheers - Paul
primey
Posts: 9
Joined: Mon Oct 24, 2005 1:58 pm

Mon Nov 06, 2006 2:31 pm

I'd like to to see it re-introduced.

We use Veritas Clustering and this has Intelligent disk handling and working with Starwind 2.6.4 performed faultlessly between our Physical and Virtual Nodes in the Cluster. Not one bit of Disk Corruption occurred

Starwind 3.x under R2 has brought with it quite a few issues which I'm trying to resolve. How soon could this be re-implemented in the next point release of Starwind?
Val (staff)
Posts: 496
Joined: Tue Jun 29, 2004 8:38 pm

Mon Nov 06, 2006 2:38 pm

primey wrote:I'd like to to see it re-introduced.

We use Veritas Clustering and this has Intelligent disk handling and working with Starwind 2.6.4 performed faultlessly between our Physical and Virtual Nodes in the Cluster. Not one bit of Disk Corruption occurred

Starwind 3.x under R2 has brought with it quite a few issues which I'm trying to resolve. How soon could this be re-implemented in the next point release of Starwind?
I think we'll be able to provide you a test version with this feature enabled in 1-2 days.
I'll let you know when it's ready.
Best regards,
Valeriy
Post Reply