raw image file from v6 to v8

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Thu Oct 02, 2014 8:01 pm

Hi,

Just upgraded my v6 installation to v8 (7145) yesterday, it was configured with raw image file, and within 24h starwind died and reporting this error: (I had to force kill the starwind process and restart)

The log file name: 'logs\starwind-20141002-120918.log'
=========== The previous log file: 'logs\starwind-20141002-120659.log' ===========
StarWind Virtual SAN for vSphere v8.0.0 (Build 7145, [SwSAN], Win64)
Built Sep 5 2014 13:16:35
Windows Server 2008 R2 Enterprise Edition Service Pack 1 (Build 7601)
Kernel version information (kernel32.dll): 6.1.7601
10/2 12:09:18.871 22fc SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.871 1c2c SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.871 1718 SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.871 2254 SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.871 24c0 SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.886 3ad4 SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.902 28d4 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 3005.
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 2575.
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 1227.
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 2077.
10/2 12:09:18.902 28d4 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 3025.
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 2925.
10/2 12:09:18.902 2b40 IMG: *** ImageFile_IoCompleted: Error (1453) returned to IO completion!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ReadWriteWithCacheCompleted: Error (0xC0000001) returned to cache request completion!
10/2 12:09:18.902 1c2c SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_IoCompleted: Error (1453) returned to IO completion!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ReadWriteWithCacheCompleted: Error (0xC0000001) returned to cache request completion!
10/2 12:09:18.902 22fc SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.902 28d4 IMG: *** ImageFile_ScsiExec: Deferred error reported for session id 3005.
10/2 12:09:18.902 20a8 IMG: *** ImageFile_ScsiExec: 10/2 12:09:18.902 2b40 IMG: *** ImageFile_IoCompleted: Error (1453) returned to IO completion!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ReadWriteSectorsCompleted: Error occured (ScsiStatus = 2, DataTransferLength = 0)!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_ReadWriteWithCacheCompleted: Error (0xC0000001) returned to cache request completion!
Deferred error reported for session id 2575.
10/2 12:09:18.902 24c0 SCSI: VAAI C&W: WRITE (0x2A) returned CHECK CONDITION (3/0/0)!
10/2 12:09:18.902 2b40 IMG: *** ImageFile_IoCompleted: Error (1453) returned to IO completion!
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Thu Oct 02, 2014 8:31 pm

Even under normal operation , the log is filled with

10/2 13:25:09.633 1f20 IMG: *** ImageFile_ScsiExec: SCSIOP (0x4D) is not supported.
10/2 13:25:09.695 30a4 C[ec], LIN: Send: semaphore is not signalled!
10/2 13:25:09.851 3d48 IMG: *** ImageFile_ScsiExec: SCSIOP (0x4D) is not supported.
10/2 13:25:09.867 3bf8 IMG: *** ImageFile_ScsiExec: SCSIOP (0x4D) is not supported.
10/2 13:25:12.597 2748 C[16b], LIN: Send: semaphore is not signalled!
10/2 13:25:14.016 2198 C[11a], LIN: Send: semaphore is not signalled!
10/2 13:25:16.060 484 C[49], LIN: Send: semaphore is not signalled!
10/2 13:25:17.495 3254 C[1d6], LIN: Send: semaphore is not signalled!
10/2 13:25:18.150 2aa4 C[5b], LIN: Send: semaphore is not signalled!
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Fri Oct 03, 2014 5:18 am

ugh, it happened again within 8 hour.

I'm going to revert to v6 very soon...
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Sat Oct 04, 2014 12:54 am

OK, so I think the key bug here is

*** ImageFile_IoCompleted: Error (1453) returned to IO completion!

this is a obvious bug, as outlined in here: https://stackoverflow.com/questions/344 ... -set-quota

"
The problem is that ReadFile is getting called more times than GetOverlappedResult. Causing the process to run out of resources for dealing with all the outstanding reads.
"

Can some one from dev team comment if this will get fixed? otherwise I will have to revert to v6.
User avatar
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Tue Oct 07, 2014 1:33 pm

Hi! Is there any chance you could share the complete log with us? Uploading it to some file hosting and sharing the link should be sufficient.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Wed Oct 08, 2014 4:30 am

https://drive.google.com/file/d/0B2iH17 ... sp=sharing

I've revert back to v8 v6884. This version has the same issue : (search for 1453 error). but when it happens it doesn't hose the entire service, only disconnecting the erroring session. Please let me know if this log would be sufficient to help you pinpoint the issue.

But , as I said, make sure ReadFileEx & GetOverlappResult is always paired is the issue.
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Thu Oct 09, 2014 4:44 pm

resending the log in case you missed it

HERE is the log:

https://drive.google.com/file/d/0B2iH17 ... sp=sharing

Has there been any progress?
User avatar
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Mon Oct 13, 2014 11:46 am

The logs are passed to R&D now.

What I can say right now is for some reason that I can`t see any records about VAAI that you have mentioned initially.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Tue Oct 14, 2014 6:21 pm

Best I can debug externally is this is definitely related to "EXTENDED COPY" VAAI. ( clone new vm between two target on same machine).

both current version v8 and previous V8 has same issue, and they are both triggered by this, current version hang immediately, and previous version dropped the session and reestablish after a bit. but the CLONE will fail 99% time.
User avatar
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Thu Oct 16, 2014 10:59 am

I understand that, but the logs that you provided does`nt have a single string with "VAAI", so we cannot get to the bottom of the issue
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Fri Oct 17, 2014 2:25 am

my log files are gone, which I can't regenerate because I already switched back to V6 and don't want to risk problem again.

I was hoping you guys have plenty environment to replicate this setup , which IMO is really basic (1 starwind machines two target, with img device directly configured in v6 way, no swdsk thing. and have a vm vmotion between those two targets, the error shows up immediately. )

And judging from the error message, it is a really basic error, not a race condition or any of that, in my experience this error can also be found by close examine the code in question ,maybe you guys already did that.

Here I am just a long time user trying to report a look-to-me pretty serious defect, if i'm not being helpful then forget it.
User avatar
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Fri Oct 17, 2014 9:58 am

We will obviously try, but can I ask you to provide us with the detailed description of actions that you have done before facing the issue? Also we will require detailed diagram of your setup.

Thank you
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Fri Oct 17, 2014 9:18 pm

OK: here is more detailed version of my setup:

starwind machines with v6 software, have a bunch of img devices attached to separate targets , each with a read file cache

install starwind v8 over v6 software, the server came up without issue, all targets are fine too. Until after some clone operation triggers EXTENDED COPY (it may take several try) and soon everything staled .

starwind machine: 4 eth ports with different IP 10.X.X.5 - 8, esxi machine (no port binding) , is seeing 4 path and uses round-robin MPIO (this may or may not be important).
User avatar
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Mon Oct 20, 2014 6:40 pm

OK, looks like we got to the bottom of the issue. We`ll include the fix into the next minor update. Stay tuned!
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
sunyucong
Posts: 43
Joined: Mon Sep 12, 2011 8:21 am

Wed Oct 29, 2014 12:28 am

That's great, when can we expect a new release? the release cycle seems pretty long for such a critical issue: am I the only one that has run into this? What's so special about it? Any workaround?
Post Reply