Starwind, Hyper-V, and Windows 2003: delayed write failure

Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Fri May 15, 2009 4:57 pm

Hi,

Short: I think this is more a problem with Windows 2003, but it could perhaps be addressed by a new feature in Starwind.

Long:

I'm using Starwind as iSCSI storage for a Windows 2008 cluster running Hyper-V. On this cluster there are a number of virtual machines, mostly Windows 2008 and Windows 2003. The server running Starwind is running Windows 2008. It has a nice Areca RAID controller, and lots of disks, arranged as RAID 1 volumes.

Each VM has some config data, and one or more VHD files, held in a single iSCSI target served by Starwind. Each target is an img file.

I decided to move some of my less important .img files from a fast 10Krpm RAID 1 to a slower 7200rpm RAID 1. At the time, the VMs had been removed, so Hyper-V wasn't using the targets. First, I deleted the target in the Starwind console, then I moved the .img file from one disk to the other using Windows Explorer. Once it was copied, I recreated the target, pointing Starwind at the new location of the .img file.

While the .img was being copied, *all* the Windows 2003 VMs started having "Delayed Write Failures" to their local disks (which are actually VHDs stored on iSCSI targets). Windows 2008 VMs were fine. The VMs quickly became unstable and had to be rebooted. I've since repeated the test a few times, and every time I get the same problem. The only thing I haven't tried is restarting Starwind. I think that's what I will have to do: gracefully shut down every VM on the cluster, move the .img files around, reconfigure the targets on Starwind, restart Starwind, and finally start up the VMs.

I think the problem is that during the file copy there is a hell of a lot more disk activity than usual on both the source and destination disks, and because both of these were simultaneously being used by Starwind for the running VMs, delayed write failures started occurring. The cache on my RAID controller didn't help, as copying a large .img would quickly fill it. I think that because Starwind was unaware of the disk activity, it couldn't do anything. The Windows 2008 VMs didn't have a problem, so I guess Microsoft have made Windows a bit more tolerant in this situation. What I can't really explain is why some of my VMs, which were on a third RAID pair not involved in the move, had the same problem!

What Starwind could do to help, potentially, is have a function to move .imgs from disk to disk, so that initiators using targets held on the same disks can be told to expect problems, and hopefully that will stop this issue. If that's impossible (e.g. if iSCSI doesn't support that) then I don't know if there is a solution - apart from not using Windows 2003 VMs!

I'm also worried that any other disk activity on the Starwind box could cause similar problems - e.g. defragging a drive holding imgs.
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Fri May 15, 2009 6:05 pm

As another experiment, I'm trying to cause the disk activity from within Starwind. Instead of moving the .img, I'm creating a new one, with the "fill with zeroes" option to generate the disk activity.

This causes the same problem - delayed write errors in all Windows 2003 VMs, even when they are not running off the same RAID pair as the one where the .img is being created.

I can understand why there would be problems with the vhds in imgs on the same RAID pair, when there's a huge amount of disk activity. But I don't understand why VMs running off vhds on other targets, on different physical disks, would have a problem. Surely they get their own threads from Starwind, so the only thing they share is a RAID controller, which has ample bandwidth...
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Fri May 15, 2009 8:43 pm

More updates:

- a complete reboot of the Starwind box didn't cure the problem
- the problem seems to arise only when foreground i/o is happening on the Starwind box. So creating a new img that is zeroed out, or using Windows Explorer to move an img, can cause the problem. It doesn't seem to happen if the i/o is caused by a network copy, or if you have multiple VMs doing heavy i/o to iSCSI targets on the same RAID set...
- workaround for moving an img - use robocopy with the /ipg option. This slows down the copy. I'm trying a copy using /ipg:1000 (one second between packets) - very slow, but so far it hasn't caused any problems in Windows 2003 VMs running on the same or other RAID sets
- possible workaround for zeroing out a new img - don't; instead do a non-quick format from the initiator
- Windows 2008 VMs don't complain as much as 2003, but I've had stability issues with them, so they aren't perfect.
- if you get the delayed write error in a VM, you really must reboot it to get it stable again
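
For anyone who wants the same effect without robocopy, here is a minimal sketch of the idea behind the /ipg workaround: copy the file in small chunks and pause between chunks, so the copy never saturates the disks. This is an illustration only, not how robocopy works internally. The function name `throttled_copy` and the parameters `chunk_size` and `pause_s` are made up for this example, and the right values for a given RAID set would have to be found by experiment.

```python
import time


def throttled_copy(src, dst, chunk_size=1024 * 1024, pause_s=0.05):
    """Copy src to dst in chunks, sleeping after each chunk to limit the I/O rate.

    chunk_size and pause_s are illustrative defaults (assumptions), not tuned
    values. The average rate is capped at roughly chunk_size / pause_s bytes
    per second.
    Returns the number of bytes copied.
    """
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break  # end of file reached
            fout.write(chunk)
            copied += len(chunk)
            # Give other I/O on the box a chance to get through.
            time.sleep(pause_s)
    return copied
```

A call such as `throttled_copy(r"D:\img\vm1.img", r"E:\img\vm1.img", pause_s=0.1)` (hypothetical paths) would cap the copy at roughly 10 MB/s with the default 1 MB chunk. As with the robocopy approach, the target should not be in use while its .img is being moved.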
Robert (staff)
Posts: 303
Joined: Fri Feb 13, 2009 9:42 am

Mon May 18, 2009 11:15 am

Hello Aitor,

Thank you for the detailed description of the issue you have faced.

The issue may relate to the fact that even though you remove the iSCSI devices from the StarWind management console, the targets are still accessible via the StarWind service and the clients still actually interact with those targets. To avoid this, you would need to remove the iSCSI connections on the client side.

This is on our roadmap, and a workaround should be available in the next version of StarWind.

Thanks,
Rob.
Robert
StarWind Software Inc.
http://www.starwindsoftware.com
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Wed Jun 24, 2009 12:11 pm

Minor correction: network copies on the Starwind box can also cause the problem. Basically, avoid any heavy i/o on a Starwind box that isn't itself going through Starwind, and that means going via an initiator to a Starwind target (as zeroing out an img can also cause the problem). I think this is because Windows doesn't know to prioritise Starwind i/o above everything else. The weird thing is that I've seen this happen even when the i/o is to different physical disks than the ones holding the Starwind img files.

Perhaps this can be avoided by using pass-through disks, so that there's never any competition between starwind and other windows processes for the same spindles...

This is with Starwind 4.0, by the way; I've not put 4.1 through this test yet.
Robert (staff)
Posts: 303
Joined: Fri Feb 13, 2009 9:42 am

Thu Jun 25, 2009 12:15 pm

Aitor,

Thank you for the update.

This is something that we are going to reproduce in our test lab, among many other scenarios :). If you are willing to try this on 4.1 and inform us about the results - we would highly appreciate it.

Thanks
Robert