Random Pausing under medium / high load

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Anatoly (staff), Max (staff)

liquidkristal
Posts: 6
Joined: Tue Sep 17, 2013 7:33 pm

Tue Sep 17, 2013 10:23 pm

Hi.

We have an interesting problem; the downside is that users are noticing it and complaining.

We have one StarWind node running version 6. It's a Windows 2008 R2 machine: 32 GB of memory, an AMD FX 6200 (3.8 GHz) on a Gigabyte board. Disks are handled by an LSI MegaRAID SAS 9260-16i with all 16 channels populated. The storage is configured as RAID 50, and the OS is served off a pair of SSDs in RAID 1.

We have 24 VMs running on the array as it stands and performance is OK (we have the 50% round-robin problem, but that's not my concern for now). My concern is that every few hours we get a pause: the performance graphs within StarWind drop to 0, and the latency for the disk / storage system on the VM hosts goes from around 20 ms to anywhere from 200 ms to 800 ms for about 3 seconds. Then it all goes back to normal.
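For anyone who wants to catch these pauses from exported latency counters, here is a minimal sketch. It assumes one latency sample per second in milliseconds, and it flags anything more than five times the median. The function name, the threshold factor, and the sample values are all made up for illustration; they are not measurements from this system.

```python
from statistics import median


def find_spikes(samples_ms, factor=5.0):
    """Return (index, value) pairs whose latency exceeds factor x the median."""
    baseline = median(samples_ms)
    return [(i, v) for i, v in enumerate(samples_ms) if v > factor * baseline]


# Hypothetical per-second latency samples in ms (illustrative, not measured).
samples = [20, 22, 19, 21, 450, 800, 230, 20, 21]
spikes = find_spikes(samples)

for index, value in spikes:
    print(f"sample {index}: {value} ms")
```

Using the median as the baseline keeps a short burst of very high values from dragging the threshold up with it.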

The switches are Cisco 2960S with all traffic contained within one switch. I'm monitoring the switch to see if it's having any buffer issues, but I thought I'd ask the question. Flow control is off and, at the moment, jumbo frames are off as well.

Any ideas?
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Tue Sep 17, 2013 10:36 pm

Is there any chance you could run a heavy load test from the VMs, say on a weekend, to see whether it happens sooner, and also provide us with a StarWind log (it will be HUGE, so make sure you zip it and put it up for download at your location)? If there are aborted commands, we'll see them. Also, please let us know the amount of cache you use and the cache policy. Thanks!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

liquidkristal
Posts: 6
Joined: Tue Sep 17, 2013 7:33 pm

Tue Sep 17, 2013 10:44 pm

I know posting of logfiles is frowned upon, but here is a tiny section:

9/17 23:10:49.973 9a0 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:10:49.989 9a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:10:57.602 9a8 IMG: *** ImageFile_ScsiExec: WRITE_SAME (0x93) is not supported.
9/17 23:11:03.233 9a0 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:11:03.249 9a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
I can post a full log later on.

I get a lot of that. Caching on the device is write-back and the cache size is 512. It's the free version while we evaluate it fully (so that drops to something smaller, if memory serves).
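To see how much of a large log is just these repeating notices, a rough tally along the following lines may help. It is only a sketch: the regular expression is inferred from the five lines above (date, time, an ID field, a module tag, then the handler name and message), so the field layout is an assumption and other line types in a real log may not match it.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above; the field meanings are assumptions.
LINE_RE = re.compile(
    r"^(\d+/\d+) (\d+:\d+:\d+\.\d+) (\w+) (\w+): \*\*\* (\w+): (.*)$"
)


def tally(lines):
    """Count occurrences of each (handler, message) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group(5), match.group(6))] += 1
    return counts


log = """\
9/17 23:10:49.973 9a0 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:10:49.989 9a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:10:57.602 9a8 IMG: *** ImageFile_ScsiExec: WRITE_SAME (0x93) is not supported.
9/17 23:11:03.233 9a0 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:11:03.249 9a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
"""

counts = tally(log.splitlines())
for (handler, message), n in counts.most_common():
    print(n, handler, message)
```

Anything left over after the frequent pairs are set aside is a reasonable place to start looking around the time of a pause.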
kspare
Posts: 60
Joined: Sun Sep 01, 2013 2:23 pm

Tue Sep 17, 2013 11:12 pm

Are you using thin or eager disks?
Ironwolf
Posts: 59
Joined: Fri Jul 13, 2012 4:20 pm

Wed Sep 18, 2013 3:57 pm

Just a thought here,

Sounds like one of the HDDs is getting ready to fail, or the controller is way too hot. LSI MegaRAIDs do not have an adequate heat sink and airflow. We had to add an auxiliary fan next to our controllers just to keep them from overheating.

Open MegaRAID Storage Manager and, on the Physical tab, scan through the attached drives and look at Temperature, Media Error Count, and Pred Fail Count. For the controller, look at the Chip Temperature: the reading maxes out at about 192°F and goes no higher. With our aux fans blowing directly on the heat sinks, we can get it down to 161°F, but per LSI's documentation 140°F is the maximum operating temperature. We don't have any problems at 161°F, but when it's off the scale the card repeatedly reboots on us, causing delays and hiccups.
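In case your tools report Celsius instead, here is a quick conversion of the three figures above. The dictionary labels are only illustrative names for the numbers quoted in this post.

```python
def f_to_c(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


# The three Fahrenheit figures quoted above; labels are illustrative.
readings_f = {
    "reading ceiling": 192,
    "with auxiliary fans": 161,
    "documented maximum": 140,
}

celsius = {label: round(f_to_c(value), 1) for label, value in readings_f.items()}

for label, value in celsius.items():
    print(f"{label}: {value} C")
```

That works out to roughly 88.9, 71.7 and 60.0 degrees Celsius respectively.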
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Sep 18, 2013 5:29 pm

These ones are not related to the issue.

P.S. One of the reasons we want to have full logs :)
liquidkristal wrote: I know posting of logfiles is frowned upon, but here is a tiny section:

9/17 23:10:49.973 9a0 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:10:49.989 9a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:10:57.602 9a8 IMG: *** ImageFile_ScsiExec: WRITE_SAME (0x93) is not supported.
9/17 23:11:03.233 9a0 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
9/17 23:11:03.249 9a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
I can post a full log later on.

I get a lot of that. Caching on the device is write-back and the cache size is 512. It's the free version while we evaluate it fully (so that drops to something smaller, if memory serves).
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Sep 18, 2013 5:30 pm

Very probable. That's why we want the full log, with any reads or writes reported as delayed.
Ironwolf wrote: Just a thought here,

Sounds like one of the HDDs is getting ready to fail, or the controller is way too hot. LSI MegaRAIDs do not have an adequate heat sink and airflow. We had to add an auxiliary fan next to our controllers just to keep them from overheating.

Open MegaRAID Storage Manager and, on the Physical tab, scan through the attached drives and look at Temperature, Media Error Count, and Pred Fail Count. For the controller, look at the Chip Temperature: the reading maxes out at about 192°F and goes no higher. With our aux fans blowing directly on the heat sinks, we can get it down to 161°F, but per LSI's documentation 140°F is the maximum operating temperature. We don't have any problems at 161°F, but when it's off the scale the card repeatedly reboots on us, causing delays and hiccups.
liquidkristal
Posts: 6
Joined: Tue Sep 17, 2013 7:33 pm

Wed Sep 18, 2013 9:59 pm

The disks are all thick / lazy drives.

The controller doesn't report its temperature. The disks are all at 27°C and all show as healthy, with no media / SMART errors.

I've had a look through the logs: no reads / writes are reported as delayed. One thing I have noticed is that the switch ports the StarWind box is connected to are recording a load of output drops.

My initial guess is that StarWind is working as expected, as is the RAID controller, but the switch (Cisco 2960S) is being overloaded, causing momentary high latency.
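To check whether the drops line up with the pauses, one option is to capture the interface counters before and after an event and compare them. The sketch below assumes the counter appears in the captured text as "Total output drops: N", as in Cisco IOS `show interfaces` output. The two sample lines and their numbers are invented for illustration and are not readings from this switch.

```python
import re

# Assumes the counter is printed as "Total output drops: N".
DROPS_RE = re.compile(r"Total output drops:\s*(\d+)")


def output_drops(text):
    """Return the output-drop counter found in the text, or None if absent."""
    match = DROPS_RE.search(text)
    return int(match.group(1)) if match else None


# Hypothetical captures taken before and after a pause (numbers are invented).
before = "Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 1200"
after = "Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 1950"

delta = output_drops(after) - output_drops(before)
print(f"output drops during the interval: {delta}")
```

If the counter only jumps during the 3-second pauses and stays flat otherwise, that would point at the switch rather than the storage box.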
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Sep 18, 2013 10:11 pm

That's easy to isolate. Provide an extra physical path for data and make StarWind / the hypervisor use both.
liquidkristal wrote: The disks are all thick / lazy drives.

The controller doesn't report its temperature. The disks are all at 27°C and all show as healthy, with no media / SMART errors.

I've had a look through the logs: no reads / writes are reported as delayed. One thing I have noticed is that the switch ports the StarWind box is connected to are recording a load of output drops.

My initial guess is that StarWind is working as expected, as is the RAID controller, but the switch (Cisco 2960S) is being overloaded, causing momentary high latency.
liquidkristal
Posts: 6
Joined: Tue Sep 17, 2013 7:33 pm

Wed Sep 18, 2013 10:31 pm

There are 4 hypervisors. I'll need to look into swapping the switch out or enabling jumbo frames across all devices (they are currently disabled).
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Sep 19, 2013 6:11 pm

Good idea. Please let us know whether it worked. Thank you!
liquidkristal wrote: There are 4 hypervisors. I'll need to look into swapping the switch out or enabling jumbo frames across all devices (they are currently disabled).