Guide to recovering a single node in a 3-node HA config

Software-based VM-centric and flash-friendly VM storage + free version


ghamner
Posts: 5
Joined: Fri Nov 02, 2012 5:00 pm

Wed Jan 02, 2013 4:26 pm

This is my first post, and I'm very new to SANs in general and StarWind in particular, so I apologize if I'm asking amateur questions.

I have StarWind configured with 3-node mirroring. Each storage node stores the IMG file on a RAID-0 array across 4 drives for performance reasons, leaving data redundancy up to StarWind HA syncing, and this is working great and performing better than expected. However, since the individual nodes have no internal redundancy capabilities on the IMG storage location, I wondered whether there was a procedure/guide on how to recover a single node in this configuration. I am seeing a high SMART error rate on 2 drives on one of my storage nodes and want to replace the drives. Obviously replacing the drives will bring the RAID-0 array down and the storage node itself will no longer have access to its mirrored IMG files (the OS boots from a different set of drives in RAID-10 so the system itself will be running while the data drives are pulled).

So, in short: is there a procedure for replacing and re-syncing a storage node when 1 of the 3 nodes loses its IMG files? And how does this affect servers using MPIO that may see the new node before it is placed back as a sync partner with the other 2 nodes?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Jan 09, 2013 12:35 pm

Welcome on board sir! :)
Actually the procedure is quite simple: bring the broken node down, and after you bring that node back online with the rebuilt RAID, add it as a partner through the Replication Manager (right-click the device and choose the corresponding option). This should be done for every device.

This procedure doesn't require any downtime.
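
For the MPIO part of the original question, a quick sanity check on the Windows initiator side is to list the iSCSI sessions and MPIO paths it currently sees, before and after the rebuilt node is re-added as a partner. Below is a minimal sketch, not a StarWind tool: it assumes Python 3.7+ on the initiator and only shells out to the built-in iscsicli.exe and mpclaim.exe (MPIO feature) utilities; the re-add itself is still done through the Replication Manager as described above.

# Illustrative client-side check only (assumption: Python 3.7+, Windows
# built-ins iscsicli.exe and mpclaim.exe are available). Prints the current
# iSCSI sessions and the MPIO disk/path summary so you can confirm the
# rebuilt node's path is visible again before relying on it.
import subprocess

def run(cmd):
    """Run a command and return its text output, or a short failure note."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        return "<{} failed: {}>".format(cmd[0], exc)

if __name__ == "__main__":
    print("=== iSCSI sessions (iscsicli SessionList) ===")
    print(run(["iscsicli", "SessionList"]))
    print("=== MPIO disks and paths (mpclaim -s -d) ===")
    print(run(["mpclaim", "-s", "-d"]))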
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
ghamner
Posts: 5
Joined: Fri Nov 02, 2012 5:00 pm

Wed Jan 09, 2013 7:09 pm

That was indeed much simpler than I had thought. I really like StarWind so far; the interface is extremely intuitive and it just seems to make sense.

I have my 3 nodes synchronized again on all good drives and everything seems to be OK from the StarWind side, but now I have an unusual issue when the initiator on one of my servers connects. (I have only been able to test this on one server so far, because it results in a BSOD that would cause production issues if I tried it on another server; however, I do not think the problem is with the server itself.)

If I try to connect a second MPIO path to any of the targets that were re-synced, I get a BSOD as described in Microsoft KB2277440, which mentions something about MPIO on ALUA-enabled targets. After creating a completely new target with 3-node HA, I can connect to all 3 servers with MPIO and it works as expected. Unfortunately, when I try to install the hotfix from the KB referenced above, it tells me that it is not applicable to my system (Server 2008 R2 x64 Datacenter). The default ALUA option was left as-is (unchecked) when I created the targets in the first place.

UPDATE: I tested on a newly set up VM (Srv 08 R2 Standard) and a newly created 3-way target, and the same thing happens on that setup too. I think the working setup above (the new target) was coincidental rather than reproducible, since it has now BSOD'd on both an existing target and a completely new one. I found MS KB2511962, whose hotfix does install, but that does not correct the issue either. I am not sure where to look for the root cause and am at a loss as to why this is happening all of a sudden. It only throws a fit when I connect another path to the same target; running on just one path is stable. I am leaning towards suspecting something on the MS/Windows side of things, but some input from seasoned SAN experts would be greatly appreciated.
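
Since the hotfix applicability messages are confusing here, one quick thing to verify is whether the KBs mentioned above actually appear in the installed-updates list on the initiator. A minimal sketch, assuming Python 3.7+ and the built-in wmic qfe query; note that a listed hotfix only means it was installed, not that the running msdsm.sys is the patched build.

# Illustrative check (assumption: Python 3.7+, run from an elevated prompt):
# report whether the hotfixes discussed above show up in "wmic qfe get HotFixID".
import subprocess

WANTED = {"KB2277440", "KB2511962"}  # KB numbers mentioned in this thread

def installed_hotfixes():
    """Return the set of KB IDs reported by wmic qfe."""
    out = subprocess.run(
        ["wmic", "qfe", "get", "HotFixID"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip().startswith("KB")}

if __name__ == "__main__":
    present = installed_hotfixes()
    for kb in sorted(WANTED):
        print("{}: {}".format(kb, "installed" if kb in present else "NOT installed"))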
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Fri Jan 11, 2013 2:58 pm

So, just to clarify: when you try to connect to the target, the client machine gets a BSOD?
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
ghamner
Posts: 5
Joined: Fri Nov 02, 2012 5:00 pm

Fri Jan 11, 2013 3:25 pm

Correct, when connecting to a second path the client BSODs.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Jan 14, 2013 2:58 pm

OK, and may I ask whether you have debugged the reason for the BSOD?
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
ghamner
Posts: 5
Joined: Fri Nov 02, 2012 5:00 pm

Mon Jan 14, 2013 5:40 pm

Yes, it's a 0xD1 stop (DRIVER_IRQL_NOT_LESS_OR_EQUAL) and the module msdsm.sys is identified as the probable cause. This seems to be the exact issue described in MS KB2511962, but applying that hotfix does not resolve it. Here is the full text of the WinDbg results:


Microsoft (R) Windows Debugger Version 6.2.9200.20512 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Users\ghamner\Desktop\SANTEST minidumps\011413-17609-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 7601.17944.amd64fre.win7sp1_gdr.120830-0333
Machine Name:
Kernel base = 0xfffff800`01c54000 PsLoadedModuleList = 0xfffff800`01e98670
Debug session time: Mon Jan 14 11:10:49.023 2013 (UTC - 6:00)
System Uptime: 3 days 22:21:40.296
Loading Kernel Symbols
...............................................................
................................................................

Loading User Symbols
Loading unloaded module list
....
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck D1, {0, 2, 0, fffff880012aad27}

Probably caused by : msdsm.sys ( msdsm!DsmpUpdateTargetPortGroupEntry+2b3 )

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 0000000000000000, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff880012aad27, address which referenced memory

Debugging Details:
------------------


READ_ADDRESS: GetPointerFromAddress: unable to read from fffff80001f02100
GetUlongFromAddress: unable to read from fffff80001f021c0
0000000000000000 Nonpaged pool

CURRENT_IRQL: 2

FAULTING_IP:
msdsm!DsmpUpdateTargetPortGroupEntry+2b3
fffff880`012aad27 488b01 mov rax,qword ptr [rcx]

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT_SERVER

BUGCHECK_STR: 0xD1

PROCESS_NAME: System

TRAP_FRAME: fffff8800210ee50 -- (.trap 0xfffff8800210ee50)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa8002441cf8 rbx=0000000000000000 rcx=0000000000000000
rdx=fffffa800243de91 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880012aad27 rsp=fffff8800210efe0 rbp=fffffa8002441cf8
r8=fffffa800243de90 r9=0000000000000050 r10=fffff80001e46880
r11=fffffa80022f80a0 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl nz na pe cy
msdsm!DsmpUpdateTargetPortGroupEntry+0x2b3:
fffff880`012aad27 488b01 mov rax,qword ptr [rcx] ds:00000000`00000000=????????????????
Resetting default scope

LOCK_ADDRESS: fffff80001eceb80 -- (!locks fffff80001eceb80)

Resource @ nt!PiEngineLock (0xfffff80001eceb80) Available

WARNING: SystemResourcesList->Flink chain invalid. Resource may be corrupted, or already deleted.


WARNING: SystemResourcesList->Blink chain invalid. Resource may be corrupted, or already deleted.

1 total locks

PNP_TRIAGE:
Lock address : 0xfffff80001eceb80
Thread Count : 0
Thread address: 0x0000000000000000
Thread wait : 0x0

LAST_CONTROL_TRANSFER: from fffff80001cd2569 to fffff80001cd2fc0

STACK_TEXT:
fffff880`0210ed08 fffff800`01cd2569 : 00000000`0000000a 00000000`00000000 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff880`0210ed10 fffff800`01cd11e0 : ffff9660`b352b000 fffff800`01dfefbd 00000000`00000000 fffffa80`02441cd0 : nt!KiBugCheckDispatch+0x69
fffff880`0210ee50 fffff880`012aad27 : fffffa80`02441cd0 fffffa80`01eafcb6 fffffa80`01eafca4 00000000`00000001 : nt!KiPageFault+0x260
fffff880`0210efe0 fffff880`012aa78e : fffffa80`018a2cc0 fffffa80`00000000 fffffa80`02441cd0 fffff880`00000001 : msdsm!DsmpUpdateTargetPortGroupEntry+0x2b3
fffff880`0210f050 fffff880`012a6384 : fffffa80`018a2cc0 fffff880`012c0110 fffffa80`00000014 fffffa80`00000028 : msdsm!DsmpParseTargetPortGroupsInformation+0x19a
fffff880`0210f0e0 fffff880`0101a49d : 00000000`00000000 fffffa80`0288f2c0 fffffa80`4f49504d 00000000`00000000 : msdsm!DsmInquire+0xbfc
fffff880`0210f2c0 fffff880`0100f41a : fffffa80`01aa0050 00000000`00000000 00000000`00000000 fffffa80`00000002 : mpio!MPIOAddSingleDevice+0x191
fffff880`0210f390 fffff880`0100e776 : 00000000`00000000 fffffa80`0287c480 fffff880`0197d500 00000000`00000000 : mpio!MPIODeviceRegistration+0x82
fffff880`0210f400 fffff880`0100e7dd : fffffa80`01aa0050 00000000`00000020 fffffa80`01bc8b70 fffff880`0210f490 : mpio!MPIOFdoDispatch+0xd6
fffff880`0210f430 fffff880`019746be : fffff880`01127d80 0000057f`fe14fa88 fffffa80`02846040 00000000`00000000 : mpio!MPIOGlobalDispatch+0x19
fffff880`0210f460 fffff880`0198fd91 : fffffa80`02b25ea0 fffffa80`02b25ea0 fffffa80`0287c480 00000000`000007ff : CLASSPNP!ClassSendIrpSynchronous+0x4e
fffff880`0210f4c0 fffff880`0198336b : 00000000`00000000 fffffa80`02b25ea0 fffffa80`028e3010 00000000`00000000 : CLASSPNP!ClassSendDeviceIoControlSynchronous+0xe1
fffff880`0210f520 fffff880`01987d17 : fffffa80`028e3010 00000000`00000000 fffffa80`0287c5d0 fffff880`0210f6d2 : CLASSPNP!ClasspMpdevStartDevice+0x14b
fffff880`0210f5c0 fffff800`01ff18e9 : fffffa80`028e31b8 00000000`00000000 fffffa80`0287c480 fffffa80`028e3010 : CLASSPNP!ClassMpdevPnPDispatch+0x217
fffff880`0210f610 fffff880`0103da94 : fffffa80`028e3010 fffffa80`02915ce0 fffffa80`01943040 fffffa80`01ca7378 : nt!IoForwardIrpSynchronously+0x75
fffff880`0210f670 fffff880`0104732a : 00000000`00000000 fffffa80`028e3010 fffffa80`028e3010 fffffa80`02915b90 : partmgr!PmStartDevice+0x74
fffff880`0210f740 fffff800`02088fde : fffffa80`028e3010 fffffa80`0274af90 fffffa80`02915b90 fffff880`009c6180 : partmgr!PmPnp+0x11a
fffff880`0210f790 fffff800`01dc0e7d : fffffa80`0288f2c0 fffffa80`0274af90 fffff800`01dca5a0 00000000`00000000 : nt!PnpAsynchronousCall+0xce
fffff880`0210f7d0 fffff800`02098326 : fffff800`01ece940 fffffa80`02860370 fffffa80`0274af90 fffffa80`02860518 : nt!PnpStartDevice+0x11d
fffff880`0210f890 fffff800`020985c4 : fffffa80`02860370 fffffa80`01920037 fffffa80`01929b60 00000000`00000001 : nt!PnpStartDeviceNode+0x156
fffff880`0210f920 fffff800`020bbcf6 : fffffa80`02860370 fffffa80`01929b60 00000000`00000002 00000000`00000000 : nt!PipProcessStartPhase1+0x74
fffff880`0210f950 fffff800`020bc288 : fffff800`01ecc500 00000000`00000000 00000000`00000001 fffff800`01f385e8 : nt!PipProcessDevNodeTree+0x296
fffff880`0210fbc0 fffff800`01dccee7 : 00000001`00000003 00000000`00000000 00000000`00000001 00000000`00000000 : nt!PiProcessReenumeration+0x98
fffff880`0210fc10 fffff800`01cdc641 : fffff800`01dccbc0 fffff800`01fc5501 fffffa80`018e4b00 fffff800`01e702d8 : nt!PnpDeviceActionWorker+0x327
fffff880`0210fcb0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!ExpWorkerThread+0x111


STACK_COMMAND: kb

FOLLOWUP_IP:
msdsm!DsmpUpdateTargetPortGroupEntry+2b3
fffff880`012aad27 488b01 mov rax,qword ptr [rcx]

SYMBOL_STACK_INDEX: 3

SYMBOL_NAME: msdsm!DsmpUpdateTargetPortGroupEntry+2b3

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: msdsm

IMAGE_NAME: msdsm.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4ce7a476

FAILURE_BUCKET_ID: X64_0xD1_msdsm!DsmpUpdateTargetPortGroupEntry+2b3

BUCKET_ID: X64_0xD1_msdsm!DsmpUpdateTargetPortGroupEntry+2b3

Followup: MachineOwner
---------
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Jan 16, 2013 1:13 pm

OK, it looks like we have fixed this in the new build that was just uploaded to our website.
Could you please download, install, and try it?

Thank you
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
ghamner
Posts: 5
Joined: Fri Nov 02, 2012 5:00 pm

Wed Jan 16, 2013 5:25 pm

YES! This has resolved the issue. I had updated to a new build a week or so ago in an attempt to fix it, but this latest build (5228) is what resolved it. I gave it a pretty thorough test: connected 3 paths, started a copy, and rebooted one node. Everything continued as expected on the remaining 2 paths, and the fast sync did its job when the node came back up.

Thank you very much for your support, advice and most of all patience with me in my time of need :) StarWind ROCKS and is supported by excellent people.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Jan 21, 2013 10:01 am

You made my day! Doing our best to keep you guys happy :)
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
