HA-Device doesn't support extend

hoba · Fri Feb 16, 2018 4:51 pm

Hi,

I'm using StarWind VSAN Free. In the past I already extended exactly this HA image with the PowerShell script ExtendDevice.ps1 to add another 2 TB (from 8 to 10 TB). That worked, including extending the VMware VMFS6 datastore stored on that image. Now I have added another 10 TB and want to extend the HA image from 10 to 20 TB. This time, when I run the same script (I only changed the amount of space inside the script to add another 10485760 MB), it tells me:

 Error: This device doesn't support extend.

The two nodes are in sync and everything is working as expected; I just can't do the extend. Is it possible that the first run of the extend script a few weeks ago changed some parameters, so that the device is no longer extendable? And if so, how can I change that back?
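
As a quick sanity check of the size passed to the script, here is a minimal sketch of the arithmetic. It assumes the script counts "MB" in binary units (1 TB = 1024 * 1024 MB), which matches the 10485760 figure above; the function name is only illustrative.

```python
# Assumption: the extend script counts "MB" in binary units (MiB).
MIB_PER_TIB = 1024 * 1024  # 1 TiB = 1024 GiB = 1024 * 1024 MiB


def tib_to_mib(tib: int) -> int:
    """Convert a size in TiB to the MiB value expected by the script."""
    return tib * MIB_PER_TIB


print(tib_to_mib(10))  # size for the planned 10 TB extend
print(tib_to_mib(2))   # size for the earlier 2 TB extend
```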

Any help appreciated!

Regards
Holger

Fri Feb 16, 2018 4:57 pm

What StarWind build do you use? With 11818, there may be some issues with the StarWindX library interaction.

hoba · Fri Feb 16, 2018 5:05 pm

Actually yes, it's 11818. However, the extend worked before with the same version. So would updating help?

Thanks for the quick reply!

Mon Feb 19, 2018 11:28 pm

hoba,

I would recommend creating another 10TB device instead of extending the current 10TB one to 20TB. Based on your storage characteristics, a 20TB device may take a long time to synchronize whenever it goes out of sync.

hoba · Tue Feb 20, 2018 1:34 pm

Hi Boris,

Is there a way to make it work? I disabled first one node, then the other, for about 10 hours each (see https://forums.starwindsoftware.com/vie ... f=5&t=4927) while all the VMs on the SAN were still up and running (I disabled backups and so on to keep the delta on the VMFS low), and the sync worked out pretty well: it took less than an hour. Also, if one of the 10TB image files is out of sync due to downtime of a node, the other 10TB image will be out of sync too, so it would have to sync 2x 10TB instead of 1x 20TB. Or am I wrong? The image files will be located on the same RAID10 anyway, so I don't think it makes a difference regarding performance.

I have seen that 11818 is still the most recent version. Do you know when the next version will be out and whether it fixes the issue? Or maybe you can think of some kind of workaround to make the extend possible again?

If you need further information or details on the setup, let me know.

Thanks for your great support!

Holger

hoba · Tue Feb 20, 2018 2:20 pm

I just noticed something a bit strange that might be related. When running the enum PowerShell script, there seems to be a wrong path:

Local Server:
--------------------------------------------------------------------------------
Targets:

Name : iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1
Id : 0x000001C1F70D3A40
Alias : vsandatastore1
IsClustered : True
Devices : System.__ComObject
Permissions : System.__ComObject
type :

Devices:
Name : HAImage1
DeviceType :
DeviceId : 0x000001C1F709D480
File : My computer\E\vsandatastore2\ImageFile2_HA.swdsk
TargetName : empty
TargetId : empty
Size : empty
CacheMode : wb
CacheSize : 16
CacheBlockExpiryPeriod : empty
Exists : True
DeviceLUN :
IsSnapshotsSupported : False
Snapshots :
SectorSize :
State : 1

Name : imagefile1
DeviceType : Image file
DeviceId : 0x000001C1F6A7EBC0
File : My Computer\E\vsandatastore1\vsandatastore1.img
TargetName : empty
TargetId : empty
Size : 10995116277760
CacheMode : wb
CacheSize : 16384
CacheBlockExpiryPeriod : 5000
Exists : True
DeviceLUN :
IsSnapshotsSupported : False
Snapshots :
SectorSize : 512
State : 0

Remote Server:
--------------------------------------------------------------------------------
Targets:
Name : iqn.2008-08.com.starwindsoftware:192.168.181.51-vsandatastore1
Id : 0x000002CE08DA3380
Alias : vsandatastore1
IsClustered : True
Devices : System.__ComObject
Permissions : System.__ComObject
type :

Devices:
Name : HAImage1
DeviceType :
DeviceId : 0x000002CE08DA4B00
File : My computer\e\vsandatastore2\ImageFile2_HA.swdsk
TargetName : empty
TargetId : empty
Size : empty
CacheMode : none
CacheSize : 64
CacheBlockExpiryPeriod : empty
Exists : True
DeviceLUN :
IsSnapshotsSupported : False
Snapshots :
SectorSize :
State : 1

Name : imagefile1
DeviceType : Image file
DeviceId : 0x000002CA05F95600
File : My Computer\E\vsandatastore1\vsandatastore1.img
TargetName : empty
TargetId : empty
Size : 10995116277760
CacheMode : wb
CacheSize : 16384
CacheBlockExpiryPeriod : 5000
Exists : True
DeviceLUN :
IsSnapshotsSupported : False
Snapshots :
SectorSize : 512
State : 0

All the files are located in e:\vsandatastore1\ and the swdsk files are e:\vsandatastore1\vsandatastore1_HA.swdsk. In fact, there is no vsandatastore2 on drive E. Also, there seems to be a difference in the CacheSize setting. I'm not sure whether this was the case before I ran the extend the first time. Everything else seems to work normally, though.

Maybe that misconfiguration is related? I only used the provided PowerShell scripts to work with the image files, though...
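
To spot such differences without reading the listing by eye, the "Key : value" records can be parsed and compared per device. This is only a sketch: the sample text is an abbreviated copy of the HAImage1 records above, the function names are made up for illustration, and file paths are compared case-insensitively on the assumption that they are Windows paths.

```python
def parse_devices(text: str) -> list[dict[str, str]]:
    """Parse 'Key : value' lines into one dict per device (a 'Name' line starts a device)."""
    devices: list[dict[str, str]] = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")  # split at the first colon only
        key, value = key.strip(), value.strip()
        if key == "Name":
            current = {}
            devices.append(current)
        if current is not None:
            current[key] = value
    return devices


def diff_devices(local, remote):
    """Return {device name: {field: (local value, remote value)}} for differing fields."""
    remote_by_name = {d["Name"]: d for d in remote}
    result = {}
    for dev in local:
        other = remote_by_name.get(dev["Name"])
        if other is None:
            continue
        changed = {}
        for key, value in dev.items():
            if key == "DeviceId":  # IDs are expected to differ between nodes
                continue
            other_value = other.get(key, "")
            if key == "File":
                same = value.casefold() == other_value.casefold()
            else:
                same = value == other_value
            if not same:
                changed[key] = (value, other_value)
        if changed:
            result[dev["Name"]] = changed
    return result


LOCAL = r"""
Name : HAImage1
DeviceId : 0x000001C1F709D480
File : My computer\E\vsandatastore2\ImageFile2_HA.swdsk
CacheMode : wb
CacheSize : 16
"""

REMOTE = r"""
Name : HAImage1
DeviceId : 0x000002CE08DA4B00
File : My computer\e\vsandatastore2\ImageFile2_HA.swdsk
CacheMode : none
CacheSize : 64
"""

differences = diff_devices(parse_devices(LOCAL), parse_devices(REMOTE))
for name, fields in differences.items():
    for field, (mine, theirs) in fields.items():
        print(f"{name}: {field} local={mine!r} remote={theirs!r}")
```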

hoba · Tue Feb 20, 2018 3:21 pm

Just reviewed the starwind.cfg and I think there is something wrong as well:

...
<devices>
...
<device file="My Computer\E\vsandatastore1\vsandatastore1.swdsk" node="-1" name="imagefile1"/>
<device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" file="My Computer\E\vsandatastore1\vsandatastore1_HA.swdsk" serialId="FD9D5EBB682ACEC2" asyncmode="yes" readonly="no" highavailability="yes" buffering="no" header="65536" reservation="no" CacheMode="wb" CacheSizeMB="16384" CacheBlockExpiryPeriodMS="5000" AluaNodeGroupStates="0,0" Storage="imagefile1"/>
<device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:-havsandatastore2" file="My computer\E\vsandatastore2\ImageFile2_HA.swdsk" serialId="1b06b4a4fc0e43d58b5e4e9d3c85fe" asyncmode="yes" readonly="no" highavailability="yes" buffering="no" header="65536" reservation="no" CacheMode="wb" CacheSizeMB="16" AluaNodeGroupStates="0,0" Storage="imagefile3" PoolName="vsandatastore2"/>
</devices>
...
<targets>
<target name="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" devices="HAImage1" alias="vsandatastore1" clustered="Yes" node="-1"/>
</targets>
...

This looks like there are two definitions for HAImage1: one that is correct, and another for a device that doesn't even exist anymore. I had a second datastore earlier, when I was evaluating the SAN, but it was removed a long time ago. Also, I only worked with the PowerShell scripts to set everything up, not with the trial management console.

Is it safe to shut down the services on both nodes, delete the wrong HAImage1 line from the cfg file and start the services again, or is there more to it?
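
A check like this can be scripted. The following is a rough sketch, not a StarWind tool: it uses a shortened copy of the fragment above, wrapped in a hypothetical `<config>` root element (the real root of starwind.cfg is not shown in this thread), and reports device names that occur more than once as well as `Storage` references that point to no defined device.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened copy of the fragment above; the <config> root is an assumption.
CFG_FRAGMENT = r"""<config>
<devices>
<device file="My Computer\E\vsandatastore1\vsandatastore1.swdsk" node="-1" name="imagefile1"/>
<device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" file="My Computer\E\vsandatastore1\vsandatastore1_HA.swdsk" Storage="imagefile1"/>
<device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:-havsandatastore2" file="My computer\E\vsandatastore2\ImageFile2_HA.swdsk" Storage="imagefile3"/>
</devices>
</config>"""


def duplicate_device_names(root: ET.Element) -> list[str]:
    """Device names that are defined more than once."""
    counts = Counter(d.get("name") for d in root.iter("device"))
    return sorted(name for name, count in counts.items() if count > 1)


def dangling_storage_refs(root: ET.Element) -> list[str]:
    """Storage="..." values that do not match any defined device name."""
    names = {d.get("name") for d in root.iter("device")}
    refs = {d.get("Storage") for d in root.iter("device") if d.get("Storage")}
    return sorted(refs - names)


root = ET.fromstring(CFG_FRAGMENT)
print("duplicate names:", duplicate_device_names(root))
print("dangling Storage references:", dangling_storage_refs(root))
```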

Thanks in advance,
Holger

Tue Feb 20, 2018 4:53 pm

It looks like you have a phantom entry for a non-existent device. Which build did you use before 11818?
You can stop the service on the node in question, delete the orphaned entries and start the service again. Be sure to make a backup of StarWind.cfg before you do anything with it.
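
The backup step and the search for orphaned entries could look roughly like the sketch below. It runs against a temporary demo file, not a real installation. The `<config>` root element and the rule "an HA device is orphaned if its OwnTargetName matches no `<target>`" are assumptions made for illustration. The script only reports candidates and does not rewrite the file.

```python
import shutil
import tempfile
import time
import xml.etree.ElementTree as ET
from pathlib import Path


def backup_config(cfg_path: Path) -> Path:
    """Copy the config next to itself with a timestamp suffix and return the copy's path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = cfg_path.with_name(f"{cfg_path.name}.{stamp}.bak")
    shutil.copy2(cfg_path, backup)
    return backup


def orphaned_devices(cfg_path: Path) -> list[str]:
    """Files of HA devices whose OwnTargetName is not defined as a target (heuristic)."""
    root = ET.parse(str(cfg_path)).getroot()
    target_names = {t.get("name") for t in root.iter("target")}
    return [
        d.get("file")
        for d in root.iter("device")
        if d.get("OwnTargetName") and d.get("OwnTargetName") not in target_names
    ]


# Demo data: a shortened stand-in for the config discussed in this thread.
SAMPLE = r"""<config>
<devices>
<device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" file="My Computer\E\vsandatastore1\vsandatastore1_HA.swdsk"/>
<device name="HAImage1" OwnTargetName="iqn.2008-08.com.starwindsoftware:-havsandatastore2" file="My computer\E\vsandatastore2\ImageFile2_HA.swdsk"/>
</devices>
<targets>
<target name="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" devices="HAImage1"/>
</targets>
</config>"""

cfg_path = Path(tempfile.mkdtemp()) / "StarWind.cfg"
cfg_path.write_text(SAMPLE, encoding="utf-8")

backup = backup_config(cfg_path)   # always back up first
orphans = orphaned_devices(cfg_path)
print("backup written to:", backup)
print("orphan candidates:", orphans)
```

Deleting the reported entries by hand, with the service stopped, keeps the rest of the file byte-for-byte unchanged.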

As far as synchronization and device size are concerned, here is why I advised having 2x10TB devices instead of 1x20TB. While one side of an HA device is not synchronized, that side does not accept read/write operations from the clients, so all requests are processed by one node only. This means that getting one 10TB device synchronized immediately improves the performance of that device, regardless of the remaining 10TB disk. With a single 20TB device, however, all client operations on the non-synchronized side are blocked until the whole device is fully synchronized, so one node alone has to carry the load for a considerably longer period of time. But it is up to you to decide on your architecture.
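
To put numbers on this argument, here is a small worked example. It rests on assumptions that are not confirmed in this thread: a full synchronization is needed, the devices are synchronized one after another, and the throughput is constant. The rate of 2 TB per hour is invented purely for illustration.

```python
def hours_until_synced(device_sizes_tb, sync_rate_tb_per_hour):
    """Hours after which each device is back in sync, syncing the devices sequentially."""
    elapsed = 0.0
    finished_at = []
    for size in device_sizes_tb:
        elapsed += size / sync_rate_tb_per_hour
        finished_at.append(elapsed)
    return finished_at


RATE = 2  # TB per hour, a made-up figure

two_devices = hours_until_synced([10, 10], RATE)
one_device = hours_until_synced([20], RATE)
print("2x10TB:", two_devices)  # the first device serves I/O on both nodes earlier
print("1x20TB:", one_device)
```

Under these assumptions the total time is the same, but with two devices half of the storage is served by both nodes again after half of that time. If only a delta has to be synchronized, as in the sub-hour syncs reported earlier in this thread, the difference shrinks accordingly.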

hoba · Wed Feb 21, 2018 1:52 pm

Hi Boris,

I'll try to get the starwind.cfg right during the next maintenance window. Thanks for the advice!

The version that was installed prior to 11818 was 11456 (initial install with that version).

Not sure if I'll run 2x 10TB or 1x 20TB in the end but I'll report back how everything worked out.

Btw: I love the "simplicity" and "transparency" or "human readability" of the StarWind VSAN configuration. Good job!
On another side note, can you show me an example starwind.cfg with an SMTP settings block so I can set up mail notifications on errors?

Regards
Holger

Wed Feb 21, 2018 3:38 pm

<reactions>
    <reaction maskSeverity="12" maskCode="254" type="smtp" smtpHost="smtp.domain.com" smtpPort="25" recepient="me@domain.com" mailFrom="starwind@domain.com" subj="Event notification"/>
</reactions>

Here:
Mask severity 12 corresponds to "Errors and Warnings".
Mask code 254 corresponds to "All sources".
Everything else speaks for itself.
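
The word "mask" suggests that both values are bit flags. That reading is an assumption: the thread only gives the meaning of the combined values 12 and 254, not of the individual bits. Under that assumption, a short sketch to read the block and list which bits are set:

```python
import xml.etree.ElementTree as ET

REACTIONS = """<reactions>
    <reaction maskSeverity="12" maskCode="254" type="smtp" smtpHost="smtp.domain.com" smtpPort="25" recepient="me@domain.com" mailFrom="starwind@domain.com" subj="Event notification"/>
</reactions>"""


def set_bits(mask: int) -> list[int]:
    """Return the individual bit values that are set in mask, lowest first."""
    return [1 << i for i in range(mask.bit_length()) if mask & (1 << i)]


reaction = ET.fromstring(REACTIONS).find("reaction")
severity = int(reaction.get("maskSeverity"))
code = int(reaction.get("maskCode"))

print(f"maskSeverity={severity} ({severity:#b}) -> bits {set_bits(severity)}")
print(f"maskCode={code} ({code:#b}) -> bits {set_bits(code)}")
```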

hoba · Thu Feb 22, 2018 5:39 am

Thanks again!

Thu Feb 22, 2018 11:04 am

You are welcome.

hoba · Tue May 29, 2018 11:50 am

It took some time to find a slot for the maintenance... I shut down all VMware servers that were accessing the VSAN and shut down the StarWind services on both nodes. Then I corrected the cfg files on both nodes. I also added some more heartbeat interfaces by editing the swdsk files on both nodes (I read somewhere that it is better to have several interfaces to prevent split-brain situations; I think it was somewhere in your FAQ, but I can't find it anymore).

        <storage id="3" name="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" type="remote" lun="0x0">
          <transport type="iSCSI">
            <links>
              <link id="1" type="data" priority="1" connections="1">
                <peer ip="10.254.251.1" port="3260"/>
              </link>
              <link id="2" type="data" priority="1" connections="1">
                <peer ip="10.254.250.1" port="3260"/>
              </link>
              <link id="3" type="control" priority="1" connections="1">
                <peer ip="192.168.181.50" port="3260"/>
              </link>
              <link id="4" type="control" priority="1" connections="1">
                <peer ip="10.254.250.1" port="3260"/>
              </link>
              <link id="5" type="control" priority="1" connections="1">
                <peer ip="10.254.251.1" port="3260"/>
              </link>
            </links>
          </transport>
        </storage>

I added links 4 and 5 on both nodes (the directly connected sync interfaces between the nodes, with no switch in between).

After I started the StarWind services on both nodes again, everything came up fine. I tried another extend of the device, which now ended with a message that it was unsuccessful (a different message from before, when it said the device doesn't support extend). After running the command, a sync was triggered which took some time. It didn't seem to be a full sync, though, as it was done in about 1.5 hours.

The system is up and running and all paths from the VMware servers to the VSAN are active; however, something is strange now: both nodes show status synced, but one of the nodes shows 0% and the other 100%.
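
To double-check such a hand edit, the link list can be read back and grouped by type. This is a minimal sketch that works on a copy of the fragment above pasted into a string; it does not touch the swdsk file itself, and the function name is only illustrative.

```python
import xml.etree.ElementTree as ET

STORAGE = """<storage id="3" name="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1" type="remote" lun="0x0">
  <transport type="iSCSI">
    <links>
      <link id="1" type="data" priority="1" connections="1"><peer ip="10.254.251.1" port="3260"/></link>
      <link id="2" type="data" priority="1" connections="1"><peer ip="10.254.250.1" port="3260"/></link>
      <link id="3" type="control" priority="1" connections="1"><peer ip="192.168.181.50" port="3260"/></link>
      <link id="4" type="control" priority="1" connections="1"><peer ip="10.254.250.1" port="3260"/></link>
      <link id="5" type="control" priority="1" connections="1"><peer ip="10.254.251.1" port="3260"/></link>
    </links>
  </transport>
</storage>"""


def peers_by_type(storage_xml: str) -> dict[str, list[str]]:
    """Group the peer addresses of all <link> elements by their type attribute."""
    grouped: dict[str, list[str]] = {}
    for link in ET.fromstring(storage_xml).iter("link"):
        peer = link.find("peer")
        address = f'{peer.get("ip")}:{peer.get("port")}'
        grouped.setdefault(link.get("type"), []).append(address)
    return grouped


links = peers_by_type(STORAGE)
for link_type, peers in links.items():
    print(link_type, peers)
```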

Node1:

...
highavailability="yes"
ha_serialid_string="FD9D5EBB682ACEC2"
ha_synch_status="1"
ha_synch_percent="0"
ha_synch_type="0"
ha_sync_elapsed_time="0"
ha_sync_estimated_time="0"
ha_priority="0"
ha_is_node_removed_from_partners="no"
ha_is_storage_extend_supported="yes"
ha_is_storage_snapshot_supported="no"
ha_is_storage_device_ready="yes"
ha_is_storage_device_readonly="no"
ha_is_SMISHidden="no"
ha_autosynch_enabled="yes"
ha_wait_on_autosynch="0"
ha_auto_sync_priority="1"
ha_maintenance_mode="0"
ha_sync_traffic_share="25"
ha_alua_group_node_state="0"
ha_tracker="no"
ha_tracker_frozen="no"
ha_tracker_snapshots_storage=""
ha_tracker_mount_time="0"
ha_tracker_mount_snapshot=""
ha_tracker_status="-1"
ha_tracker_pending="0"
ha_tracker_replicated="0"
ha_tracker_replicating="0"
ha_tracker_scheduled="0"
ha_node_type="1"
...
ha_partner_nodes_count="1"
ha_failover_config_type="0"
ha_partner_node1_target_name="iqn.2008-08.com.starwindsoftware:192.168.181.51-vsandatastore1"
ha_partner_node1_priority="1"
ha_partner_node1_type="1"
ha_partner_node1_storage_device_type="ImageFile"
ha_partner_node1_sync_channels="10.254.251.2$3260$1;10.254.250.2$3260$1"
ha_partner_node1_heartbeat_channels="192.168.181.51$3260$1;10.254.250.2$3260$1;10.254.251.2$3260$1"
ha_partner_node1_is_exist_sync_valid_connection="1"
ha_partner_node1_is_exist_heartbeat_valid_connection="1"
ha_partner_node1_sync_status="1"
ha_partner_node1_sync_percent="100"
ha_partner_node1_sync_type="0"
ha_partner_node1_sync_elapsed_time="0"
ha_partner_node1_sync_estimated_time="0"
ha_partner_node1_tracker_frozen="no"
ha_partner_node1_tracker_snapshots_storage=""
ha_partner_node1_tracker_mount_time="0"
ha_partner_node1_tracker_mount_snapshot=""
ha_partner_node1_auth_chap_type="None"
ha_partner_node1_auth_chap_login=""
ha_partner_node1_auth_chap_password=""
ha_partner_node1_auth_mutual_chap_name=""
ha_partner_node1_auth_mutual_chap_secret=""
...

Node 2:

...
highavailability="yes"
ha_serialid_string="FD9D5EBB682ACEC2"
ha_synch_status="1"
ha_synch_percent="100"
ha_synch_type="0"
ha_sync_elapsed_time="0"
ha_sync_estimated_time="0"
ha_priority="1"
ha_is_node_removed_from_partners="no"
ha_is_storage_extend_supported="yes"
ha_is_storage_snapshot_supported="no"
ha_is_storage_device_ready="yes"
ha_is_storage_device_readonly="no"
ha_is_SMISHidden="no"
ha_autosynch_enabled="yes"
ha_wait_on_autosynch="0"
ha_auto_sync_priority="1"
ha_maintenance_mode="0"
ha_sync_traffic_share="25"
ha_alua_group_node_state="0"
ha_tracker="no"
ha_tracker_frozen="no"
ha_tracker_snapshots_storage=""
ha_tracker_mount_time="0"
ha_tracker_mount_snapshot=""
ha_tracker_status="-1"
ha_tracker_pending="0"
ha_tracker_replicated="0"
ha_tracker_replicating="0"
ha_tracker_scheduled="0"
ha_node_type="1"
...
ha_partner_nodes_count="1"
ha_failover_config_type="0"
ha_partner_node1_target_name="iqn.2008-08.com.starwindsoftware:vsan01-vsandatastore1"
ha_partner_node1_priority="0"
ha_partner_node1_type="1"
ha_partner_node1_storage_device_type="ImageFile"
ha_partner_node1_sync_channels="10.254.251.1$3260$1;10.254.250.1$3260$1"
ha_partner_node1_heartbeat_channels="192.168.181.50$3260$1;10.254.250.1$3260$1;10.254.251.1$3260$1"
ha_partner_node1_is_exist_sync_valid_connection="1"
ha_partner_node1_is_exist_heartbeat_valid_connection="1"
ha_partner_node1_sync_status="1"
ha_partner_node1_sync_percent="0"
ha_partner_node1_sync_type="0"
ha_partner_node1_sync_elapsed_time="0"
ha_partner_node1_sync_estimated_time="0"
ha_partner_node1_tracker_frozen="no"
ha_partner_node1_tracker_snapshots_storage=""
ha_partner_node1_tracker_mount_time="0"
ha_partner_node1_tracker_mount_snapshot=""
ha_partner_node1_auth_chap_type="None"
ha_partner_node1_auth_chap_login=""
ha_partner_node1_auth_chap_password=""
ha_partner_node1_auth_mutual_chap_name=""
ha_partner_node1_auth_mutual_chap_secret=""
...

Btw, the output is from telnetting to the management interface.

Performance seems to be fine and, as I said, all paths are active. However, it doesn't feel "right" to see it this way. Any kind of advice is appreciated.

Thanks
Holger

Thu Jun 07, 2018 2:40 pm

Holger,

Based on the information I was able to get from the development team, as long as ha_synch_status equals 1, the ha_synch_percent value does not matter. This is just a "cosmetic" difference, if I may call it that.
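
For monitoring, this rule can be applied to the key="value" output shown earlier. The sketch below assumes only what is stated here, namely that ha_synch_status equal to 1 means synchronized; the meaning of other status values is not covered in this thread, so they are simply treated as "not synchronized". The function names are illustrative.

```python
import re


def parse_status(text: str) -> dict[str, str]:
    """Parse lines of the form key="value" into a dict."""
    return dict(re.findall(r'^\s*(\w+)="([^"]*)"\s*$', text, flags=re.MULTILINE))


def is_synchronized(status: dict[str, str]) -> bool:
    """Judge by ha_synch_status only; ha_synch_percent is deliberately ignored."""
    return status.get("ha_synch_status") == "1"


NODE1 = '''
ha_synch_status="1"
ha_synch_percent="0"
ha_partner_node1_sync_channels="10.254.251.2$3260$1;10.254.250.2$3260$1"
'''

NODE2 = '''
ha_synch_status="1"
ha_synch_percent="100"
'''

node1, node2 = parse_status(NODE1), parse_status(NODE2)
print("node1 synchronized:", is_synchronized(node1))
print("node2 synchronized:", is_synchronized(node2))
```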

hoba · Tue Jun 12, 2018 4:55 am

Thank you Boris, I feel better now