Some LSFS mounting issues

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

velis
Posts: 9
Joined: Mon Feb 09, 2015 12:12 pm

Fri Sep 09, 2016 7:13 pm

Using StarWind Virtual SAN Free on Windows 10 Pro. The underlying storage drives are the NAS variant (WD Red).
I have set up two targets, each with one 2 TB volume with a 512 B sector size. One has an L2 write-back cache (980 GB storage size), the other only an L1 cache (67 GB storage size).

I have noticed that *sometimes* when I restart the computer, the images take a long time to mount. The mounting time depends on the amount of data stored in the image. Today it took Virtual SAN 2 hours 15 minutes to mount the larger image. During the mounting process, the underlying storage was first at 100% usage, followed by 4-5 minutes of 100% usage of one CPU core, followed by another session of storage at 100% and a short (<2 min) CPU burst.

While the disk was at 100%, its transfer rate was a puny 1-2 MB/s, with response times below 2 ms. While the CPU was at 100%, only one core was taxed.

Question #1: Is this normal? Why would the images sometimes mount "immediately" and sometimes "the long way"? Shutdowns are always clean, and the entire system is on a UPS.
Question #2: Would using a 4KB sector size help with mounting speed?

I have been testing the entire setup with excellent results for the past 45 days. Just yesterday I decided to repurpose the original physical drive and move 100% to Virtual SAN as my storage of choice. As a result, I no longer had a 100% backup (I only back up the important stuff; the rest would have to be reinstalled in case of failure). Anyway:

Today, after the long mount, everything seemed fine. I started a file copy that would move some 40 GB off the iSCSI volume. In the middle of the operation, my client computer became unresponsive. After that, no reboot helped: the computer would always be (mostly) unresponsive. The iSCSI volume sometimes mounted, sometimes it did not. So I restarted the server.
Following another long mounting session, the volume now mounts fine at the client.

Question #3: This was a huge scare for me. Is there a reasonable explanation for this? I have archived the logs in case they might be useful and I would very much like to know what exactly caused this state.
Al (staff)
Staff
Posts: 43
Joined: Tue Jul 26, 2016 2:26 pm

Mon Sep 19, 2016 4:17 pm

Hello Velis,

Thank you for contacting us.

Answering your questions:

#1. The StarWind service took longer to stop than the Windows WaitToKillServiceTimeout value allows, which is why you can see a long mount time even after a clean shutdown (see the sketch below this list).
#2. Using a 4 KB sector size will increase the performance of your LSFS devices, but it will not have any influence on mounting speed.
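
As an illustration of #1, here is a minimal Python sketch (an assumption for demonstration purposes, not an official StarWind tool) that reads the WaitToKillServiceTimeout value from its standard location in the Windows registry. The value is interpreted in milliseconds; if the StarWind service regularly needs longer than this to flush LSFS data on shutdown, the stop is cut short and the next mount takes longer.

Code:
import winreg

# WaitToKillServiceTimeout lives under this standard Windows key.
# The value is stored as a string and interpreted in milliseconds.
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                    r"SYSTEM\CurrentControlSet\Control") as key:
    timeout_ms, _ = winreg.QueryValueEx(key, "WaitToKillServiceTimeout")
    print(f"WaitToKillServiceTimeout = {timeout_ms} ms")

Raising this value gives the service more time to stop cleanly, but please test any change on a non-production box first.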

We are working hard to improve LSFS mounting speed. We have a beta build available for testing.

Also, it would be great to see the logs from your systems. Could you please submit a case on the StarWind website and attach the logs to it?
https://www.starwindsoftware.com/support-form
velis
Posts: 9
Joined: Mon Feb 09, 2015 12:12 pm

Fri Sep 23, 2016 5:54 am

I wanted to send the logs, but the form rejects my gmail.com email address. I don't currently have another account that is not with a "general public" provider.
Al (staff)
Staff
Posts: 43
Joined: Tue Jul 26, 2016 2:26 pm

Tue Sep 27, 2016 4:52 pm

Hello Velis,

I have just written you a PM.
tftp_jcp
Posts: 2
Joined: Mon Jan 04, 2016 9:11 am

Tue Oct 11, 2016 7:47 am

I am experiencing the same issues as the user above reports. Has there been any further progress on the slow mounting issue?

We are currently using StarWind in a non-production role, but we are considering your commercial offering for a wider virtualisation project, so this issue naturally concerns me.
Al (staff)
Staff
Posts: 43
Joined: Tue Jul 26, 2016 2:26 pm

Mon Oct 17, 2016 8:17 pm

Hello Tftp_jcp,

Could you please submit a support case here?

Thank you!
Anton Chernomazov
Posts: 17
Joined: Thu Mar 17, 2011 3:23 pm
Location: Russia, Ivanovo
Contact:

Wed Dec 28, 2016 4:22 pm

I have a 1 TB LSFS target with deduplication. User data size is 770 GB, allocated data size is 356 GB, metadata size is 3.36 GB. The L1 cache is 512 MB. The files are placed on a RAID 10 array (4 SAS 10k RPM disks, 1.8 TB each).
I stopped and started the StarWind service to check the mounting time.
Mounting time was 30 minutes. That is too much for a VM datastore. With 5-6 targets like that, we would have to wait about 3 hours (!) to start the VM infrastructure!
Average CPU load was 4%. Median disk read speed was 4 MB/s; peak disk read speed (for about a minute at the end) was 460 MB/s. For a few minutes at the beginning, the CPU load was a little higher and the disk load was very low.
What should I do? Abandon LSFS? Abandon deduplication? Submit a support case?
Al (staff)
Staff
Posts: 43
Joined: Tue Jul 26, 2016 2:26 pm

Fri Dec 30, 2016 3:19 pm

Hello Anton,
LSFS device mounting time depends on the amount of data stored on the device.
As I mentioned above, our R&D department is already working on it. We expect this improvement in the next couple of builds.
If it is critical for you, I would recommend migrating the data to Image devices.
Anton Chernomazov
Posts: 17
Joined: Thu Mar 17, 2011 3:23 pm
Location: Russia, Ivanovo
Contact:

Tue Feb 14, 2017 5:33 am

I tested the new 10547 build.
LSFS mounting time was dramatically reduced to 30-40 seconds. Such timings seem acceptable to us for production use.
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Tue Feb 21, 2017 12:00 pm

Thank you for the update. Our R&D team is still working on some other improvements, so it will get even better in upcoming builds.
raphaf2001
Posts: 4
Joined: Sat Jan 05, 2019 3:19 am

Sat Jan 05, 2019 3:35 am

Clearly this has been an ongoing issue for quite some time. I have been testing the free version of the product in my private sandbox environment (VMs) so I can decide whether it is mature enough for the resiliency required in a production/professional/paid environment, and apparently it is not. I commend the developers for an interesting and potentially very useful product, but if you can't deliver a product that can withstand all the hardware and software issues that happen inherently to deployments everywhere (e.g., power outages, faulty hardware/software, etc.), you guys need to go back to the drawing board.

In my case: at one point the LSFS drive refused to mount. And, to my surprise, there was nothing I could do about it. No recovery. No option. No check. No "repair". Nothing. So, if something goes wrong with a vSAN of yours (an LSFS drive), what are we supposed to do? Open a support ticket? I'm sorry, but that's not enough. Your company needs to provide testing/recovery tools that can effectively check and repair these devices on the spot. Clients should not be required to rely on support tickets to resolve such basic issues.

I did have a snapshot of the VM I could revert to, and I am back to regular operations. I will continue testing, but this first "downtime" gave me pause about recommending the product to my institution.

Raphael. :shock:
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Tue Jan 08, 2019 7:10 pm

Raphael,

As you have pointed out, what you have described in terms of LSFS behavior is an ongoing issue, and one we pay attention to. We are constantly working on improving the performance and stability of LSFS while our customers keep finding ever more exquisite ways of making it fail. As you know, even the T-1000 could be destroyed in the end. During support cases we offer our customers internal utilities that help restore the LSFS structures.
As far as your case is concerned, it is highly likely that the service had not stopped properly before you experienced that long mounting time. If the service shutdown is not clean, then at the next service start LSFS mounts all snapshots, and that takes time depending on how many of them you have. Generally, it would be great if you could refer to a specific build when reporting a long mounting time and describe the situation that led to it.
smcclos
Posts: 2
Joined: Wed Oct 30, 2019 6:29 am

Wed Oct 30, 2019 6:59 am

I found a solution to the problem that worked for me. The .000.spvmap file was fragmented. I did the following steps (a scripted sketch of them follows below):

- Stop the StarWind service
- Defragment the *.000.spvmap file with the Sysinternals Contig tool
- Start the StarWind service

My startup time went from 4-5 minutes to less than 30 seconds.
LUN size: 1.03 TB
Disk Space: 727 GB
Version: 8.0.0.12146
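
For anyone who wants to script the steps above, here is a rough Python sketch. The service name "StarWindService", the storage folder, and the path to Contig are assumptions on my side and must be adjusted for your own setup; run it from an elevated prompt, since stopping the service and defragmenting its files both require administrator rights.

Code:
import subprocess
from pathlib import Path

SPVMAP_DIR = Path(r"D:\StarWind")   # folder holding the LSFS files (adjust)
CONTIG = r"C:\Tools\contig.exe"     # Sysinternals Contig executable (adjust)

# Stop the StarWind service before touching its files.
subprocess.run(["net", "stop", "StarWindService"], check=True)

# Defragment every *.000.spvmap file in place with Contig.
for spvmap in SPVMAP_DIR.glob("*.000.spvmap"):
    subprocess.run([CONTIG, str(spvmap)], check=True)

# Start the service again so the targets remount.
subprocess.run(["net", "start", "StarWindService"], check=True)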
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Fri Nov 01, 2019 2:27 pm

I have had a conversation with the team, and it looks like your finding has been confirmed. It can be used for its intended purpose, yet it belongs to the category of workarounds rather than officially supported tweaks.