I'm just throwing this out to see what everyone thinks; I'm sure that other people will have other things that are important to them.
I think most of these are relatively simple (that still doesn't mean they will get done quickly - there may be higher priorities), but the stuff in section 5) is probably quite major and may need to wait until version 6 or beyond.
cheers,
Aitor
1) High Availability features
- if partner target lost completely, ability to rebuild from remaining partner without taking the target offline
- turn a non-ha img target into an HA target - without taking the target offline
- fast sync for targets using WB cache
- extend HA devices without having to recreate them or take them offline
- use MPIO for sync channel, for more redundancy and performance
2)
- disconnect an initiator without deleting a target
- rename a target without deleting it (with warning that initiators will be disconnected)
- change cache policy and amount of RAM allocated without deleting target or disconnecting initiators
- default window pane sizing - top right pane height shrinks to fit, so you see more of bottom right pane by default. Same with device tab, so devices sub pane shrinks to fit, you see more of device properties
- live monitoring of:
  - read, write, total bandwidth, per device and per HA partner, and per initiator
  - read IOPS, write IOPS, total IOPS, per device and per HA partner
  - cache hit ratio (% of reads that are serviced by WT/WB cache), with a reset button
  - graphical representation of cache. A colour-coded horizontal bar, showing proportions of cache that are:
    - written to cache, but not yet to disk (WB only)
    - written to disk
    - written to disk and have been read again
    - written to cache, but not yet to disk, and have been read again (WB only)
    - read from disk, but not yet read again
    - read from disk, and read again
    - unused / expired
  - HA resync - next to bar, show estimate of remaining time for rebuild to complete
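To make the hit ratio and resync estimate concrete, here is a rough sketch of the arithmetic I have in mind. The class and function names are made up for illustration; this is not anything StarWind exposes, just how I'd expect the numbers to be derived.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Hypothetical per-device read counters (not a StarWind API)."""
    read_hits: int = 0    # reads serviced by the WT/WB cache
    read_misses: int = 0  # reads that had to go to disk

    def record_read(self, hit: bool) -> None:
        if hit:
            self.read_hits += 1
        else:
            self.read_misses += 1

    def hit_ratio_percent(self) -> float:
        total = self.read_hits + self.read_misses
        # Report 0.0 instead of dividing by zero right after a reset.
        return 100.0 * self.read_hits / total if total else 0.0

    def reset(self) -> None:
        # What the "reset button" would do.
        self.read_hits = 0
        self.read_misses = 0


def resync_eta_seconds(bytes_remaining: int, bytes_per_second: float) -> float:
    """Naive remaining-time estimate: data left divided by current sync rate."""
    if bytes_per_second <= 0:
        return float("inf")  # sync stalled, no meaningful estimate
    return bytes_remaining / bytes_per_second
```

A real implementation would probably average the sync rate over a sliding window so the estimate doesn't jump around.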
3)
- write important / critical events to Windows event log
- archive / compress / delete old starwind logs
- Windows performance counters for each target/device with info from 2)
4)
- storage event capture. Basically, provide a simple command line exe that takes a string parameter; when that exe is run, the string appears in a general alert for the Starwind server. This way, a user can use Windows scheduled tasks that are triggered by events recorded by their RAID cards etc., and these can be bubbled up to Starwind, so that the administrator is aware of them. In the UI, the user can acknowledge each alert to make it go away (do this like Windows Home Server does Network Health).
- a special version of above could be used for UPS alerts; on receiving the alert, Starwind could turn off WB caching on all targets
- status dump - an xml file which contains the status of everything available in the UI, updated at regular intervals (frequency set by user) for all servers being managed. This can then be queried by user's applications.
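For the status dump, something like the following is what I imagine. The element and attribute names (`starwind_status`, `server`, `target`, `state`, `cache`) and the state values are purely my own invention to show the idea; the real file could look completely different.

```python
import xml.etree.ElementTree as ET


def build_status_dump(servers):
    """Serialise a list of server dicts into a (hypothetical) status XML."""
    root = ET.Element("starwind_status")
    for server in servers:
        server_el = ET.SubElement(root, "server", name=server["name"])
        for target in server["targets"]:
            ET.SubElement(
                server_el,
                "target",
                name=target["name"],
                state=target["state"],
                cache=target["cache"],
            )
    return ET.tostring(root, encoding="unicode")


def targets_not_in_state(xml_text, healthy_state="synchronized"):
    """What a user's own script might do: list targets needing attention."""
    root = ET.fromstring(xml_text)
    return [
        t.get("name")
        for t in root.iter("target")
        if t.get("state") != healthy_state
    ]
```

A user's monitoring script would then just re-read the file at whatever interval the dump is refreshed and alert on anything returned by the query.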
5)
- ability to define a target as being a cache of another. So, RAM would be first level cache, and this on-disk cache (ideally a RAID 0 or 10 of SSDs) would be second level cache. Could act as Write Back or Write Through.
- block level dedupe within targets, so identical blocks are stored once and are more likely to be in 1st or 2nd level cache. Personally I would prevent over-provisioning of the saved space; the objective would be just to improve performance
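Just to illustrate what I mean by block level dedupe, here is a toy in-memory version. The 4096-byte block size, the use of SHA-256, and the `DedupStore` name are all my assumptions for the example, not anything Starwind does today.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class DedupStore:
    """Toy content-addressed block store: identical blocks are kept once."""

    def __init__(self):
        self.blocks = {}  # digest -> block contents (stored once)
        self.index = {}   # logical block address -> digest

    def write(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        digest = hashlib.sha256(data).digest()
        # Only store the payload if this content has not been seen before.
        self.blocks.setdefault(digest, data)
        self.index[lba] = digest

    def read(self, lba):
        return self.blocks[self.index[lba]]

    def unique_blocks(self):
        return len(self.blocks)
```

This sketch never frees a block once nothing points at it any more; a real implementation would need reference counting. Since the logical capacity isn't over-provisioned, the only gain is that a cache keyed on the digest holds each distinct block once.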
6) CRC / checksum support, verification
- MS initiator supports a cyclic redundancy check on iSCSI data, but Starwind does not support this - you get an initiator error if you try to connect with it enabled. It would be nice to see this feature for really important data, as a per-target option. Intel put special instructions into the Nehalem Xeons to help speed this up, so maybe the performance impact won't be too big.
- not sure if the CRC feature is enough, but another thing for really important data would be verification of every write, to detect unrecoverable read errors on the actual disks before the data is read back again later and it's too late to do anything about it. Without this you can get silent data corruption. This is an increasing problem with ever larger capacity drives - the bit error rate is not improving at the same pace as capacity, so there is a greater risk of having some bad data on your disk. It's particularly a problem with cheap 1-2TB SATA drives - I even had this issue with a WD "enterprise" 2TB drive. I'm not sure if RAID (mirroring or parity) can help with this (certainly the better SAS drives have bit error rates that are orders of magnitude better), or if it's something the filesystem should be responsible for, but for really critical targets it would be great to have Starwind verify that the data that just got written to the disk is indeed the correct data, even if it kills performance. This should be optional of course!
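To show the kind of check I'm talking about, here is a small sketch. The iSCSI digests use CRC32C (the Castagnoli polynomial), which is also what the SSE4.2 CRC32 instruction on Nehalem computes; below is a plain software version plus a hypothetical `verified_write` helper that writes, flushes, reads back and compares checksums. The helper is my own illustration, not how Starwind works.

```python
import os


def _make_crc32c_table():
    # Reflected form of the Castagnoli polynomial used by CRC32C.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Table-driven software CRC32C."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def verified_write(path, offset, data):
    """Write data, flush it, read it back and compare checksums."""
    expected = crc32c(data)
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
        f.seek(offset)
        read_back = f.read(len(data))
    return crc32c(read_back) == expected
```

One big caveat: a read straight after a write will usually be served from the OS or controller cache rather than the platters, so a real verify would have to bypass those caches to actually catch a bad sector.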