Storage Virtualization – Part3 – Performance – Barry Whyte and Andrew Martin : IBM Storage

ORIGINALLY POSTED 22nd August 2007

Cornerstone #5 The potential to increase system performance

While this topic directly corresponds with ‘Cornerstone #5’, it also contributes to #2 (Simplification of storage management) and #4 (Increased storage utilization).

Pooling and Striping

Most enterprise level controllers, the likes of DS8000, DMX and USP, and some mid-range controllers provide a degree of array pooling, that is, a method to concatenate the capacity provided by more than one RAID array into a single pool of usable capacity. Luns are then provisioned by carving a chunk out of this pool. One of the main reasons for providing this function is to increase the performance of the provisioned luns. Storage virtualization devices intrinsically provide pooling abilities. However, they also allow the pooling of storage across multiple controller instances, and generally a much greater degree of freedom in how luns are provisioned. Most of the large vendor products will automatically decide how to provision the luns based on the RAID type or some internal controller design point, which sounds sensible. However, in some cases this may mean that luns are actually carved from only one array or a subset of one array. So how does this add performance? It doesn’t. It aids the performance of the box by sticking to what’s been judged to be best by the vendor (based on the design of the product), but it does not provide the same type of pooling that a virtualization device can provide.

So the key thing here is the heterogeneous striping that is possible once you have virtualized your SAN storage. All three approaches can provide you with this pooling or striping ability, but it all comes down to the heterogeneity of the chosen devices. I’ll cover interop and its importance later. Most virtualization devices provide similar striping abilities, to some degree. Here I will focus on an example using SVC, as it’s what I know best!

With SVC a pool is called a ‘managed disk group’. This group can contain up to 128 ‘managed disks’. Each of these managed disks is an array, so in itself is many physical spindles. The pool is internally divided into ‘extents’. The extent size is configurable, from 16MB up to 512MB (today), but it is an attribute of the pool and is fixed for a given pool. If my understanding of the Invista documentation is correct, the equivalent extent size is variable per virtual lun. An SVC cluster can support 4M (4 x 1024 x 1024) extents, so the chosen extent size does dictate the total virtualized capacity.

When you create a ‘striped’ virtual disk, you specify the pool you want to use (you can also specify sub or super-sets of the mdisks in the pool to manually control the stripe-set). Suppose you are creating a virtual disk that is large enough to stripe across 128 managed disks, and these are themselves 4-disk RAID-10 arrays. You now have a single virtual disk that could have the random read performance of 4×128 (512) physical disks, and the random write performance of 2×128 (256) physical disks. In internal tests of this kind, using fairly old DS4300 controllers with 15K RPM drives, I have measured a single virtual disk returning over 125K random read miss operations per second, and over 50K random write miss operations per second, before the response time of the disks themselves starts to head off in the usual ‘hockey stick’ manner. Now this is of course a pretty extreme example, and in general most users will stripe across maybe 4 to 8 arrays, but that’s still 4x to 8x the standard performance on random operations.

My main caveat here would be that of course the disks themselves have limits (if only they didn’t!), and as you increase the number of virtual disks being carved from a single pool, you will reach some saturation point where the workload of all the virtual disks matches up with the workload that would have been possible on the single arrays. As with any storage system, the overall peak performance potential is only as great as the sum of its parts. Start going over that point and you will suffer, unless you can provide some buffering (cache) to cope with the busy peaks.
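The arithmetic behind the extent limit and the spindle counts can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above: the function names are made up for this example, and the RAID-10 model (every disk can serve reads, only half the disks add independent write capacity) is a simplifying assumption rather than anything SVC-specific.

```python
from typing import Tuple

MIB = 1024 ** 2
TIB = 1024 ** 4
MAX_EXTENTS = 4 * 1024 * 1024  # 4M extents per cluster, as quoted in the text


def max_virtualized_capacity_bytes(extent_size_mib: int) -> int:
    """Upper bound on virtualized capacity for a given extent size."""
    return MAX_EXTENTS * extent_size_mib * MIB


def effective_spindles(arrays: int, disks_per_array: int = 4) -> Tuple[int, int]:
    """Return (read_spindles, write_spindles) for a stripe over RAID-10 arrays.

    Assumption: any disk can service a read, while each write lands on
    both halves of a mirror pair, so only half the disks count for writes.
    """
    total = arrays * disks_per_array
    return total, total // 2


if __name__ == "__main__":
    for size in (16, 512):
        tib = max_virtualized_capacity_bytes(size) // TIB
        print(f"{size}MB extents -> up to {tib} TiB virtualized")
    for arrays in (4, 8, 128):
        reads, writes = effective_spindles(arrays)
        print(f"{arrays} arrays -> {reads} read spindles, {writes} write spindles")
```

With the smallest extent size the limit works out to 64 TiB, and with the largest to 2048 TiB, which is why the extent size is worth choosing with some care when the pool is created.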

As I’ve already discussed in response to Hu’s questions and in my Over-allocation post, over-allocation is probably easier to do in a virtual lun, where you already have to store a mapping table and can just update this on the fly as new data is written to the virtual device. Otherwise, controllers need to implement a method of recording which extents have and have not yet been allocated, essentially going some way towards making it a virtualizing controller!
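To make the mapping-table point concrete, here is a minimal sketch of allocate-on-first-write. It is not how SVC or any other product implements over-allocation; the `ThinVolume` class, its method names and the simple free list are all hypothetical, chosen only to show that a device which already keeps a virtual-to-physical map gets this almost for free.

```python
class ThinVolume:
    """Toy over-allocated volume: physical extents are assigned on first write."""

    def __init__(self, virtual_extents, pool_free_extents):
        self.virtual_extents = virtual_extents  # size presented to the host
        self.free = list(pool_free_extents)     # physical extents still unused
        self.mapping = {}                       # virtual extent -> physical extent

    def write(self, virtual_extent):
        """Return the physical extent backing this write, allocating if needed."""
        if not 0 <= virtual_extent < self.virtual_extents:
            raise IndexError("write beyond end of virtual disk")
        if virtual_extent not in self.mapping:
            if not self.free:
                raise RuntimeError("pool exhausted: time to add capacity")
            # Update the mapping table on the fly, as described above.
            self.mapping[virtual_extent] = self.free.pop(0)
        return self.mapping[virtual_extent]

    def read(self, virtual_extent):
        """Return the backing extent, or None if never written (reads as zeros)."""
        return self.mapping.get(virtual_extent)


if __name__ == "__main__":
    vol = ThinVolume(virtual_extents=1000, pool_free_extents=range(4))
    vol.write(10)
    vol.write(999)
    vol.write(10)  # second write to the same extent allocates nothing new
    print(f"host sees {vol.virtual_extents} extents, {len(vol.mapping)} allocated")
```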

Latency

The specific design approach can affect the latency associated with I/O requests. So let’s look at why this is the case.

In-band appliance based:

Because all I/O requests flow through the device, the potential for the device to become a bottleneck is greater. The time taken to pass through the device is also directly added to the latency of the operation. Everyone can see that an in-band approach, when done badly, could add huge amounts of latency to I/O. Most vendors that have chosen alternative approaches generally slam the shutters down at this point and use this as their main line of attack when bad-mouthing in-band appliances. So let’s look at this a bit closer:
- In-band appliance without internal cache:
  All operations have to pass through the device to the disks, so all operations can be thought of as ‘miss’ operations in conventional terms. All operations will respond only as well as the respective back-end devices. The time taken to pass through the virtualization device will be added to ALL operations, read and write. In general this approach does not scale well, and requires very swift processing in the device itself. This is why I wonder about DataCore: as far as I am aware they do not cache, but are in-band… Worst of all they run on standard operating systems, above their HAL type layers with no direct access to the I/O hardware… running your SAN through a Windows based operating system… that scares me!
- In-band appliance with internal cache:
  As above, all operations have to pass through the device; however, all write operations are written to the internal cache and instantly completed back to the host, with no need to go to the disks. Some reads will be serviced out of the cache. So it’s only really read-misses that have any additional latency added. SVC uses this model, and internally I have benchmarked SVC code to add around 50-60us (microseconds) of latency to a read miss operation. Now even in super-fast enterprise controllers, a read miss operation will be serviced in around 5-10ms (milliseconds). I question whether an additional 0.05 milliseconds of latency would affect an application. If it does, I’d be interested in what your application is doing! Now I am simplifying slightly, as we do take what we call ‘write-miss’ measurements in-house. This is where the cache has reached its destage threshold and will start to destage to the back-end disks to allow new writes to enter from the hosts. In a well balanced and configured system, the additional work of destaging writes does not impact host writes. It’s only when the back-end disks have a problem, or cannot cope with the I/O rate being driven from SVC, that host I/O may be impacted as the cache cannot destage quickly enough. However, this is no different from a non-virtualized SAN where a host is over-driving its allocated storage. It’s time to buy some more disks, or allocate some more to that host. The key here is the ability to monitor both the virtual disk and back-end disk performance. SVC provides industry standard XML based statistics. Storage management software such as IBM’s TPC can manipulate this data to monitor and alert on problems. We even provide a ‘cookbook’ that details how to interpret the raw data, so users can build their own scripts or tools to monitor their systems if they do not have TPC.
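To put the read-miss numbers quoted above in proportion, the relative overhead can be worked out directly. The helper name below is made up for this illustration; the inputs are simply the 50-60us and 5-10ms figures from the text.

```python
def added_latency_fraction(added_us: float, backend_ms: float) -> float:
    """Added in-band latency as a fraction of the back-end read-miss time."""
    return (added_us / 1000.0) / backend_ms


if __name__ == "__main__":
    for added_us in (50, 60):
        for backend_ms in (5, 10):
            pct = added_latency_fraction(added_us, backend_ms) * 100
            print(f"{added_us}us on a {backend_ms}ms read miss = {pct:.1f}% extra")
```

Even the least favourable pairing of those figures (60us added to a 5ms miss) comes out at roughly one percent.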

Controller based:

Here again the I/O passes through the device; however, the control logic may be hardware based, and in the USP case it’s more like a switch, where the I/O is very quickly re-driven to the externally attached ports through the switch ports. Essentially, additional latency will be minimal, in the order of nanoseconds or microseconds. Interestingly though, while USP does contain a cache, they don’t yet support caching on externally attached storage.

Switch based split-path:

Here the I/Os are redirected at line speed, so there is very little additional latency: just normal switching latency, which is business as usual. However, because of the same limitations in the line card hardware that I discussed in part2, caching is very difficult to implement without additional hardware.

So why a cache?

I guess one question is why do you need a cache? Some vendors will claim there is no need for a cache when you have large caches in your enterprise controllers. While this may be partly true, not all of your storage sits behind enterprise controllers with huge caches, and not all customers have them.

We have proved again and again that adding SVC above most mid-range controllers, which generally have a smaller cache than SVC, can improve performance by a noticeable amount. It’s difficult to quantify, as it will depend on the cache hit ratio across the virtual environment, but something in the order of 10-20% could be expected. There is also evidence to show that even enterprise controllers with large caches are not impacted adversely by SVC, as the actual read or write operations will generally ‘hit’ the cache in the controller just as they would do without SVC fronting them. That is, SVC does not change the caching characteristics of a controller beneath it. It should be thought of more as now having multiple levels of cache, like CPUs do these days. SVC is your L1 and the controller is your L2. I’ve seen no evidence that the SVC cache is hampering underlying controller cache algorithms. Usually it’s the other way around: SVC is sending too much I/O to the back-end controllers. We recently had to add a feedback path to catch slow responding controllers and ramp down the destage rate, especially with the power available to the latest 8G4 nodes. SVC does provide the usual sequential detect / pre-fetch algorithms, and again this doubles up with the same type of algorithms in the controllers, allowing the very high sequential throughput rates benchmarked by SPC-2.
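The L1/L2 analogy can be expressed as a simple expected-latency model. This is a back-of-envelope sketch, not a description of SVC internals: the hit ratios and the two cache-hit service times are hypothetical placeholders, and only the in-band overhead (midpoint of 50-60us) and the disk miss time (midpoint of 5-10ms) are taken from figures quoted earlier in this post.

```python
def expected_read_latency_ms(
    svc_hit: float,                 # fraction of reads served from the upper cache
    ctrl_hit: float,                # fraction of the remainder served by the controller cache
    svc_hit_ms: float = 0.1,        # hypothetical upper-cache hit time
    svc_overhead_ms: float = 0.055, # in-band overhead added to a read miss
    ctrl_hit_ms: float = 0.5,       # hypothetical controller-cache hit time
    disk_ms: float = 7.5,           # read miss serviced from disk
) -> float:
    """Average read latency with two cache levels in series."""
    miss_path = svc_overhead_ms + ctrl_hit * ctrl_hit_ms + (1 - ctrl_hit) * disk_ms
    return svc_hit * svc_hit_ms + (1 - svc_hit) * miss_path


if __name__ == "__main__":
    for svc_hit in (0.0, 0.2, 0.4):
        avg = expected_read_latency_ms(svc_hit=svc_hit, ctrl_hit=0.5)
        print(f"upper-cache hit ratio {svc_hit:.0%}: average read {avg:.2f} ms")
```

In this model any reads that hit the upper cache skip the slower path entirely, while reads that miss it still benefit from the controller cache underneath, which is the sense in which the two levels add up rather than compete.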

It can also be seen from the above discussion that an in-band approach does benefit from having a cache, both for improving performance and for reducing latency. Devices without a cache, however, have to rely solely on striping abilities to enhance the performance of the existing storage systems.

As I discussed in part2, once you have fully virtualized your storage you require Copy services in the virtualization device. If you have a cache in the same device, while you do need to temporarily flush the cache to prepare to take the snap, once the point in time has been triggered there is no impact to source write operations. The cache stops the additional latency that would otherwise have been introduced by the read / merge / write operations.
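For readers unfamiliar with where that extra latency comes from, here is a generic copy-on-write sketch. It is an assumption that this is the kind of work the ‘read / merge / write’ above refers to, and it is not SVC’s actual copy services code; the `CowSnapshot` class and the whole-grain granularity are hypothetical. The point is the extra step on the first write to each grain after the point in time, which a cache can absorb instead of making the host wait.

```python
class CowSnapshot:
    """Toy point-in-time copy: old data is preserved on first overwrite."""

    def __init__(self, source):
        self.source = source  # list of grains; host writes modify it in place
        self.saved = {}       # grain index -> contents at the point in time

    def write_source(self, grain, data):
        """Host write to the source volume after the snapshot was triggered."""
        if grain not in self.saved:
            # Extra work on the first write: read the old grain and
            # store it for the snapshot before overwriting it.
            self.saved[grain] = self.source[grain]
        self.source[grain] = data

    def read_snapshot(self, grain):
        """Read the point-in-time image: saved copy if changed, else the source."""
        return self.saved.get(grain, self.source[grain])


if __name__ == "__main__":
    disk = ["a", "b", "c"]
    snap = CowSnapshot(disk)
    snap.write_source(1, "B")
    print("source:", disk)
    print("snapshot:", [snap.read_snapshot(i) for i in range(len(disk))])
```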

One further thought with respect to online data migration. Migration services may be acting upon data, that is, in the middle of copying or moving extents, when a new write comes in for blocks on the in-flight extent(s). Without caching, the new write will be stalled until the complete extent has been migrated.

In summary

In this part I’ve covered pooling, striping and caching. I am aware that I’ve had more of an SVC slant in this part of the discussion, and it was intentional. I wanted to show why not all in-band implementations are afflicted by additional latency. Yes, SVC does add a few microseconds of latency to some operations, but in the bigger scheme of things the cache negates most of this and can provide additional performance gains over and above striping. The other advantages that come with sitting in the data path, and so having visibility of all the data that flows through the device, outweigh the minimal additional latency on read-miss operations.