ORIGINALLY POSTED 13th November 2007
8,111 views on developerworks
Last week was a long one. The main SVC test-lab in Hursley suffers from the same problems that haunt most machine rooms – lack of power and not enough cooling. That’s not to say that we don’t have a lot of both, just that as the support and test requirements have increased over the last five years, so has the lab footprint. We are now at the limit of the power and chilled water supply in the main building, so have been migrating some stands over to a second lab in another building. 18 months ago we moved my performance stand, which was no small task in itself, especially with several p690 based ‘Regatta’ servers. This time it’s what we call the ‘1024 host stand’ – used to test maximum fabric connectivity etc. After several days of fibre cable detangling, KVM decabling and the removal of the ‘top-hats’ in the 19″ racks (the doors between our lab and Goods In are too low for the racks to be moved wholesale!), I was beat. The old jokes of ‘programmers hands’ and ‘a bit of hard work’ spring to mind… and we’ve got the joy of putting it all back together again over the next few weeks!
Anyway, I’ve been switching off when I get home and avoiding the laptop, so here is the overdue second part of the 4.2.1 technical details discussion.
Let’s start with the simple one, addressability. Users upgrading to v4.2.1 will be able to attach up to 8PB of managed storage, 4x the previous limit. Note also that this is a true binary PetaByte. The addressability of an individual SVC cluster is related to the chosen ‘extent’ sizes across its Managed Disk Groups. Each Mdisk Group is assigned an extent size when it is created; this is essentially the size of the building blocks of storage used to create Virtual Disks. SVC internally maintains 4M extents (4x1024x1024) in its virtualization map, so the total addressable space is the number of extents multiplied by the extent size. Therefore choosing a smaller extent size, such as 16MB, will result in a smaller addressable space than choosing the new 1024MB or 2048MB extents.
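To make the arithmetic concrete, here is a minimal sketch of the relationship between extent size and addressable capacity. It assumes, purely for simplicity, that every Mdisk Group in the cluster uses the same extent size; the function name and constants are my own for illustration and are not part of any SVC interface.

```python
# Rough sketch: addressable capacity = number of extents x extent size.
# Assumption: all Managed Disk Groups share one extent size.

MIB = 1024 ** 2
TIB = 1024 ** 4
MAX_EXTENTS = 4 * 1024 * 1024  # the 4M extents held in the virtualization map


def addressable_bytes(extent_size_mib: int) -> int:
    """Return the maximum addressable capacity in bytes for one extent size."""
    return MAX_EXTENTS * extent_size_mib * MIB


for size in (16, 512, 2048):
    tib = addressable_bytes(size) // TIB
    print(f"{size:>5} MiB extents -> {tib:,} TiB")
```

With 16MB extents this works out to 64 TiB, while 2048MB extents give 8,192 TiB, which is the binary 8PB figure quoted above.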
Sharing of available resources is always a good idea; it’s one of the cornerstones of Storage Virtualization. With any shared resource, however, you run the risk of one or more greedy users taking more than their fair share of the resource, usually to the detriment of other users.
This situation can arise in a shared storage environment. This time it’s a greedy host or greedy storage controller that can cause the problem. In the former case, one greedy host can run away with the available SAN bandwidth and while it may see good response times and throughput, a smaller, less hungry host will be fighting for its share of the available resources. The less greedy host may suffer from a lack of throughput and/or increased response times. The latter case is actually introduced by sharing, pooling or striping of disk resources. All is fine if the shared resources are equal, but when one is slower or faster than another, an interesting inverse law applies. Let’s take a look at SVC as an example.
SVC’s cache was designed to work on the boundary of an IO Group (a pair of nodes and a collection of Virtual Disks serviced by those nodes). All writes are mirrored between the two nodes before completion is sent back to the host. The IO Group itself provides a nice isolation boundary between different applications and workloads, and can be used to solve the problem of a ‘greedy host’ – especially competing overnight batch workloads. This is one of the main reasons why we didn’t make the cache N-way across all nodes in a cluster, that is, a single globally shared cache, as is the case with HDS’s USP design. Despite Hu Yoshida telling us otherwise, a single global cache will suffer more from this problem – unless some level of logical separation is available. (This is probably one of the main reasons why they don’t support/recommend using cache on external storage.)
However cache is actually more influenced by ‘greedy disks’. What is a ‘greedy disk’ though? Here I actually mean a slow disk or controller, at least slower than others in use by the same cache. Imagine you have one controller that is getting more than its fair share of I/O requests. This controller happens to also be a great deal slower than others. You have a single shared cache that is over-arching all of these controllers. The cache will do a great job of buffering the I/O requests for the slow controller – up to a point (i.e. it gets full)! Ultimately the cache is at the mercy of the destage rates it can sustain to the actual disks / controller. In the situation where the I/O load is greater than can be sustained by the controller, or the controller suffers a temporary problem (loses its caching ability through some hardware or software failure) then you could end up caching all this I/O – leaving little or no cache space for the non-greedy controllers.
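The effect is easy to see in a toy model. The sketch below is not SVC code and makes several assumptions of my own: time advances in ticks, each controller has a fixed write arrival rate and a fixed destage rate, and when the cache cannot hold all incoming writes the free space is shared in proportion to demand. All names and numbers are hypothetical.

```python
# Toy model of one write cache shared by a slow and a fast controller.
# Assumptions: fixed per-tick rates, free space shared in proportion to demand.

def simulate_shared_cache(capacity, arrival, destage, ticks):
    """Return (dirty data per controller, writes admitted in the last tick)."""
    dirty = {name: 0.0 for name in arrival}
    admitted = {name: 0.0 for name in arrival}
    demand = sum(arrival.values())
    for _ in range(ticks):
        # Each controller destages as much dirty data as it can sustain.
        for name in dirty:
            dirty[name] -= min(dirty[name], destage[name])
        # New writes are admitted only into whatever space is now free.
        free = max(0.0, capacity - sum(dirty.values()))
        scale = min(1.0, free / demand) if demand else 0.0
        for name in dirty:
            admitted[name] = arrival[name] * scale
            dirty[name] += admitted[name]
    return dirty, admitted


dirty, admitted = simulate_shared_cache(
    capacity=1000,
    arrival={"slow": 300, "fast": 100},   # writes offered per tick
    destage={"slow": 50, "fast": 200},    # writes the back end can absorb per tick
    ticks=50,
)
print({name: round(value, 1) for name, value in dirty.items()})
print({name: round(value, 1) for name, value in admitted.items()})
```

In this model the cache ends up almost entirely full of dirty data for the slow controller, and the fast controller, which could happily absorb 200 units per tick, is only able to get a small fraction of its 100 units per tick into the cache.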
So SVC has always provided “physical partitioning” or IO Group cache separation and now v4.2.1 sees the addition of “logical partitioning” within each IO Group, commonly known as Cache Partitioning. The logical partition space is monitored and controlled at the Managed Disk Group level.
The configuration guidelines for Managed Disk Groups recommend that only like disks should be grouped together. That is, the same RAID, RPM of spindles and generally controller type and/or class. In essence a Managed Disk Group belongs to a Tier. Performing Cache Partitioning at this logical Tier level therefore makes most sense. The code internally sets limits on how much cache space can be consumed by the set of Virtual disks, by virtue of their association with a given Managed Disk Group. For example, with more than 4 Managed Disk Groups defined, no one group can consume more than 30% of the available cache resources. It should be noted that only write data is policed. Read cache data is unaffected by this new feature; primarily because read data cannot suffer from the same downstream destage problems so does not need to be limited. In general processing terms this does not impact everyday performance, other than when a partition is being “limited” – that is, when new data does not enter the cache until some dirty write data has been destaged to disk. However, if this is happening it suggests that the workload being driven to those virtual disks exceeds the capability of the back-end storage being used.
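As an illustration of the idea only, the sketch below polices dirty write data per Managed Disk Group against a percentage of the total cache. The class, its method names and the "return False means the write must wait" convention are all hypothetical; the only figure taken from the text is the 30% limit that applies when more than 4 groups are defined, and the real limits and accounting inside SVC will differ.

```python
# Sketch of per-group write cache limiting (hypothetical names throughout).
# Only dirty write data is tracked; read cache data is not policed.

class PartitionedWriteCache:
    def __init__(self, capacity, groups, limit_percent):
        self.capacity = capacity
        # Upper bound on dirty data any single group may hold.
        self.limit = capacity * limit_percent // 100
        self.dirty = {group: 0 for group in groups}

    def try_admit(self, group, size):
        """Admit a write if the group and the cache both have room."""
        if self.dirty[group] + size > self.limit:
            return False  # partition is "limited": wait for destage
        if sum(self.dirty.values()) + size > self.capacity:
            return False  # cache as a whole is full
        self.dirty[group] += size
        return True

    def destaged(self, group, size):
        """Record that dirty data for a group has been written to disk."""
        self.dirty[group] = max(0, self.dirty[group] - size)


# Five groups, so the 30% limit from the text applies.
cache = PartitionedWriteCache(1000, ["g1", "g2", "g3", "g4", "g5"], 30)
results = [cache.try_admit("g1", 100) for _ in range(4)]
print(results)                       # the fourth write to g1 is held back
print(cache.try_admit("g2", 100))    # other groups are unaffected
cache.destaged("g1", 100)
print(cache.try_admit("g1", 100))    # room again once g1 has destaged
```

The point of the design is that an over-driven group only ever delays its own writes; the remaining cache space stays available to the groups whose back-end storage is keeping up.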
This partitioning scheme adds to the protection and performance characteristics of the SVC cache. Remember, it’s only when something has gone wrong on the storage controller, or a controller is being over-driven, that a partition will be limited; both are situations that are best avoided in general but do sometimes occur.
In testing this feature I set up numerous configurations with ‘good’ and ‘bad’ controllers. The results are very noticeable in cases where one controller suffered from a serious failure (disabling controller based caching on a 15+P array that was missing one of its member disks – a double failure really). I was driving this array at or over its capabilities even before the failure itself and continued to attempt to submit this same rate after the failure. In such cases, comparing with non-partitioned code, there was more than a 2x improvement in total IO Group throughput.
One final positive performance gain came from enhancing the destage algorithms that are used to decide what to destage from the cache. The new algorithms will spot that a controller is being over-driven far quicker than was previously possible. Under normal operation a heavy write workload will also be detected quicker, and the cache can destage for a given partition even if the cache as a whole is not over the normal destage thresholds. Here I measured a small but noticeable improvement in I/O response time under what I refer to as ‘write miss’ benchmarking (writes to cache at such a rate as to require continuous destage to the backend disks).
In the third and final part of this series I plan to summarise the smaller DCRs (Design Change Requests) that made it into the 4.2.1 release – those that are of interest to users anyway. I aim to have this up by Friday when SVC 4.2.1 code will be available for download.