SVC 5.1 GA – Zero Detection and Cache Tweaking – Barry Whyte and Andrew Martin : IBM Storage

ORIGINALLY POSTED 10th November 2009

The kids very kindly passed on their latest ‘bring-home’ from school – a stinking cold! Been feeling pretty rotten since I got back from SNWE in Frankfurt a couple of weeks back – so hopefully I didn’t pass it on to anyone I met there! Was a great couple of days, thanks to everyone that came along for my keynote, was good to see so many people interested, and I only noticed one person who’d fallen, or was falling asleep 🙂

Last Friday SVC 5.1.0 was made generally available – you can download the code now from the support web pages, and any new nodes shipped from now on, either 8A4 or CF8 come pre-installed with the 5.1.0 code. This has been a real labour of love for the worldwide SVC development and test teams and I pass on my thanks and congratulations to everyone involved in what has been probably the biggest point release since 1.1.0 itself.

The interest in the new SVC+SSD node hardware has been amazing since we announced this at the beginning of the month. I covered a lot of this in my last few posts, but here are some pictures – taken from the online service procedures.

Fig 1. (above) The new detachable display panel
Fig 2. (below) The detached display panel, with 4 blanks for SSD
Fig 3. (below) The planar, with 8Gbit FC card (top) and optional SAS card (bottom)
Fig 4. (below) Pre-GA STEC ZeusIOPs 146GB SSD

Anyway, now it’s time to look at the enhancements to the SVC 5.1 software itself.

Space-efficient Enhancements – Zero Detection

The code has evolved over the six and a bit years since initial release, and with the 24GB of cache space and increases in virtual memory addressability we had to move to a 64-bit kernel base. This does however mean that for the first time ever we’ve had to limit the existing node hardware that can be upgraded to 5.1, namely the original 2003-2005 4F2 node hardware, which contains 32-bit Xeon processors. These will be going end of service in December 2010 anyway, but we will continue to support the 4.3.1 code base on these nodes until then. If you want to upgrade to gain iSCSI support, the cache code enhancements or the copy services updates in 5.1 then you will need to look to upgrade your hardware too. If you are in this position, and don’t necessarily need the performance gains that the CF8 nodes provide, then maybe the Entry Edition 8A4 nodes are worth a look. They were released this time last year and provide about 2x the performance of the original 4F2 nodes (plus twice the cache and reductions in latency due to 4Gbit FC). For those of you looking to upgrade both the hardware and the software at the same time, that is moving from 4F2 to CF8 hardware non-disruptively, there is a procedure on the way by which you can start the upgrade to 5.1 and replace each node in turn during the software upgrade. (This has to be done as the 4F2 nodes won’t run 5.1 and the CF8 nodes can only run 5.1 – catch-22.)

The 64-bit move does however enable us to make use of the Intel SSE features of all the 64-bit Xeons. SSE (Streaming SIMD Extensions, where SIMD = Single Instruction Multiple Data) allows us to perform ‘offload style’ functions, like the Zero Detection for Space-efficient Vdisks. By writing this function in direct hardware manipulation mode, i.e. Intel Assembler, we can provide a macro-style, very high throughput offload engine. We’ve measured more than 24GB/s memory bandwidth on each CF8 node, that’s memory to CPU and back. Given that the 8Gbit fibre channel ports per node allow for 6.4GB/s, that’s plenty of memory bandwidth to enable such functions, as well as all the other SVC advanced functions. An interesting aside on the Zero Detect function is that for all-zero blocks, we can double write bandwidth. SVC mirrors its write cache between nodes in an I/O group, so once we detect and mark an I/O block as all zeros (up to a contiguous 32KB cache track – actual I/O sizes can be larger or smaller), the whole SVC code stack knows it’s a zero block. Thus we don’t need to mirror the data to the partner, we just pass a reference, hence we don’t consume bandwidth for the cache mirror traffic. This makes, for example, an Oracle DB format look more like a read hit test, and we can achieve maximal data throughput rates.
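To make the mirroring saving concrete, here is a minimal Python sketch of the idea. It is not the SVC code (which, as described above, is hand-written assembler using SSE); the function names and the `("ZERO_REF", length)` marker are made up for illustration, and only the 32KB track size comes from the text.

```python
TRACK_SIZE = 32 * 1024  # 32KB cache track, as mentioned in the post


def is_zero_block(buf: bytes) -> bool:
    """Return True if every byte in buf is zero."""
    # bytes.count runs in C, so this is a reasonable stand-in for a
    # SIMD scan in a sketch like this one.
    return buf.count(0) == len(buf)


def mirror_payload(track: bytes):
    """Hypothetical: what one node sends to its partner for one track."""
    if is_zero_block(track):
        # Only a small reference is passed, not the data itself.
        return ("ZERO_REF", len(track))
    return ("DATA", track)


def mirror_bytes_sent(tracks) -> int:
    """Count the data bytes that would actually cross the mirror link."""
    total = 0
    for track in tracks:
        kind, body = mirror_payload(track)
        if kind == "DATA":
            total += len(body)
    return total


if __name__ == "__main__":
    tracks = [bytes(TRACK_SIZE), b"\x01" * TRACK_SIZE, bytes(TRACK_SIZE)]
    print(mirror_bytes_sent(tracks), "of", len(tracks) * TRACK_SIZE, "bytes mirrored")
```

In this toy model a workload made up entirely of zero tracks sends no track data across the mirror link at all, which is the effect the paragraph above describes for something like a database format.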

Zero Detection is implemented in two places: first as a way to go from ‘thick’ to ‘thin’, or fully-allocated to Space-efficient, using Vdisk Mirroring, and secondly to ‘keep thin’ once you have a Space-efficient Vdisk, by not writing zero blocks even if the using system does. The latter function is only available on the new CF8 hardware, as every incoming write to a Space-efficient Vdisk is scanned by the Zero Detect macro. While we have made gains of up to 145% throughput and halving of response time on CF8 writes (when compared to 8G4), the impact of scanning writes to Space-efficient Vdisks is not noticeable. On older node hardware there was more of a noticeable impact, and the last thing we’d want to do is change your known I/O performance (for the worse) when upgrading, so the decision was taken to only provide the latter mode of Zero Detect on the new Nehalem based Xeon nodes.
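The two uses of Zero Detection can be sketched with a toy thin volume in Python. Everything here is an assumption for illustration: the `ThinVolume` class, the 32KB grain size, and the choice not to reclaim a grain that is later overwritten with zeros (the post does not say how SVC handles that case).

```python
GRAIN = 32 * 1024  # illustrative allocation unit, not an SVC constant


class ThinVolume:
    """Toy space-efficient volume: only non-zero grains take up space."""

    def __init__(self):
        self.grains = {}  # grain index -> bytes, allocated grains only

    def write(self, index: int, data: bytes) -> None:
        # 'Keep thin': a zero write to an unallocated grain is dropped.
        if not any(data) and index not in self.grains:
            return
        self.grains[index] = bytes(data)

    def read(self, index: int) -> bytes:
        # Unallocated grains read back as zeros.
        return self.grains.get(index, bytes(GRAIN))

    def allocated_bytes(self) -> int:
        return sum(len(g) for g in self.grains.values())


def thick_to_thin(thick: bytes) -> ThinVolume:
    """'Thick to thin': copy a fully-allocated image, skipping zero grains."""
    vol = ThinVolume()
    for offset in range(0, len(thick), GRAIN):
        vol.write(offset // GRAIN, thick[offset:offset + GRAIN])
    return vol


if __name__ == "__main__":
    image = bytes(GRAIN) + b"\xff" * GRAIN + bytes(2 * GRAIN)
    vol = thick_to_thin(image)
    print(vol.allocated_bytes(), "bytes allocated for a", len(image), "byte image")
```

The copy loop plays the role that Vdisk Mirroring plays in the real product: the data is copied onto a thin target, and the zero scan decides which grains never need to be allocated.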

Cache Algorithm Enhancements

As you can imagine we have various internal tools for live monitoring of the internal structures, data flow etc inside SVC in the lab. I seem to spend a lot of time staring at one or two pages in particular – the cache overall state. Since I took over the role as performance architect for SVC, the cache (as you can imagine) is one of the key areas where we can optimise SVC’s performance. Given that SVC is essentially at the mercy of the performance of the storage you put behind it, and it’s consolidating I/O across many different controller types, we’ve been adding more and more intelligence to fine tune every customer’s installation dynamically. Imagine you have a tiered environment: your low end SATA controllers can’t possibly handle the same I/O rate as your large cached, enterprise controllers. Yet SVC doesn’t really know what storage you have behind each box. It could be that you have something like XIV, where SATA doesn’t behave like SATA, or you may only have a small number of spindles attached to your enterprise box. In other words, every installation is different. The only real way to be sure is to monitor and feed back to the cache what’s working and what’s not.

Back in 4.2.0 we added some basic performance metrics that did just that. For writes, we can vary the destage rates based on the expected response time when compared with the actual response time. The goal here is to keep the disks and controllers as busy as possible, but maintain good response time. Ultimately, if you try to drive more I/O than the controller (and its cache) and disks can handle, then somewhere queueing has to happen. Usually this would happen in a server itself – as queue depths get reached. However, with SVC capable of accepting many more concurrent I/Os than an average storage controller, this could mean that SVC can easily overload said controllers. For writes we therefore maintain a careful balance to keep response time at the disk system layer as good as possible.
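As a rough illustration of that feedback loop, here is a small Python sketch. The post only says that destage rates vary with expected versus actual response time; the multiplicative step, the limits and the function name below are all assumptions, not the real SVC rules.

```python
def adjust_destage_rate(rate, actual_ms, expected_ms,
                        step=0.1, min_rate=1.0, max_rate=10000.0):
    """Nudge the destage rate (I/Os per second) for one controller.

    Hypothetical rule: back off when the backend is slower than expected,
    push harder when it is keeping up, and stay within fixed limits.
    """
    if actual_ms > expected_ms:
        rate *= (1 - step)   # controller is struggling, send less
    else:
        rate *= (1 + step)   # controller has headroom, send more
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    rate = 1000.0
    # Simulated response times (ms) against an expected 10ms.
    for actual in [5, 6, 8, 14, 20, 12, 9]:
        rate = adjust_destage_rate(rate, actual, expected_ms=10)
        print(f"actual={actual}ms -> destage rate {rate:.0f}/s")
```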

Back in 4.2.1 we added partitioning of the cache, thus ensuring we don’t allow a single overloaded controller to run away with all the SVC cache. We can when necessary cap the amount of write data that the SVC cache can itself hold for any given controller, to ensure fairness, and basically make sure one overloaded controller doesn’t slow down / impact another well behaved controller.
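A per-controller cap of this kind can be sketched as follows. This is only a toy model in Python: the class name, the fixed percentage cap and the byte counts are invented for the example, and unlike the description above it applies the cap all the time rather than only when necessary.

```python
class PartitionedWriteCache:
    """Toy write cache that limits dirty data held per backend controller."""

    def __init__(self, capacity: int, max_share_pct: int):
        self.capacity = capacity
        # Hypothetical fixed cap: no controller may hold more than this.
        self.limit = capacity * max_share_pct // 100
        self.used = {}  # controller name -> dirty bytes currently cached

    def try_admit(self, controller: str, nbytes: int) -> bool:
        """Accept a write into cache if it fits the partition and the cache."""
        used = self.used.get(controller, 0)
        if used + nbytes > self.limit:
            return False  # this controller's partition is full
        if sum(self.used.values()) + nbytes > self.capacity:
            return False  # the whole cache is full
        self.used[controller] = used + nbytes
        return True

    def destaged(self, controller: str, nbytes: int) -> None:
        """Record that dirty data was written out to the controller."""
        self.used[controller] = max(0, self.used.get(controller, 0) - nbytes)


if __name__ == "__main__":
    cache = PartitionedWriteCache(capacity=100, max_share_pct=50)
    print(cache.try_admit("slow_sata", 50))   # fills its partition
    print(cache.try_admit("slow_sata", 1))    # refused, partition full
    print(cache.try_admit("fast_box", 40))    # other controller unaffected
```

The point of the design is visible in the last call: the overloaded controller has hit its limit, but the well behaved one still gets cache space.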

Now with 5.1.0 we have gone a step further. Having monitored and analysed how such a heterogeneous environment behaves as a whole system over the last few years, it became clear that controllers don’t like bursts of I/O. They like to be steadily busy, or for workloads to gradually ramp up and down. Now the real world isn’t like that, but in the SVC world it can be. We have enough cache (now up to 192GB per cluster) that we can smooth out these workloads and provide the downstream controllers with the workload they like best. So we’ve smoothed out the ramping as we transition between workloads, gradually upping destage rates and monitoring performance as we go. Based on a huge number of inputs, rules and measurements, we can drive the cache management algorithms in a way that suits not only the current workload, but also the attributes of the environment in which each instance of SVC finds itself. I actually spent about 3 months at the start of the year writing a lot of this, and my brain is still numb from actually having to go back a month or two later and write down what all these rules, inputs and measurements are, do, and mean!
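The smoothing idea, stripped of all the real inputs and rules, boils down to never jumping straight to a new rate. A minimal Python sketch, assuming a simple bounded step per interval (the function name and the numbers are illustrative only):

```python
def ramp_toward(current: int, target: int, max_step: int) -> int:
    """Move the destage rate toward target by at most max_step per interval."""
    delta = target - current
    if abs(delta) <= max_step:
        return target
    return current + max_step if delta > 0 else current - max_step


if __name__ == "__main__":
    rate = 0
    # A burst arrives: the 'ideal' rate jumps to 100, then drops back to 20.
    for target in [100, 100, 100, 100, 20, 20, 20]:
        rate = ramp_toward(rate, target, max_step=30)
        print(f"target={target} -> rate={rate}")
```

The cache absorbs the difference between what the hosts send and what is destaged, so the backend controller sees a gradual ramp instead of a burst.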

For reads it’s not that easy. If a read request comes in and it’s a miss, we have to get it from disk, so just like any controller, send too many random reads and you can suffer. However, with SVC’s wide striping, random workloads can really benefit from simply getting more spindles running at once.

The long and the short of this is that when comparing 8G4 hardware with CF8 hardware, read performance (IOPs and MB/s) has doubled (primarily from the new hardware itself) but write IOPs have increased by up to 145%. In other words, part of the write gain comes from the code rather than the hardware, so running the new 5.1 code on existing hardware has a benefit too, of around a 20% boost – but the key thing is we can drive these higher rates with much lower response time as well.

When a cache has some amount of data in it, you want to slowly, or as we call it, trickle, any dirty write data out to disk. This is something we’ve also adjusted, and we smoothly ramp the trickle rate, based on how much dirty data is in the cache. The new hardware is so quick that we also found we could ‘trickle’ the entire write cache out in a few seconds if you had enough backend disks! The cache now sets a goal based on elapsed time during trickle mode, which means we send a small batch of writes at a time. That doesn’t sound that interesting until you realise that it has a unique benefit when it comes to SSD performance. We all know that today’s NAND based SSDs are great at reads, but write rates are usually a lot lower (still many times an HDD), and as a result mixed performance is usually pulled down to around the write rate. This is for various reasons, like erase blocking, general write processing etc. But with the cache sending a small batch of writes, then a sustained batch of reads, then a small batch of writes and so on, this actually starts to look much more like a read workload to the SSD, so the mixed performance is not actually pulled down to the write performance. Hence, we can actually drive the STEC ZeusIOPs to a higher mixed IOPs rating than even STEC themselves claim from the physical device. By as much as 30% under read biased workloads!
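Two pieces of that description lend themselves to a small Python sketch: sizing the write batch from an elapsed-time goal, and interleaving small write batches with longer runs of reads. The function names, the interval length and the batch sizes are all assumptions for illustration; the post does not give the real values.

```python
import math


def trickle_batch_size(dirty_tracks: int, goal_seconds: float,
                       interval_seconds: float) -> int:
    """Writes to send per interval so the dirty data drains over goal_seconds."""
    intervals = max(1, int(goal_seconds / interval_seconds))
    return math.ceil(dirty_tracks / intervals)


def interleave(reads, writes, write_batch: int, read_batch: int):
    """Order I/O as: small batch of writes, run of reads, repeat."""
    if write_batch < 1 or read_batch < 1:
        raise ValueError("batch sizes must be at least 1")
    out = []
    r = w = 0
    while r < len(reads) or w < len(writes):
        out.extend(writes[w:w + write_batch])
        w += write_batch
        out.extend(reads[r:r + read_batch])
        r += read_batch
    return out


if __name__ == "__main__":
    # 1000 dirty tracks, drain over 10 seconds, one batch per second.
    print(trickle_batch_size(1000, goal_seconds=10, interval_seconds=1))
    print(interleave(["r1", "r2", "r3", "r4"], ["w1", "w2"],
                     write_batch=1, read_batch=2))
```

With the writes grouped into short bursts like this, the device sees long stretches of pure reads, which is the pattern the paragraph above credits for the better mixed SSD performance.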

All of the number watching, SSD investigating and cache monitoring over the last few years almost feels worthwhile!

We’ve also added the ability to dynamically turn the cache on or off for any given vdisk. Something that until now has been a create time only option.

These are just a couple of the areas we’ve spent time enhancing in the last 12 months, and in my next post I’ll cover the enhancements we’ve made to the copy services, both FlashCopy and Metro/Global Mirror, and some of the finer details of the smaller items that don’t make the marketing material!