ORIGINALLY POSTED 18th December 2008
8,698 views on developerworks
In the final part of my ‘How it Works’ series I’m looking at upgrades, both software and hardware.
As the world has moved to a 24x7x52 culture, nobody can really afford many, if any, ‘maintenance windows’ these days.
I described in part 2 that to move from a non-virtual to a virtual environment you do need to insert the virtualizer into your infrastructure and so re-map the disks, in a before-and-after style. No matter what virtualizer you choose, today this step is necessary. But you could think of this as your next (and final) migration of your storage mapping.
Suppose you are lucky and get to replace your storage controllers every 3 years. You have to do this migration and re-mapping every 3 years. To actually move all the data it may take 2 months (either side of the migration) so you really only have 2.7ish years of usage. Consequently, every 2.7 years you have a couple of months of weekend work, working with vendors, service orgs and making sure all your hosts and applications come back up once the migration is complete.
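The arithmetic behind that 2.7ish figure can be sketched in a few lines. This assumes that “either side of the migration” means roughly 2 months of data movement when the controller arrives and another 2 months when it leaves; the function name and parameters are made up for illustration.

```python
def usable_years(cycle_years, migration_months_each_side):
    """Years of normal use left in a replacement cycle once the
    data-movement time at both ends is subtracted (assumed model)."""
    migration_months = 2 * migration_months_each_side
    return (cycle_years * 12 - migration_months) / 12


# 3-year cycle, 2 months of migration at each end: 32 of 36 months usable.
print(round(usable_years(3, 2), 1))  # 2.7
```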
In the early days of SVC development we realised that once virtual, you didn’t need to do this ever again. However, depending on how well (or badly) we (or others) architected our virtualization products, you may have to do this again.
Non-disruptive Software Upgrades
So first of all, software. Well, it’s pretty much taken for granted these days that software upgrades need to be online, non-disruptive and concurrent. (Someone maybe wants to tell Microsoft that… still…) But for sure in a storage environment, especially an enterprise storage environment, that had better be the case.
Here we need to get into some of the nitty gritty architecture of our SVC cluster. Each node runs two environments: a cluster state machine and a local agent. The cluster state is basically everything that all nodes need to know. This includes configuration, replication state, and anything else needed to ensure another node can take over control and process I/O without any perceivable interruption of service to the using systems. The agent code is what a particular node is doing at that time, so an I/O to disk X, an I/O to host Y.
When we upgrade the software, because we run in a redundant I/O group (node pair) manner, we can afford to take out one node from every I/O group and continue to operate without issue. So that’s what we do. First we flush the write cache data to disk, and each node runs in write-through mode. Once a node pair is happy that it will not cause a redundancy issue if it were to lose a node, we stop the SVC code on one node (let’s say the odd numbered node) and start the software upgrade. This node will likely have to reboot after the upgrade (as we may upgrade the underlying OS or kernel modules) and then it comes back and re-joins the cluster.
At this time we’ve done the same on all the odd numbered nodes. So half the cluster has the new code. However we don’t start running the new code just yet. We split it. Because the “agent code” is independent of the cluster state, we can start running the new agent code. However the same node runs the old cluster state / code. Thus the cluster is quite happy, it sees X nodes all running the old code. Now we wait.
The wait is to help out the multitude of controllers, OS, HBA and most importantly the multi-pathing software to recover. Only when all the paths to all the disks are back can we continue. Some MP software takes a looooong time to recover…
Now we repeat the process with the even numbered nodes. So we should end up with all the nodes back and online, running the new agent code, but old cluster code. Once everyone is back, all nodes commit the upgrade and roll forward. This may well include specific code that translates old to new from a cluster state perspective. Thus all nodes roll forward, updating the cluster state at the same time, which enables us to add new features, new structures and new meta-data to the cluster state without issue.
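To make the ordering concrete, here is a toy model of the sequence described above. It is only a sketch, not SVC code: the ClusterNode class, the rolling_upgrade function and the paths_recovered callback are all names invented for illustration, and the assumption that write caching resumes after the commit is mine.

```python
import time
from dataclasses import dataclass


@dataclass
class ClusterNode:
    node_id: int
    io_group: int
    agent_version: str
    cluster_version: str
    online: bool = True
    write_through: bool = False


def every_io_group_has_online_node(nodes):
    """True if no I/O group has lost both of its nodes."""
    groups = {n.io_group for n in nodes}
    return all(any(n.online for n in nodes if n.io_group == g) for g in groups)


def rolling_upgrade(nodes, new_version, paths_recovered, poll_seconds=1.0):
    # Flush the write cache and run every node in write-through mode.
    for n in nodes:
        n.write_through = True

    odd = [n for n in nodes if n.node_id % 2 == 1]
    even = [n for n in nodes if n.node_id % 2 == 0]

    for half in (odd, even):
        # Take one node out of every I/O group.
        for n in half:
            n.online = False
        if not every_io_group_has_online_node(nodes):
            raise RuntimeError("an I/O group would lose both nodes")

        # Node comes back running the new agent code but the old cluster code.
        for n in half:
            n.agent_version = new_version
            n.online = True

        # Wait for host multi-pathing to see all paths again.
        while not paths_recovered():
            time.sleep(poll_seconds)

    # All nodes commit together and roll the cluster state forward.
    for n in nodes:
        n.cluster_version = new_version
        n.write_through = False  # assumption: caching resumes after commit


nodes = [
    ClusterNode(1, 0, "old", "old"), ClusterNode(2, 0, "old", "old"),
    ClusterNode(3, 1, "old", "old"), ClusterNode(4, 1, "old", "old"),
]
rolling_upgrade(nodes, "new", paths_recovered=lambda: True, poll_seconds=0)
```

The point of the split is visible in the loop: cluster_version never changes until every node has the new agent code, so at no time does the cluster see a mix of cluster state formats.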
Meanwhile… back at the application, nothing has happened. OK, so we’ve taken half the cluster out at a time, and we’ve stopped caching, so you are at the mercy of the disk systems/controllers themselves, but that’s why we recommend you don’t go above 75% utilisation per node: so that you can happily run the same I/O load through one node (without having to mirror cache writes, the workload reduces).
It works… we’ve got over 15,000 nodes out there and have released over 15 major releases of software, with numerous PTF releases per major release. That’s A LOT of software upgrades in real time…
So that’s software… but what about hardware…
So software ain’t so difficult…. LOL! But hardware, well again with the design of independent nodes, where each node can be a single point of failure, yet the cluster continues… actually hardware upgrades for SVC are easy.
This has come up quite a bit recently in the blogosphere. Storagebod was asking about it, a few comments on my blog have asked about this, and HDS have realised we are killing them in bids when customers ask about this…
Because we can take an I/O group (node pair) down to one node, we do that. Shut down one node, delete it from the cluster, rack up a new node (next gen, 2nd gen, 3rd gen etc) and power it up.
Now comes the clever bit. Because a node has a WWNN, if you simply add a new node in place of an old one, yes, SVC will be happy, but what about your servers? They see a completely new path to some disks and barf. SVC controls its own WWNN. There is the base “IBM SVC” part of the WWNN, but the last few digits are usually controlled by the display panel. So tell the display panel it’s actually the old display… or WWNN… et voila.
The procedure for node upgrade therefore lets you change the WWNN to that of the old node, and so the fabric, switch, director, zone and most importantly the host and application see the same vdisk on the same path.
Repeat the process for all nodes in the cluster and you’ve just upgraded all of the SVC hardware without ANY disruption to your hosts or applications.
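As a rough illustration of why the hosts never notice, here is a toy model of the swap. Everything in it is hypothetical: the HardwareNode class, the replace_node function and the placeholder WWNN strings are invented for the sketch and are not the real SVC procedure or real WWNN values.

```python
from dataclasses import dataclass


@dataclass
class HardwareNode:
    serial: str
    wwnn: str
    generation: int


def replace_node(io_group, old_serial, new_node):
    """Swap one node in an I/O group for new hardware, keeping the old WWNN."""
    old = next(n for n in io_group if n.serial == old_serial)
    if len(io_group) < 2:
        raise RuntimeError("no partner node left to carry the I/O")

    # Shut down the old node and delete it from the cluster.
    io_group.remove(old)
    # Give the new hardware the identity the hosts already know.
    new_node.wwnn = old.wwnn
    # Rack up the new node and add it to the I/O group.
    io_group.append(new_node)
    return old


group = [HardwareNode("old-1", "WWNN-A", 1), HardwareNode("old-2", "WWNN-B", 1)]
wwnns_before = {n.wwnn for n in group}

replace_node(group, "old-1", HardwareNode("new-1", "WWNN-FACTORY-1", 2))
replace_node(group, "old-2", HardwareNode("new-2", "WWNN-FACTORY-2", 2))

wwnns_after = {n.wwnn for n in group}
print(wwnns_before == wwnns_after)  # True
```

From the host side, the set of WWNNs is the only thing that matters in this model, and it is identical before and after both nodes have been replaced.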
As I understand it, SVC is the only system that today offers this capability. I’ve heard HDS are looking at some funky dual access mode for USP, but without serious end-port-virtualization (forever more) how is that going to work?
Anyway, this 4-part work all started from Hu poking me in the wrong place to call SVC a SAN Virtualizer and not a storage virtualizer, with lots of finger pointing and mis-conception-understanding on his part about what SVC can do. Hopefully those of you that have stuck with me can look back and see that SVC can do EVERYTHING Hu mentioned, and then some.