Just what is non-disruptive?

Nov 6, 2019

Barry Whyte

ORIGINALLY POSTED 23rd July 2008

14,567 views on developerworks

I was catching up on a few interesting comments on BarryB’s post about the evolution of DMX – notice how he deftly ignored my loaded questions about host side IOPs / MBs – but then we all know EMC, for some reason, just won’t publish any kind of yardstick for their products – ho hum.

Anyway, the following comments got me thinking about what we as vendors mean by non-disruptive. We all use it to describe upgrades to our boxes, to varying degrees of accuracy and realism. So let me set out what we mean by non-disruptive from an SVC perspective.

No interruption to service.
No interruption to access.

This is likely the definition the first vendor to coin the term had in mind, but in practice users will find that different products live up to it to different degrees. It is our goal that any supported upgrade should complete without disruption, and we test this extensively during system test as well as in our ongoing stress and regression testing. Our support matrix lists the recommended levels of software that should be used with the new code, mainly because we like to keep current with associated software levels to cover all installs, and because we may have discovered problems or issues in previous levels of the associated code. That's not to say that a previous level of code won't work in your environment with the latest level of SVC. Caveat: don't be surprised to be asked, by any vendor, to upgrade if you do experience a problem. However, one of our goals is not to suggest this as a pre-req, only as a result of analysis of issues and known problems.

One key thing that a product like SVC promises is the ability to perform non-disruptive upgrades of underlying storage controllers. Yet, we’ve now introduced a layer inside your SAN that itself must be upgradable. That’s why we test SVC code upgrades to the ultimate degree. During the upgrade each node is upgraded in turn. Our ‘split execution’ model, of 1-way code and clustered code, means we can run the new 1-way code during the process, and the cluster itself then decides whether it should roll forward to the new code once all nodes have access to it. Once the nodes agree, it happens. We don’t ask that you stop applications or reduce workload during this time, as only one node in the cluster is upgraded at a time, and this is done in a controlled manner that should limit any performance impact. Inevitably most upgrades will happen at weekends or during periods of low utilisation, and this can make sense from a business continuity point of view should something outwith SVC’s control happen during this time. It’s critical to ensure the health of your SAN, hosts and multipathing layers prior to commencing the upgrade. We provide tools (CCU checker) to help users health check their cluster.
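
To make the flow above concrete, here is a minimal Python sketch of a rolling upgrade with a cluster-wide commit at the end. It is a toy model only, not SVC's actual code: the `Node` class, the `rolling_upgrade` function, the field names and the "previous" level label are all hypothetical, invented for illustration. The idea it shows is the one described above: only one node is out of service at any moment, and the cluster rolls forward only once every node has the new code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a cluster node; not an SVC data structure.
    name: str
    installed: str       # newest code level present on the node
    active: str          # cluster code level the node is currently running
    online: bool = True


def rolling_upgrade(nodes, new_level):
    """Upgrade one node at a time, then commit only if all nodes agree."""
    for node in nodes:
        # Refuse to start on a node if any partner is already offline,
        # so that at most one node is ever out of service.
        if any(not n.online for n in nodes if n is not node):
            raise RuntimeError("another node is offline; aborting upgrade")
        node.online = False          # partners keep serving I/O
        node.installed = new_level   # new code is now present on this node
        node.online = True           # node rejoins before the next one starts

    # Cluster-wide decision: roll forward only when every node has the code.
    if all(n.installed == new_level for n in nodes):
        for node in nodes:
            node.active = new_level
        return True
    return False


# Example: a four-node cluster moving from a placeholder level to 4.3.0.
cluster = [Node(f"node{i}", "previous", "previous") for i in range(4)]
committed = rolling_upgrade(cluster, "4.3.0")
```

The point of separating `installed` from `active` is that a half-upgraded cluster keeps running the old clustered code everywhere, so there is never a moment where nodes disagree about which level is live.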

Now that you have a layer between your hosts and storage, it must be itself upgradable. So software wise, we are covered. Hardware wise, yup we got that too. Recently we had a large, key customer (whom I'm not at liberty to name) who was reaching the peak of what their original 2003 4F2 SVC nodes could achieve. They physically replaced all eight SVC nodes without issue and saw an improvement in response time from their existing storage, and from an SVC point of view utilisation dropped from >75% to <25%. This was all done non-disruptively (even the WWNNs were migrated, so no re-zoning or reconfiguration of the SAN was needed either). I do not know of another storage virtualization product that can do the same without some kind of host migration software or a 3 month migration cycle, and sometimes it’s just not possible at all. Ask HDS or EMC if they can replace the hardware online in a few hours – I know the answer and I’m sure you do too.

Even if you have to take a controller completely offline for some disruptive maintenance, our latest 4.3.0 code allows you to use the VDisk Mirroring feature to prepare for this event, offline the controller, fix it, and then only sync back the data that has changed since it went offline.
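
The "only sync back the data that has changed" idea can be sketched in a few lines of Python. This is an illustration of change tracking in general, assuming a simple set of dirty block numbers; it does not describe how SVC's mirroring is actually implemented, and the `MirroredVolume` class and its method names are hypothetical.

```python
class MirroredVolume:
    """Toy two-copy volume that tracks writes made while one copy is offline."""

    def __init__(self, num_blocks):
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks
        self.secondary_online = True
        self.dirty = set()   # block numbers written while secondary was offline

    def write(self, block, data):
        self.primary[block] = data
        if self.secondary_online:
            self.secondary[block] = data   # normal mirrored write
        else:
            self.dirty.add(block)          # remember what needs resyncing

    def take_secondary_offline(self):
        self.secondary_online = False

    def bring_secondary_online(self):
        """Copy only the changed blocks back; return how many were copied."""
        copied = 0
        for block in sorted(self.dirty):
            self.secondary[block] = self.primary[block]
            copied += 1
        self.dirty.clear()
        self.secondary_online = True
        return copied
```

With this model, a block written several times during the outage is copied once on resync, and blocks that were never touched are not copied at all, which is where the time saving comes from compared with a full resync.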

It’s difficult to find a single other product that provides such dynamic, truly non-disruptive and time-saving features that allow you to solve the common (painful) problems storage administrators face on a daily basis.