ORIGINALLY PUBLISHED 6th July 2015
I wanted to mention a feature that we released recently which we are calling the SVC Standby Engine. Unfortunately this feature is not available for Storwize systems, so it’s probably not worth you reading this, but let me know if it’s the kind of feature you would want on a Storwize system. That type of feedback is always valuable when prioritising future features.
What is this Standby Engine Feature?
This feature makes use of the existing non-disruptive hardware upgrade procedures to allow users to have a standby node which can be used to replace a broken node in the system.
There seem to be two different types of customers who really want this type of technology:
- Customers who consider running in a single point of failure as a critical impact to their business
- Customers who believe that the performance degradation involved when running without cache is not something that they can tolerate for an extended period of time.
The basic use case for the Standby Engine is as follows:
- One of your nodes has a hardware fault which needs to be repaired and the node goes offline
- The SSR is dispatched to replace your part – but you don’t want to wait for them to get onsite before restoring redundancy
- You initiate the Standby Engine procedure to replace the broken node with a standby node which is already powered, cabled and ready to go.
- The Standby Engine restores full redundancy to your SVC cluster and the Hardware repair procedure can be carried out without stress.
This seems like a nice idea …. yes?
Well – this is actually a feature that IBM developed in conjunction with a very skilled engineer named Dave Lounsberry from Sprint Corporation. We worked out that we didn’t need to write any new SVC code to support exactly this feature – and we could make it available as a procedure on many existing SVC software levels. Now that I’ve named Dave, I should also mention Cameron McAllister from the team here in Hursley who worked with Dave to try and make sure we’d covered all of the possible things that could go wrong.
Hopefully at this point at least some of you are interested in learning more – so I’m going to give you a couple of key pieces of information:
- The full Standby Engine Procedure is documented in a Redpaper here: http://www.redbooks.ibm.com/abstracts/redp5162.html?Open
- An example script which I wrote that automates most of this procedure is available from developerWorks here.
- IBM will now allow you to purchase a single node (rather than a pair of nodes) so that you can purchase a “standby node” for your configuration
The information in the links above should be everything you need to be able to configure standby engine for your environment. Make sure you read and understand the Redpaper before trying to deploy this yourself (it’s only 10 pages, it won’t take you long).
Some more details for those of you who are interested in deploying this
Q: Do I need one standby node per cluster, or can I share a standby node between clusters?
Well…. It depends.
My script requires one standby node per cluster for simplicity, but there’s nothing to prevent you from thinking about the problem some more and sharing a standby node between clusters.
Here are the main things that I think you should consider when deciding whether to share standby nodes or not:
1/ Software Upgrades.
The whole point of a standby node is to be able to restore the redundancy of the system rapidly. If you want to achieve that, then you need the standby node at the same code level as the cluster. If you are in the middle of an upgrade in your environment, then you will have at least 2 software levels in your estate, and a shared standby node can only match one of them. If it is needed by a cluster at a different level, it has to be brought to that cluster’s code level first, which means that your standby node now takes around 30 minutes to deploy rather than 2 minutes.
You can solve this problem by having a small number of standby nodes – as long as you do enough prep work and enough thinking (and write a much more complicated script).
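To make the code level point concrete, here is a minimal sketch of how a pool of shared standby nodes might be matched to a cluster. It is not part of the example script mentioned above. The `StandbyNode` class, the node names and the code level strings are all made up for illustration; the 2 and 30 minute figures are the rough numbers quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Rough deployment times quoted in the text above.
FAST_DEPLOY_MINUTES = 2    # standby already at the cluster's code level
SLOW_DEPLOY_MINUTES = 30   # standby must be brought to the right level first


@dataclass
class StandbyNode:
    name: str         # hypothetical label for the spare
    code_level: str   # hypothetical code level string


def pick_standby(cluster_level: str, spares: List[StandbyNode]) -> Optional[StandbyNode]:
    """Prefer a spare that already matches the cluster's code level."""
    for spare in spares:
        if spare.code_level == cluster_level:
            return spare
    # No match: fall back to any spare and accept the slower deployment.
    return spares[0] if spares else None


def estimated_deploy_minutes(cluster_level: str, spare: StandbyNode) -> int:
    """Estimate how long the spare takes to deploy into this cluster."""
    if spare.code_level == cluster_level:
        return FAST_DEPLOY_MINUTES
    return SLOW_DEPLOY_MINUTES


spares = [StandbyNode("spare-a", "7.3.0.9"), StandbyNode("spare-b", "7.4.0.3")]
chosen = pick_standby("7.4.0.3", spares)
print(chosen.name, estimated_deploy_minutes("7.4.0.3", chosen))
```

With one spare per code level in the estate, every cluster can get the fast path during an upgrade window. With a single shared spare, at least one cluster is on the slow path until the upgrade finishes.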
2/ SAN Connectivity.
If you have multiple SVC clusters, then it is not likely that they are all plugged into the same SAN directors. If your standby node is not plugged into the same directors as the original failing node, then your standby node will have different performance characteristics than the original node. ISLs between SVC nodes are just not a good idea. They’re great when the ISL is working and clean, but when things start to go awry they can hurt.
You could consider using the standby node temporarily and do a “fail back” after the original node has been repaired to avoid the worst of these problems.
Q: Can I use Standby Nodes with Stretched Cluster?
Absolutely! Unfortunately not with my script – because it requires too much knowledge about your environment, but it is entirely doable.
Q: You called the script an Example script – can I deploy it “as is”?
The script can definitely be used as is (as long as you meet the requirements of the script). I know of at least 2 customers that have used my script exactly as written. However, if you have the time and the skills, then some small modifications to integrate the script more completely into your environment may make it even more useful to you.
The biggest differences between my script, and the procedure documented in the Redpaper are:
- The script pauses in the middle of the procedure and tells the user to disable the switch ports before continuing – it does not automate the switch configuration changes. (Automating this would require the script to know about your switch port naming conventions so that it can automatically find the correct ports to disable)
- The Redpaper recommends changing the WWNN of the failing node to a “temporary WWNN” if the node is still accessible on the service interface. The script does not do this. (Doing so would require you to have configured service IP addresses on every node, and the script would need to know what those IP addresses are)
If you were looking to extend my script then these are the two changes I would make first.
Lab Services in the US have deployed my script to a number of customers (without modifications) so they may be able to help you implement a Standby Engine in your environment.
Q: Any known host problems?
I’ll be honest – this is a cheeky one, because no one has ever asked me that question – but there is something I would like to tell you about that this question leads nicely into.
As far as hosts are concerned – all they should see is paths failing with some errors and then paths coming back. We do not expect this to be a problem for multipathing drivers. However, multipathing drivers do have bugs, so we can’t be certain that all hosts will fail back. I’m not going to try and list any known issues here – but I did want to point out something on AIX.
Most operating systems track their storage devices by the following two attributes:
- name – in this case the WWPN (world wide port name)
- SCSI LUN
AIX does it a bit differently and uses
- address – in this case NPortID aka FCID
- SCSI LUN
If we use a post-office analogy, then when you swap in a spare node, that’s a bit like you moving house. After the standby engine has been deployed your name (the WWPN) hasn’t changed, but your address (the NPortID) has. This analogy doesn’t really work, because there’s no way of just sending mail to “Andrew Martin – somewhere in the world”, but in Fibre Channel there is, so we’ll pretend that someone invents a way ;-).
If you are Windows, your multipathing driver doesn’t even notice that Andrew Martin has moved house; it just keeps sending messages to me and I keep receiving them.
If you are AIX, your multipathing driver keeps sending mail to my old house, and I stop receiving it (because I didn’t set up any mail forwarding).
(OK – I give up on the analogy now, hopefully it was a little bit helpful, even if it was straining at the seams of believability)
This basically means that AIX multipathing drivers detect the new address as a new path, rather than as the old path “recovering”, and they don’t automatically configure the new paths into the system. So until you run cfgmgr to configure the new paths, the AIX host won’t fail over to them.
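If the analogy didn’t land, the same idea can be sketched in a few lines of Python. This is a toy model only, not how any real multipathing driver is implemented; the `TargetPort` class, the `track_by` modes and the WWPN and NPortID values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetPort:
    wwpn: str       # the "name"
    nport_id: str   # the "address" (FCID) handed out by the fabric


def path_key(port: TargetPort, lun: int, track_by: str) -> tuple:
    """Identity a host uses for a path: by name (WWPN) or by address (NPortID)."""
    if track_by == "wwpn":
        return (port.wwpn, lun)
    if track_by == "nport_id":
        return (port.nport_id, lun)
    raise ValueError(f"unknown tracking mode: {track_by}")


def path_recovers(old: TargetPort, new: TargetPort, lun: int, track_by: str) -> bool:
    """True if the host sees the returning port as the same path as before."""
    return path_key(old, lun, track_by) == path_key(new, lun, track_by)


# Made-up values: same WWPN before and after the swap, different NPortID.
before = TargetPort(wwpn="50:05:07:68:01:40:00:01", nport_id="0x010200")
after = TargetPort(wwpn="50:05:07:68:01:40:00:01", nport_id="0x010500")

print(path_recovers(before, after, lun=0, track_by="wwpn"))      # True
print(path_recovers(before, after, lun=0, track_by="nport_id"))  # False
```

The host that tracks by name sees the old path come back. The host that tracks by address sees the old path stay dead and a brand new path appear, which is the case that needs cfgmgr.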
AIX has a feature called dynamic tracking which can be enabled on the adapter and allows AIX to behave in the same way as other hosts when a path moves from one NPortID to another – unfortunately I don’t believe this is turned on by default.
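For reference, dynamic tracking is controlled by the `dyntrk` attribute on the Fibre Channel protocol device (`fscsi0`, `fscsi1` and so on), which can be inspected with `lsattr` and changed with `chdev`. The sketch below only builds the command lines so you can review them before running anything on a host. The adapter name `fscsi0` is an assumption, and you should check the current setting and your multipathing driver’s documentation before changing it.

```python
import shlex
from typing import List


def dyntrk_query_cmd(adapter: str = "fscsi0") -> List[str]:
    """Command to show the current dynamic tracking setting."""
    return ["lsattr", "-El", adapter, "-a", "dyntrk"]


def dyntrk_enable_cmd(adapter: str = "fscsi0", deferred: bool = True) -> List[str]:
    """Command to enable dynamic tracking.

    With deferred=True the -P flag records the change in the ODM only, so it
    takes effect the next time the device is configured (typically a reboot).
    """
    cmd = ["chdev", "-l", adapter, "-a", "dyntrk=yes"]
    if deferred:
        cmd.append("-P")
    return cmd


# Print the commands rather than running them, so they can be reviewed first.
for cmd in (dyntrk_query_cmd(), dyntrk_enable_cmd()):
    print(shlex.join(cmd))
```

The deferred form is used here because the device is normally busy while the host has disks open through it, so an immediate change is likely to be refused.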
Right – one last minor thing on this topic. The AIX issue I just described doesn’t occur if you have Cisco directors and the WWPN only moves to a different port in the same director. This is because the NPortID in Cisco is not tied to the physical port, it’s tied to the WWPN. So if WWPN XXX appears on any port of the director, it will always be given the same NPortID. If I were to try and shoe-horn this into my analogy, then I guess it would be a PO Box. So even though you moved house, because you’re still close by you can continue to use your existing PO Box, and therefore your address is still the same.
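The difference between the two addressing behaviours can also be shown with a toy model. This is a deliberate simplification for illustration, not how any switch firmware really allocates addresses; the class names and the address formats are invented.

```python
from typing import Dict, Tuple


class PortBasedFabric:
    """Address is derived from the physical port the device logs in on."""

    def login(self, wwpn: str, domain: int, port: int) -> str:
        return f"0x{domain:02x}{port:02x}00"


class WwpnPersistentFabric:
    """Address is remembered per WWPN within a director, whatever port is used."""

    def __init__(self) -> None:
        self._assigned: Dict[Tuple[int, str], str] = {}
        self._next = 1

    def login(self, wwpn: str, domain: int, port: int) -> str:
        key = (domain, wwpn)
        if key not in self._assigned:
            # First login of this WWPN on this director: allocate an address.
            self._assigned[key] = f"0x{domain:02x}{self._next:04x}"
            self._next += 1
        return self._assigned[key]


wwpn = "50:05:07:68:01:40:00:01"  # made-up WWPN

port_based = PortBasedFabric()
print(port_based.login(wwpn, domain=1, port=2) == port_based.login(wwpn, domain=1, port=5))

persistent = WwpnPersistentFabric()
print(persistent.login(wwpn, domain=1, port=2) == persistent.login(wwpn, domain=1, port=5))
```

In the second model the address only changes if the WWPN shows up on a different director (a different `domain` here), which matches the “still close by” condition in the PO Box analogy.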
Q: Insert your question here
There is no chance that I got all of your questions covered here – so feel free to ask more below and I’ll try my best to answer them when I can. I’m also interested to hear if any of you have deployed a Standby Engine in your environment, and whether you used my script, my script with modifications, or some other mechanism.