Let’s get this right, once and for all: ALUA != passive – Active/Active is not precluded by ALUA

First off, I wrote about this seven years ago… but it seems the usual suspects and some newcomers are back on the FUD-spreading bandwagon trying to say that ALUA must mean active/passive… it doesn’t!

Secondly, let’s get the main fact on the table straight away:

Spectrum Virtualize, and hence SVC, Storwize and FlashSystem, are all active/active controllers, both in terms of pathing and I/O processing.

IBM has never implemented an active/passive solution. Most vendors have moved away from them too (EMC, for example) – except Pure, who implement active/active paths but active/passive I/O handling, and try to claim it as a benefit…

Really, that’s all that needs to be said, but this comes up again and again, particularly if you have been drinking too much of that orange Kool-Aid.

But ALUA…

So why the confusion and/or FUD? Well, it all comes down to the part of the SCSI spec that implements a path discovery description, known as Asymmetric Logical Unit Access (ALUA). If you have access to the T10 SCSI documents you can go and read all about it, but for those with neither the will nor the desire to go that far, ALUA was originally intended to create target port group assignments.

The basic idea is a way to advertise that some access routes to a given LUN or group of LUNs may not be as efficient as others.
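
To make that concrete, here’s a rough Python sketch of the idea – my own illustration, not anything lifted from the T10 documents or from product code. Each target port group reports an access state, and a sensible host treats both “active” states as perfectly valid ways to reach the LUN, it just tries the optimised one first:

    # Illustrative only: a toy model of the ALUA access states and a host-side
    # path chooser. The class and function names are mine, not from the spec.
    from dataclasses import dataclass
    from enum import Enum, auto

    class AccessState(Enum):
        ACTIVE_OPTIMIZED = auto()      # best route to the LUN
        ACTIVE_NON_OPTIMIZED = auto()  # still accepts I/O, just not the best route
        STANDBY = auto()               # limited command set, no normal I/O
        UNAVAILABLE = auto()           # no I/O at all

    @dataclass
    class TargetPortGroup:
        ports: list
        state: AccessState

    def usable_groups(groups):
        """Both 'active' states can service reads and writes; a sensible host
        simply tries the optimised group first and keeps the rest as fallback."""
        active = [g for g in groups
                  if g.state in (AccessState.ACTIVE_OPTIMIZED,
                                 AccessState.ACTIVE_NON_OPTIMIZED)]
        return sorted(active, key=lambda g: g.state is not AccessState.ACTIVE_OPTIMIZED)

    if __name__ == "__main__":
        tpgs = [TargetPortGroup(["nodeB:p1", "nodeB:p2"], AccessState.ACTIVE_NON_OPTIMIZED),
                TargetPortGroup(["nodeA:p1", "nodeA:p2"], AccessState.ACTIVE_OPTIMIZED)]
        for g in usable_groups(tpgs):
            print(g.state.name, g.ports)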

Some early implementations of SCSI-based storage controllers (going back 20+ years) took a lazy approach and built 2-way controller systems, but with one controller being a passive device. In such a system (DG, anyone?) all data actually flowed through just one controller, and only if that failed would the painful act of failover occur, bringing the passive controller online and moving all I/O flow to that now-active device. But at any point in time, only one half of the system was actively processing I/O. It makes for simple coding and much simpler error handling, and generally can be seen as a poor man’s approach to implementing a 2-way system. (I find it amusing that a certain orange vendor attempts to turn this into a positive in their marketing – again, avoid the sugar rush from the Kool-Aid.)

So ALUA has been tainted with the active/passive mantra, when in reality ALUA was just how they butchered the SCSI spec to implement such an active/passive system.

ALUA – optimised/non-optimised

The optimised and non-optimised flags on a path can however be used to provide an end user performance benefit.

Spectrum Virtualize-based systems use the optimised/non-optimised flags (by default) to create what we call preferred and non-preferred paths. Each LUN is aligned with a node, and paths to that LUN on that node will be marked as preferred; paths to the other node as non-preferred. Straight off, we have a nice bit of load balancing. But remember, don’t confuse this with active/passive.

Both preferred and non-preferred paths are active.

That is, you can send I/O to preferred and non-preferred paths and the system doesn’t care. It will handle them, and there is zero performance impact for sending an I/O via a non-preferred path.
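
If it helps, here’s how I tend to picture it, as a toy Python model rather than anything resembling the real firmware (the names and the even/odd spread are purely illustrative):

    # Toy model: volumes are spread across the two nodes of an I/O group, the
    # paths through the "preferred" node are advertised as optimised, and
    # handle_io() happily accepts work arriving on either set of paths.
    NODES = ("node1", "node2")

    def preferred_node(lun_id):
        return NODES[lun_id % 2]      # a trivial even/odd spread for illustration

    def path_states(lun_id):
        pref = preferred_node(lun_id)
        return {node: ("optimised" if node == pref else "non-optimised")
                for node in NODES}

    def handle_io(lun_id, node, op):
        # No trespass, no ownership transfer, no error path: both nodes process I/O.
        return f"{op} for LUN {lun_id} serviced by {node} ({path_states(lun_id)[node]} path)"

    print(path_states(7))
    print(handle_io(7, "node1", "read"))    # non-preferred path: still serviced
    print(handle_io(7, "node2", "write"))   # preferred path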

So why use this …

The nitty-gritty details…

If you’ve made it this far, then we are now into the weeds, as they say.

Read I/O

When a read request is received by a controller the first thing it will do is look to see if it has that data in cache. If yes, great, we can return it without doing any backend storage (disk/flash) I/O and all is good.

How does data get into the cache? Well, if we read data from the backend, we may as well hang onto it for a while (if there is space in the cache) – the application may read it again soon.

If you’ve written data, and we’ve written it to the backend, we can turn that into potential read data now (since a read of what has just been written == what has just been written!).

Finally a cache will often look out for sequential workloads and, when detected, will start to guess what you are going to ask for next (read-ahead or pre-fetch).
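
As a sketch of those three routes into the read cache – purely illustrative, with a naive block-addressed cache and a deliberately crude sequential detector, nothing like the real code:

    # Illustrative node read cache: populated by read misses, by recent writes,
    # and by a read-ahead guess when the access pattern looks sequential.
    class NodeReadCache:
        def __init__(self, read_ahead_blocks=8):
            self.blocks = {}                # block number -> data
            self.last_block_read = None
            self.read_ahead_blocks = read_ahead_blocks

        def read(self, block, backend):
            data = self.blocks.get(block)
            if data is None:                # miss: fetch from disk/flash...
                data = backend(block)
                self.blocks[block] = data   # ...and hang on to it for a while
            if self.last_block_read == block - 1:   # looks sequential: pre-fetch
                for b in range(block + 1, block + 1 + self.read_ahead_blocks):
                    if b not in self.blocks:
                        self.blocks[b] = backend(b)
            self.last_block_read = block
            return data

        def write(self, block, data):
            # What was just written is, by definition, valid read data.
            self.blocks[block] = data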

Read data is not usually mirrored; it only stays in the cache on the node that processed the I/O.

If you always send I/O for a given LUN to the same node, you increase the probability that a re-read will hit data already in cache.

For example, if you read the same bit of data twice but send the reads to two different nodes, then each one will have to read from the backend, because the read cache isn’t mirrored. The same goes for the pre-fetch algorithms: if the sequential stream stays on one node, we can easily detect it and start reading ahead. If you jump around between nodes, we don’t actually see sequential I/O, more like a jumping square wave pattern…

So for read workloads, by honouring the preferred path, we can improve potential read performance (reduce latency by getting more cache hits).
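
Here’s a contrived little demo of that effect, with two independent (unmirrored) read caches modelled as plain Python sets and made-up numbers:

    # Contrived demo: read 1000 blocks twice, either always via one node or
    # sprayed across both. Each node's read cache is independent, because
    # read data is not mirrored between nodes.
    import random

    def rereads_hit(route):
        caches = {"node1": set(), "node2": set()}
        hits = 0
        for block in list(range(1000)) * 2:   # every block read twice
            node = route(block)
            if block in caches[node]:
                hits += 1
            else:
                caches[node].add(block)       # populate that node's cache on a miss
        return hits

    random.seed(1)
    print("always the preferred node:", rereads_hit(lambda b: "node1"))   # 1000 hits
    print("sprayed across both nodes:",
          rereads_hit(lambda b: random.choice(["node1", "node2"])))       # roughly 500

Run it and the “sticky” case gets roughly twice the hit rate of the sprayed case, for exactly the reason above.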

Write I/O

For write I/O, it’s simple. There is no actual performance difference no matter where you write to. Since a write has to be mirrored to both nodes in a caching controller, writing to the preferred or non-preferred node has the same end result – both nodes get a copy before the write is acknowledged back to the host.

However, think about what the controller has to do with the write. As far as the application is concerned it’s done. But all that has actually happened is we’ve made two copies of the data, one in each node’s cache memory. At some point later the controller has to write that out to the disk/flash media itself to permanently store it (i.e. destage the write). The last thing you want is to let both nodes do that: not only is it wasteful, but you also run the risk of creating inconsistent data at the backend. So one node is nominated to do the destage. For this we use whichever node is the preferred node, i.e. the one that advertised the preferred paths. It’s responsible for the destage, which again has the nice effect of load balancing the backend destage work.
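
A deliberately over-simplified Python sketch of that ordering (the class and method names are mine, not how the product is actually structured):

    # Caricature of a two-node write cache: the write is mirrored to both nodes
    # before the host gets its acknowledgement, and only the preferred node
    # later destages the data to the backend media.
    class TwoNodeWriteCache:
        def __init__(self):
            self.cache = {"node1": {}, "node2": {}}     # per-node write cache

        def host_write(self, lun, block, data, receiving_node):
            # receiving_node is deliberately ignored: the outcome is identical
            # whether the write arrived on a preferred or non-preferred path.
            for node in self.cache:                     # copy lands in BOTH caches...
                self.cache[node][(lun, block)] = data
            return "ack"                                # ...then the host is told "done"

        def destage(self, lun, block, preferred_node, backend_write):
            # Exactly one node (the preferred one) writes to disk/flash, which
            # avoids duplicate backend I/O and keeps the backend consistent.
            backend_write(lun, block, self.cache[preferred_node].pop((lun, block)))
            other = "node1" if preferred_node == "node2" else "node2"
            self.cache[other].pop((lun, block), None)   # partner copy no longer needed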

So there are the details: that’s why we have a preferred/non-preferred, optimised/non-optimised ALUA-based implementation. But remember, don’t confuse path optimisation with path activation – ALL paths are active no matter their optimisation state.

Disabling optimised/non-optimised

Hopefully after reading this you will see that there are real end-user benefits to having the optimised and non-optimised pathing rules. But if you really have been brainwashed and want to reduce your potential cache hit rates, you can modify your system to ignore the ALUA path settings.

In most multipath configuration tools (scripts, config files etc.) you can tell them to use a different pathing model. By default they will recognise Spectrum Virtualize LUNs and set up the pathing as described here – but if you set them to “fixed” or “round-robin” and remove the “weight” or “priority” settings, then the system will use a different pathing model and you can send I/O to all active paths, preferred or not.
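
For example, on Linux dm-multipath the ALUA-aware behaviour looks roughly like the stanza below – an illustrative sketch only, so check your distro’s built-in defaults and IBM’s host attachment documentation rather than copying it verbatim. Switching path_grouping_policy to “multibus” is the equivalent of throwing the preference away: every active path lands in one priority group and I/O is spread across the lot, preferred or not.

    devices {
        device {
            vendor "IBM"
            product "2145"
            path_grouping_policy "group_by_prio"   # group paths by ALUA priority
            prio "alua"                            # derive the priority from the ALUA state
            path_selector "service-time 0"
            failback "immediate"
        }
    }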

But as I say, really, don’t get sucked into the FUD spreading. Ask your vendor directly, and if they say that ALUA is bad, ask them why. If they start down the active/passive argument, take them to task and ask them about active/active with ALUA… surely that *could* be a benefit if you implement it like we have.

11 responses to “Let’s get this right, once and for all: ALUA != passive – Active/Active is not precluded by ALUA”

  1. >>Read data is not usually mirrored, it only stays in the cache on the node that processed the I/O.
    So, why not just go and check the other node’s cache? That would virtually double the read cache for all LUNs.

    1. Latency and bandwidth implications… so you would need to add a message to the node-to-node queue for every read operation. This would add a lot of extra work and, in normal running, result in very little gain. If you serialised it, you are adding a round-trip latency to all reads that miss in the other cache too. If you sent this query and the backend read in parallel, it becomes a race to get the data and you double the bandwidth usage and the operational overhead to process two completions etc… so for a very low probability of a hit you would add too much additional processing, wasting MIPS and bandwidth and generally increasing latency.
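
      Back-of-envelope version of that, with completely made-up numbers just to show the shape of the trade-off in the serialised case:

          # Made-up numbers, purely to illustrate the serialised peer-query case above.
          node_to_node_rtt_us = 100   # assumed inter-node round trip
          backend_read_us = 300       # assumed disk/flash read time
          p_local_hit = 0.30          # assumed local cache hit rate
          p_remote_only_hit = 0.02    # assumed chance only the OTHER node has the data

          # Query the peer first: every local miss pays the round trip, and most of
          # the time still has to read the backend anyway.
          with_peer_query = (p_remote_only_hit * node_to_node_rtt_us
                             + (1 - p_local_hit - p_remote_only_hit)
                             * (node_to_node_rtt_us + backend_read_us))

          # Don't ask the peer: a local miss just reads the backend.
          without_peer_query = (1 - p_local_hit) * backend_read_us

          print(f"average latency added per read: {with_peer_query - without_peer_query:.0f} us")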

  2. Also – when implementing active/active solutions (e.g. HyperSwap) the ALUA settings in Spectrum Virtualize are carefully crafted to ensure that host traffic doesn’t take an unnecessary hop across the long-distance link, thereby massively improving the solution performance and reducing the bandwidth needed on that long-distance link.

  3. One reason for requiring active/active paths is to simplify the management of vdisk allocation: the storage admin doesn’t need to balance each vdisk’s preferred node manually. The other reason is faster failover time. The active/active path is the view at the front-end FC ports; the true requirement is symmetrical access at the LUN level, meaning both nodes can handle read/write I/O concurrently. In such a symmetrical access mode, some vendors can shorten I/O pending time to 1s when one of the nodes fails. This has positive value for online transaction applications.

  4. Hi:
    Regarding your statement about “zero performance impact for sending an I/O via a non-preferred path”, is it true for write I/O only?
    As I can see, if a read I/O is issued via a non-preferred path, from what you mentioned above the non-preferred controller will take control of the LUN to read the data. This will affect the write I/O, which is normally handled by the preferred controller. LUN ownership transfer is somewhat costly. Please enlighten me on the mechanism involved. Thanks.

    1. There is no taking control involved. Both nodes can process I/O for the volume at the same time. Write order consistency is maintained because the cache mirrors the data between nodes, so there is always a copy of the latest write data for either node to read. If the data is not in cache, the node that received the read request reads from the backend disk. So there is no penalty for writes or reads. The data is either in cache or on disk, and both nodes know if it’s in cache.

  5. Hi Barry:
    Thanks for your almost instant reply. I understand the write part, as there’s no added cost to the write I/O when issued to the non-preferred path. I guess it’s something like this:

    – store in (non-preferred) controller cache
    – mirror write to partner controller cache
    – ack
    – then the partner controller will write back to disk at a convenient time

    as opposed to when the write is issued to the preferred controller, where the number of steps is similar:

    – store in preferred controller cache
    – mirror write to partner controller cache
    – ack
    – write back to disk

    However, if the read is issued to the non-preferred path, I’m thinking it’s like:

    – look up in the (non-preferred) controller cache
    – if found, done
    – if not, take over the LUN, read from disk (?)
    or
    – ask the owning controller to perform the read, pass the data through the interconnect to the alternate controller, and return it to the host?

    This is the point where I think it might disrupt the LUN write, which under normal circumstances should be owned by the preferred controller. If the LUN is a busy one, would the LUN ownership be transferred back and forth?

    1. So no, you don’t need to take over anything. Both nodes can simultaneously read from disk. On a read, if the data is not in cache, the disk has the latest data and no interlock is needed for either node to read it. So the steps and tasks are IDENTICAL no matter which node reads, preferred or not.

      That’s why it’s truly active/active and not active/passive.

      The concept of ownership never changes, and the owner only controls the destage from cache to disk. So when the write arrives it instantly mirrors to the other node. If that is in progress, the write into both caches completes first and then the read is processed. Assume the read came in after the write, on either node.

  6. Thanks Barry for your explanation. Can I say this is just like any OS/clustering technology, whereby simultaneous reads from multiple hosts are allowed but writes are only permitted from a single node?

    My next question would be, how is this different from a true AA storage?

    1. My point exactly. Just because we advertise a preference doesn’t mean the system can’t handle I/O in an AA manner.

      Some other vendors claim this preference stops it from being AA, which, as we’ve concluded, it doesn’t.

      Compare that to a true A/P model, where all the ownership grief you outlined causes bottlenecks and thrashing etc.

      Pathing models and execution models are independent.

  7. Hi Barry:
    I guess if there’s a ‘preferred’ controller identified for a LUN, as the name ALUA suggests, we assume the cost to access the LUN via different paths will be different.
    I don’t have in-depth knowledge of Storwize storage, being at user level with a few 3700s and 5000s. But I do notice that ALUA in Storwize behaves quite differently from, say, the LSI OEMs (DS35xx, Sun ST2540M2) or the DotHill OEMs (HP MSA, Dell ME4xxx). In essence, even though advertised as ALUA models, the DotHill models still behave like A/P storage. The Storwize just takes in I/O via all paths connected to the preferred AND non-preferred controllers.
