ORIGINALLY POSTED 8th September 2007
(UNFORTUNATELY developerworks HAS LOST THE FIGURE IMAGES!)
12,376 views on developerworks.
Chris posted an interesting solution to provide 3DC disaster recovery, yet only having local virtualization devices.
I posted a couple of responses to this question, and commented that this is a complex area of I/O flow, so I would describe here how SVC ensures the remote site is – what we call – consistent. That is, the remote site must always contain a usable copy of the data, which may be slightly behind the local site when asynchronous replication (GlobalMirror) is used. However, this does require an SVC cluster at the remote site.
SVC solves the problem by implementing the replication layer above the cache. This ensures that the remote site always has a consistent image and avoids cache coherency issues as a result of replication.
To explain a bit further, it’s important to understand the ‘stack’ of layers inside SVC. I’m providing this here to give readers a greater understanding of how SVC is implemented and to answer Chris’s integrity questions.
Figure 1 – SVC Cache processing
Fig1 shows an incoming write I/O (brown arrows) to the top of the node stack, which gets passed down as far as the cache layer. Here it’s mirrored to its peer node (the partner node in this I/O group). When the partner node acknowledges the I/O, the completion is passed back up and out of the node to the host. At some later stage the I/O is destaged and passed to the virtualization layer, which translates the virtual disk I/O into the required managed disk I/O(s).
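Since the figure is gone, here is a minimal sketch of the write path just described. It is a toy model, not SVC code: the `Node` class, `host_write`, `destage` and the dictionary standing in for the managed disks are all names I've made up for illustration.

```python
class Node:
    """Toy model of one node in an I/O group (illustrative names only)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}      # lba -> data not yet destaged to disk
        self.partner = None  # the other node in the I/O group


def make_io_group():
    """Build two nodes that are each other's partner."""
    a, b = Node("node1"), Node("node2")
    a.partner, b.partner = b, a
    return a, b


def host_write(node, lba, data):
    """Cache the write, mirror it to the partner, then complete to the host."""
    node.cache[lba] = data
    node.partner.cache[lba] = data  # mirror before acknowledging
    return "complete"               # the host sees completion here, before any disk write


def destage(node, mdisk):
    """Later on, push cached writes down to the backend and drop both cache copies."""
    for lba in list(node.cache):
        mdisk[lba] = node.cache.pop(lba)
        node.partner.cache.pop(lba, None)
```

The point to notice is that `host_write` returns before the backend dictionary is touched, which is why the on-disk image alone says little about what the host has been told is written.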
So far nothing unusual, and certainly no rocket science. Chris asked about integrity, with respect to keeping in-order I/Os on the backend disks. Well, that’s not really how virtualization, or any caching disk controller system, works. Yes, in-order processing is guaranteed. That is, at any time an I/O request to the SVC will return the most recently written blocks for that request; however, this can come from cache or disk. The whole point of a cache is to keep the commonly referenced data in memory. Thus, something that is written often may not be destaged to disk for some time, as each ‘hit’ makes it rise to the top of the LRU list. As I say, this is business as usual for any caching disk controller. The question of coherency is covered by the mirroring of the data to the partner node. So if a hardware failure occurs on one node in the I/O group, the partner node contains a copy of the cache (dirty writes) and so will flush this data instantly to disk and continue to operate in write-through mode. SVC was designed around no single point of failure (SPoF), and while there is some redundancy built into a node, the pair of nodes that form an I/O group were designed to implement no SPoF at the higher node level. Power failures are handled via the battery backup that allows the node to be held up long enough to flush cache data to an internal disk.
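To make the LRU point concrete, here is a small generic sketch of a write-back cache. It says nothing about how SVC's cache is actually built: `WriteCache`, its `capacity` and the plain dictionary used as the disk are assumptions for illustration.

```python
from collections import OrderedDict


class WriteCache:
    """Generic LRU write-back cache (an illustration, not SVC's implementation)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity    # how many dirty blocks we may hold
        self.disk = disk            # dict standing in for the backend disk
        self.dirty = OrderedDict()  # least recently used first

    def write(self, lba, data):
        if lba in self.dirty:
            self.dirty.move_to_end(lba)  # a hit moves the block to the top of the list
        self.dirty[lba] = data
        while len(self.dirty) > self.capacity:
            old_lba, old_data = self.dirty.popitem(last=False)
            self.disk[old_lba] = old_data  # destage the coldest block

    def read(self, lba):
        if lba in self.dirty:
            self.dirty.move_to_end(lba)
            return self.dirty[lba]         # newest data is still in cache
        return self.disk.get(lba)          # otherwise it comes from disk
```

With a capacity of two, writing blocks 1, 2, 1 again and then 3 destages block 2 rather than block 1, so the disk holds block 2 while the hot block 1 has never been written out. Reads still return the latest data either way.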
Figure 2 – SVC MetroMirror processing
With MetroMirror (synchronous replication), writes are not completed back to the host until the remote site has also received the data into its cache. This is generally only recommended with SVC for distances of less than 300 km (local applications are directly impacted by the round trip latency). Fig2 shows how this works with SVC, the brown arrows showing the local host I/O. The data is mirrored into the local node caches and is replicated to the remote cluster, where it is again mirrored by the remote node caches before the acknowledgement is returned to the local cluster. Only then is the write completed to the host.
Chris’s question of integrity, or what I know as ‘consistency’, can be answered by considering both Fig2 and Fig1. Because the I/O is only completed back to the host after the remote site has acknowledged that the write has made it to the remote nodes’ cache, you can guarantee that should a disaster strike the local site, the remote site always has a consistent image that will pass application and filesystem data checking.
SVC provides consistency groups for all three major copy services: FlashCopy, MetroMirror and GlobalMirror. Therefore this consistency can span multiple vdisks.
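To put a rough number on that round trip, here is a back-of-envelope calculation. The 5 microseconds per kilometre figure is my assumption (light in fibre travels at roughly 200,000 km/s), and the function name is made up. It ignores switch, protocol and cache-mirroring overhead, so treat the result as a lower bound.

```python
def round_trip_ms(distance_km, us_per_km=5.0):
    """Propagation delay only, there and back, in milliseconds.

    us_per_km is an assumed figure for light in optical fibre.
    """
    return 2 * distance_km * us_per_km / 1000.0


print(round_trip_ms(300))  # 3.0
```

So at 300 km every synchronous write carries about 3 ms of extra latency before anything else is counted, which is a large addition to a write that would otherwise complete from cache.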
Figure 3 – SVC GlobalMirror processing
The only difference when using GlobalMirror is that writes are completed back to the host before the acknowledgement is returned from the remote cluster. Fig3 shows the completion is returned after the local nodes have committed the data to cache (yellow arrows). However, GlobalMirror does guarantee in-order delivery, so the remote copy is always consistent, but some number of seconds behind the primary. SVC implements a very close-in-time form of asynchronous replication.
However, after all that explanation, Chris’s proposal is not possible without an SVC at the remote site, which would bring the solution back in line with an industry-standard 3DC layout. The SVC cache, as Chris suggests, does not guarantee ‘on disk integrity’ unless the cache is first flushed to disk, so without an SVC at the remote site you would not have a consistent remote on-disk image. More importantly, due to the major risk of data miscompares, connecting the same disks to both SVC and directly to a host is not supported. An interesting idea though, Chris.
Added after initial post:
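Why does in-order delivery matter so much? The sketch below shows one generic way a secondary can stay consistent while lagging: it only applies writes in sequence order and holds back anything that arrives early. This is not a description of GlobalMirror's internals; the `Secondary` class and the sequence numbering are assumptions for illustration.

```python
class Secondary:
    """Applies replicated writes strictly in sequence order (illustrative only)."""

    def __init__(self):
        self.disk = {}      # lba -> data applied so far
        self.next_seq = 0   # next sequence number we are allowed to apply
        self.pending = {}   # writes that arrived ahead of their turn

    def receive(self, seq, lba, data):
        self.pending[seq] = (lba, data)
        # Apply every write that is now contiguous with what we already have.
        while self.next_seq in self.pending:
            apply_lba, apply_data = self.pending.pop(self.next_seq)
            self.disk[apply_lba] = apply_data
            self.next_seq += 1
```

If write 1 arrives before write 0, the secondary holds it back and its image stays at the earlier consistent point. Once write 0 turns up, both are applied in order, so the remote copy is always a state the primary actually passed through, just slightly in the past.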
I’ve just thought of a way this could theoretically be done using SVC if it supported both intra-cluster and inter-cluster replication of the same virtual disks (which today it doesn’t).
Please note, this is not supported or in any way recommended. I just thought it was worth pointing out that technically it is possible!
SVC does support two modes of replication: intra-cluster (within a single cluster) and inter-cluster (the normal mode of operation, to a remote peer cluster). If you were to set up the sites as per Fig4 then it would technically be possible.
Figure 4 – 3 Site DR with 2 clusters – in theory
So the vdisks at the local primary site need to be in ‘image mode’, where there is a 1:1 mapping of vdisk to mdisk – that is, no virtualization striping. The cache can be enabled here for these vdisks. You MetroMirror these disks to the local disaster recovery cluster, where the target disks are again cache-enabled image mode vdisks. SVC does not state any actual configuration requirements on the latency of disks; if an I/O takes more than 5 seconds we will start to send OTUR (Ordered Test Unit Ready). Therefore, in theory the disks could be very far away from the cluster. At the local disaster recovery cluster you make use of the intra-cluster GlobalMirror function and replicate to the remote disks at the third site. These vdisks would have to be cache-disabled image mode vdisks. Thus the copy at the third site would be consistent at some point in time in the past (a few seconds behind the local primary copy). You would have to ensure that you either did not mount these disks at the third site, or mounted them read-only, until such time as you wanted to activate this third site.
Yet again, let me just clarify that this is not something that is supported or is likely to be supported, but it is an interesting concept nonetheless.