RAID 5 Versus RAID 6

ORIGINALLY PUBLISHED 13th June 2016

Hi All,

I was at the US user group last week – and this topic came up quite a lot.  The main reason we started discussing R5 versus R6 is that the development team are currently recommending that all Distributed RAID arrays be configured as R6, and people remember the old “R6 is slower than R5” adage that has been around for some time.  So let’s consider the pros and cons:

Reads

Let’s start with reads, because that is simpler.  When a RAID array is fully redundant (no drives offline or rebuilding), a read from RAID 5 and a read from RAID 6 are basically identical – they both work out which drive contains the data and read that drive.
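
If you like to see that in code, here is a minimal sketch of the lookup, assuming a simplified rotating-parity layout (the drive count, strip size and layout rules are illustrative, not how Storwize actually lays data out):

  # Minimal sketch: resolving a read to a single drive in a rotating-parity
  # layout. The layout rules here are simplified and illustrative only.
  def locate_read(lba, n_drives, strip_size, n_parity=1):
      """Return (drive_index, drive_offset) for the data block at 'lba'."""
      data_drives = n_drives - n_parity            # data strips per stripe
      stripe = lba // (strip_size * data_drives)   # which stripe holds the LBA
      within = lba % (strip_size * data_drives)    # position inside the stripe
      data_index = within // strip_size            # which data strip
      offset = within % strip_size                 # offset inside that strip
      # Parity rotates one drive per stripe; data uses the remaining drives.
      parity = {(stripe + i) % n_drives for i in range(n_parity)}
      data_layout = [d for d in range(n_drives) if d not in parity]
      return data_layout[data_index], stripe * strip_size + offset

  # RAID 5 (one parity strip) and RAID 6 (two) do exactly the same work here:
  # one lookup, then one drive read.
  print(locate_read(lba=1000, n_drives=8, strip_size=128, n_parity=1))
  print(locate_read(lba=1000, n_drives=8, strip_size=128, n_parity=2))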

Writes

This is a bit more complicated, so let’s try to compare the two write code paths.  Note this gets a bit technical – feel free to skip to the summary.

There are two types of writes in RAID – “short writes” and “Full Stride Writes” (aka “Full Stripe Writes”).

Full Stride Writes

Full Stride Writes are cases where the Cache contains the data to overwrite all of the data blocks in a stripe.  If you don’t know what this means, then don’t worry too much – basically it’s the most efficient way of performing writes, but it only normally takes place when the application is doing sequential write workloads.

R5 Full Stride Write:

  1. Calculate P Parity
  2. Write Data and P Parity to all drives in the stride

R6 Full Stride Write

  1. Calculate P Parity
  2. Calculate Q Parity
  3. Write Data plus P and Q Parities to the drives in the stride
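
To make the parity step concrete, here is a small sketch of the two calculations, assuming the textbook RAID 6 construction where P is a plain XOR of the data strips and Q is a Reed-Solomon style parity over GF(2^8).  This is illustrative only, not the actual Storwize implementation:

  # Illustrative full stride parity calculation. P is a plain XOR; Q uses the
  # classic GF(2^8) construction. Not the actual array code.
  def gf_mul(a, b, poly=0x11d):
      """Multiply two bytes in GF(2^8) using the given reduction polynomial."""
      result = 0
      while b:
          if b & 1:
              result ^= a
          a <<= 1
          if a & 0x100:
              a ^= poly
          b >>= 1
      return result

  def full_stride_parity(strips):
      """Return (P, Q) for a list of equal-length data strips."""
      p = bytearray(len(strips[0]))
      q = bytearray(len(strips[0]))
      coeff = 1                                  # g^i for strip i, with g = 2
      for strip in strips:
          for j, byte in enumerate(strip):
              p[j] ^= byte                       # needed by RAID 5 and RAID 6
              q[j] ^= gf_mul(coeff, byte)        # the extra work RAID 6 does
          coeff = gf_mul(coeff, 2)
      return bytes(p), bytes(q)

  data = [bytes([i] * 16) for i in range(1, 7)]  # six data strips
  p, q = full_stride_parity(data)                # R5 writes data + P,
                                                 # R6 writes data + P + Q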

Short Writes

Short writes are anything smaller than a Full Stride Write.  This means that the RAID code needs to read the old data from a subset of the drives so that it can calculate what the new parity will be.

R5 Short Write:

  1. Read Data that is being overwritten
  2. Read P Parity
  3. Calculate new P Parity
  4. Write Data and P Parity

So that is two reads, two writes and one parity calculation.

R6 Short Write:

  1. Read Data that is being overwritten
  2. Read P Parity
  3. Read Q Parity
  4. Calculate new P Parity
  5. Calculate new Q Parity
  6. Write Data, P Parity and Q Parity

So that’s three reads, three writes and two parity calculations.
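
Here is a sketch of the read-modify-write update behind those counts, assuming the same textbook parity scheme as above (illustrative only).  The new parity is calculated purely from the old data, the old parity and the new data, which is exactly why the old blocks have to be read back first:

  # Short write sketch: the two reads have already fetched the old data and
  # old P; the new P is folded together from those plus the new data.
  def xor_bytes(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  def raid5_short_write(old_data, old_p, new_data):
      delta = xor_bytes(old_data, new_data)   # what changed in the data block
      new_p = xor_bytes(old_p, delta)         # fold the change into P
      return new_p                            # then write new data and new P

  # RAID 6 repeats the same pattern for Q using the GF(2^8) coefficient of
  # the updated strip: one extra read (old Q), one extra calculation, and one
  # extra write (new Q).
  old_data, old_p, new_data = bytes(8), bytes([0x0F] * 8), bytes([0x55] * 8)
  print(raid5_short_write(old_data, old_p, new_data).hex())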

Comparison Summary

There is no difference for Reads.

The only difference for Full Stride Writes between R5 and R6 is a small amount of additional CPU.

For Short Writes, R6 uses more CPU and 50% more drive operations than R5 (six versus four per write).
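
If you want to put rough numbers on that, here is a back-of-envelope sketch that treats the drive operation counts above as the classic write penalties (4 for R5, 6 for R6).  The drive count, per-drive IOPS and read/write mix are made-up assumptions, not measurements:

  # Back-of-envelope: how the extra drive operations show up as host IOPS for
  # a small-block random workload, ignoring cache effects. Assumed numbers.
  def host_iops(n_drives, drive_iops, read_fraction, write_penalty):
      total = n_drives * drive_iops
      # A host read costs one drive op; a host write costs 'write_penalty'.
      cost_per_host_io = read_fraction + (1 - read_fraction) * write_penalty
      return total / cost_per_host_io

  drives, per_drive, reads = 24, 200, 0.7   # assumed 10K-class drives, 70% reads
  print("R5:", round(host_iops(drives, per_drive, reads, write_penalty=4)))
  print("R6:", round(host_iops(drives, per_drive, reads, write_penalty=6)))

With that particular made-up mix, the R6 array delivers roughly three quarters of the small-block IOPS of the R5 array.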

Conclusion

So we’ve seen that the old adage is true – R6 is slower than R5 – and now you know exactly why.  However, this is not the whole story.

What does it matter if an R6 write takes longer per IO than an R5 write?  The write response time is hidden from the hosts by the write cache.  Unless you are doing large continuous writes that fill up the cache, this won’t be an issue, because the cache hides the latency.  And by the way, the type of workload that fills the cache up is often full stride writes anyway, which is the best case.

Most customers I have seen running Storwize systems are using very little CPU unless they are doing lots of replication to another system.  So the additional CPU is not likely to be a major concern for many customers.

So the biggest thing here is the number of IOs which the drives have to handle.  It is definitely true that R6 will achieve fewer IOs per second than RAID 5 when using the same number of drives.  But if you use RAID 5, there is a much higher likelihood that a double drive failure will cost you data, especially with large capacity drives, whose long rebuild times widen the window for a second failure.

So my overall summary is this: use R6 for everything over 1TB in size.  If performance is a concern, buy more drives or more powerful systems with faster CPUs.  Surely the cost of downtime and of lost data is much higher than the cost of a few additional drives!

Final Thought – DRAID

So after all this, you may be wondering why the development team is recommending RAID 6 for all DRAID arrays.  The answer is this: whilst DRAID drastically reduces the rebuild time, and with it the window during which a double drive failure would be disastrous, it also massively increases the number of drives in the array, which increases the likelihood of a second drive failing in the first place.

So even with the fast rebuild time, you don’t get away from the possibility of a double drive failure.
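
To put the trade-off in rough numbers, here is a back-of-envelope sketch using a simple exponential failure model.  The drive counts, MTBF and rebuild times are made-up assumptions, not vendor figures:

  # Rough model: probability that at least one surviving drive fails while the
  # first rebuild is still running. All numbers below are assumptions.
  import math

  def p_second_failure(surviving_drives, rebuild_hours, mtbf_hours):
      per_drive = 1 - math.exp(-rebuild_hours / mtbf_hours)
      return 1 - (1 - per_drive) ** surviving_drives

  MTBF = 1_200_000   # hours, assumed drive MTBF

  # Traditional 8-drive array, slow rebuild of a large drive:
  print(p_second_failure(surviving_drives=7, rebuild_hours=24, mtbf_hours=MTBF))
  # Distributed RAID: far faster rebuild, but far more drives exposed:
  print(p_second_failure(surviving_drives=79, rebuild_hours=2, mtbf_hours=MTBF))

With these made-up numbers the faster rebuild roughly cancels out the larger number of exposed drives, but the probability never gets anywhere near zero, which is exactly why the recommendation stays R6.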
