Some details about how DRAID works

ORIGINALLY PUBLISHED 17th March 2016

Hi,

There have been a number of DRAID questions on the forums recently, so we thought it was worth giving you some extra details about the benefits of DRAID. Ian Boden has provided the following information, and if you ask nicely he may come back and give us some more details in the future. Please feel free to ask additional questions in the comments and we will do our best to answer them.

You may also be interested in this question in the forum about multiple drive failures: https://www.ibm.com/developerworks/community/forums/html/topic?id=907358ae-9320-43cf-9a12-709f2555872d

Hope this is helpful

Andrew

Hello all,

By popular demand (well, one person mentioned it in passing) I’m here with a guest post to talk a little about Distributed RAID and performance.

For those who don’t know who I am, I’ve worked on the SVC code for about 8 years now, and for the majority of that time I have been working on performance-sensitive pieces of code. I spent a long time in cache, first modifying it to work better with RAID for when we launched the original Storwize V7000 and then working on the cache re-architecture. I then did a bit of work on compression before moving on to distributed RAID.

What is DRAID

Distributed RAID was launched in 7.6.0 and allows a RAID5 or RAID6 array to be distributed over a larger set of drives. Previously, if you created a RAID5 array over 8 drives, the data was striped across them, with each stripe having a data strip on 7 of the drives and a parity strip on the 8th. In distributed RAID5 you specify the stripe width and the number of drives separately, so you can still have 7 data strips protected by a parity strip, but the 8 drives holding each stripe are selected from a larger set of, say, 64. On top of that we added distributed sparing: instead of having a spare sat on the side that isn’t being used, each drive in the array gives up some of its capacity to make a spare.
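To make the idea of picking 8 drives out of 64 concrete, here is a minimal Python sketch. It is not the real DRAID layout algorithm, which isn’t described in this post; it simply assumes each stripe lands on a pseudo-randomly chosen set of drives, and the names `place_stripes` and `rebuild_readers` are made up for illustration. It counts how many surviving drives share a stripe with a failed drive, which is the set of drives that would supply rebuild reads.

```python
import random

def place_stripes(stripe_count, stripe_width, drive_count, seed=0):
    """Assign each stripe to `stripe_width` distinct drives (illustrative layout only)."""
    rng = random.Random(seed)
    return [rng.sample(range(drive_count), stripe_width) for _ in range(stripe_count)]

def rebuild_readers(layout, failed_drive):
    """Return the surviving drives that share at least one stripe with the failed drive."""
    readers = set()
    for drives in layout:
        if failed_drive in drives:
            readers.update(d for d in drives if d != failed_drive)
    return readers

# Traditional RAID5: 8 drives, every stripe uses all of them.
traditional = place_stripes(stripe_count=2000, stripe_width=8, drive_count=8)
# Distributed RAID5: same stripe width, but 64 drives to choose from (assumed example).
distributed = place_stripes(stripe_count=2000, stripe_width=8, drive_count=64)

print(len(rebuild_readers(traditional, failed_drive=0)))  # 7 surviving drives
print(len(rebuild_readers(distributed, failed_drive=0)))  # many more, at most 63
```

With the traditional layout only 7 drives can ever help with the rebuild; with the distributed one, nearly every surviving drive holds something the rebuild needs, so the reads are spread much more widely.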

Rebuild Performance

The main reason for distributed RAID is to improve rebuild performance. When a drive fails, the data from that drive has to be rebuilt from the surviving drives and written to a spare. By having a larger set of drives in the array, those rebuild reads come from more drives, and distributed sparing means the writes go to a larger set of drives too. Reading from a small set of drives and writing to a single drive is what causes rebuilds to take a long time in traditional RAID, especially if the drive you are writing to is a 4TB nearline drive. When I say a long time, I mean over 24 hours. With RAID5 a second drive failure during a rebuild means the array goes offline, and if the drive can’t be resuscitated the entire array needs to be restored from backup. RAID6 copes with a second concurrent failure, but a third is going to be terminal. Some products offer RAID7 to cope with 3 failures, but it is all just delaying the inevitable as the drives get bigger and rebuilds take longer.

As for how much the rebuild time is reduced by, I don’t have any official figures. What I can say is that some configurations result in the rebuild completing in one tenth of the time an equivalent rebuild would have taken using traditional RAID.
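As a rough back-of-the-envelope illustration of why writing to a single spare is the bottleneck, the sketch below works out how long it takes just to write a 4TB drive’s worth of data to one target. The 45 MB/s sustained write rate is an assumed figure for a nearline drive that is also serving host IO, not a measurement, and the one-tenth factor is simply the best case mentioned above applied to that estimate.

```python
TB = 10**12
MB = 10**6

def rebuild_hours(capacity_bytes, write_rate_bytes_per_s):
    """Hours needed to write `capacity_bytes` to a single target at a steady rate."""
    return capacity_bytes / write_rate_bytes_per_s / 3600

# Assumed: 4TB nearline drive, 45 MB/s sustained rebuild write rate.
traditional = rebuild_hours(4 * TB, 45 * MB)
# Best case quoted above: one tenth of the traditional rebuild time.
distributed_best_case = traditional / 10

print(round(traditional, 1))            # 24.7
print(round(distributed_best_case, 1))  # 2.5
```

Different assumed write rates move the numbers around, but the shape stays the same: the traditional rebuild is bounded by what one drive can absorb.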

Number of Drives per DRAID Array

So one of the key decisions anyone has to make is how many drives to put into an array. As you increase the number of drives the rebuild time shortens, but the improvement isn’t linear and it doesn’t go on forever: you start to hit other limits in the system. Our testing shows that about 64 is the sweet spot for spinning disks. The GUI will recommend between 40 and 80, assuming that you have at least 40 of the drive class you want to use. Typically I would suggest going with what the GUI recommends.

One of our recommendations is that you have homogeneous pools; having different drive classes within the same tier is not a great idea. Unfortunately, the current GUI implementation is a little overzealous, and so can start disallowing some things that are fairly reasonable, such as adding a second array with drives that are slightly larger than the ones that make up the current array in the pool. For now you have to resort to the CLI to work around this behavior.

There are times when you will want to override the GUI recommendations. The main reason is likely to boil down to the current limitation that arrays can’t be expanded by adding new drives, so you might want to think about what you will do when the pool runs out of space. If it’s a case of buying a new expansion with 24 drives then you might want to use 24 drives per array, or you might want to look at buying 2 expansions and adding 48 drives when it gets close to capacity.

Array Performance

Then we get onto the more complicated topic of performance. Some people throw around the “rule of 4” as a commandment that cannot be disobeyed. If you aren’t aware of the rule of 4, it stems back to the fact that V7000 gen1 has 4 cores for processing IO. Each volume gets assigned to a core and each array gets assigned to a core, so a single volume using a single array could be using only 1 core for the majority of the IO processing. If you are in that situation and you have a performance critical system, and you have a workload that may be limited by the processing power, then you might want to use 4 arrays. However, with distributed RAID came some improvements that allow some of the IO processing for an array to be done outside of the assigned core, so the problem is reduced. There are also performance advantages to having a single distributed RAID array. If your system happens to fall nicely into having 4 arrays then that’s great, but going out of your way to contort it to fit the rule is just going to give you more overheads.

The reason why one large array can improve performance comes down to how the SVC code allocates extents from a pool to a volume. By default those extents are now 1GB, which means that while you are reading or writing within a 1GB extent you are only using the drives in one MDisk. For a random workload you would expect multiple extents to be active across all of the MDisks so all the drives are in use, but for a sequential workload you tend to hit one extent rather hard and then move on to the next. The cache tries to improve things, but it’s still very easy to get a situation where some drives are sat idle. A distributed array uses every drive in the array for every extent, so getting all drives active is much easier.
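The effect is easy to see with a toy model. The sketch below assumes extents are handed out round-robin across the MDisks in a pool (a simplification for illustration; `active_drive_fraction` is a hypothetical helper, not an SVC interface) and asks what fraction of the pool’s drives sit behind the extent that a sequential stream is currently working on. The pool shapes are example values only.

```python
GIB = 1024**3
EXTENT_SIZE = 1 * GIB  # default extent size mentioned above, taken here as 2**30 bytes

def active_drive_fraction(offset_bytes, mdisk_drive_counts, extent_size=EXTENT_SIZE):
    """Fraction of the pool's drives behind the extent containing `offset_bytes`."""
    extent_index = offset_bytes // extent_size
    mdisk = extent_index % len(mdisk_drive_counts)  # assumed round-robin allocation
    return mdisk_drive_counts[mdisk] / sum(mdisk_drive_counts)

traditional_pool = [8] * 8   # eight traditional arrays of 8 drives each
distributed_pool = [64]      # one distributed array over the same 64 drives

offset = 5 * GIB + 123  # somewhere inside the sixth extent of a sequential stream
print(active_drive_fraction(offset, traditional_pool))  # 0.125
print(active_drive_fraction(offset, distributed_pool))  # 1.0
```

In this model a sequential stream keeps only one eighth of the traditional pool’s drives busy at any moment, while the single distributed array has every drive behind every extent.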

Having said all that, when it comes to SSDs things are different. With a sizeable number of drives in an array, you start to hit other limits in the system before you become drive limited. There are plans afoot to overcome those limitations, but for 7.6.0 I’d recommend keeping a distributed SSD array to 20 drives or fewer.

Rebuild Areas

One final topic is the number of rebuild areas. A rebuild area is equivalent in capacity to a single drive, so the more rebuild areas you have, the more drives can fail one after another. The number of rebuild areas you want is a mix of how many drives you have, how important the data is and how quickly you want to replace a failed drive. Once a drive has been replaced, the data then gets copied back from all the spare spaces to the replaced drive, so this is another case of writing to a single drive, which can take a few days on the really large nearlines. The copy back time therefore needs to be taken into account: replacing a drive doesn’t immediately give you back the redundancy. My advice would be to go with the default suggested by the GUI, then add an extra one if the data is critical (but don’t use the fact you are using extra rebuild areas to drop down from RAID6 to RAID5) or if you want to have some leeway to batch up replacing failed drives.
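Since each rebuild area costs roughly one drive’s worth of capacity, the price of an extra one can be estimated with simple arithmetic. The sketch below is an approximation that ignores metadata and formatting overheads; the drive count, stripe width and drive size are example values rather than recommendations, and `usable_tb` is a made-up helper.

```python
def usable_tb(drive_count, drive_tb, stripe_width, parity_strips, rebuild_areas):
    """Approximate usable capacity of a distributed array in TB (overheads ignored)."""
    data_fraction = (stripe_width - parity_strips) / stripe_width
    return (drive_count - rebuild_areas) * drive_tb * data_fraction

# Example: 64 x 4TB nearline drives, RAID6 with 10 data + 2 parity strips per stripe.
with_two = usable_tb(64, 4, stripe_width=12, parity_strips=2, rebuild_areas=2)
with_three = usable_tb(64, 4, stripe_width=12, parity_strips=2, rebuild_areas=3)

print(round(with_two, 1))               # 206.7
print(round(with_three, 1))             # 203.3
print(round(with_two - with_three, 1))  # 3.3
```

In this example the extra rebuild area costs a little over 3TB of usable space out of roughly 200TB, which is the kind of figure to weigh against how critical the data is.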