Software RAID in 2.4
Neil Brown
Abstract:
With 2.4.0, software RAID in Linux will come of age.
The functionality that previously was only available through
a separate patch set will finally be part of a ``Stable'' kernel
release. However for much (hopefully not all) of the 2.4.0-test
series, RAID5 performance was extremely poor.
This talk will look at how differences between 2.2 and 2.4 affected
software RAID, and how the various issues were (finally) dealt with.
In particular, it will look at RAID5 performance and what was
slowing things down.
1 Introduction
The Linux kernel contains a device driver called ``MD'', for ``Multiple
Disk''. It allows several disc drives (or other devices) to be joined
together to give the appearance of a single device that is either
larger, or more reliable, or both. The virtual device presented is
often referred to as a RAID, or Redundant Array of Independent Drives.
When the then-latest version of the MD driver (also referred to as
Software RAID) was incorporated into the 2.3 development series of the
Linux kernel, it had some functionality problems and the RAID5 module
also had substantial performance problems. These were largely due to
changes in other parts of the Linux kernel that were made early in the
2.3 series.
The rest of this paper takes a brief look at the history of Software
RAID in the Linux kernel, and then looks at the specific changes that
caused problems, why they caused problems, and what was done about them.
2 History
The MD (multiple disk) driver which provides the Software RAID
functionality first appeared in the Linux kernel in release 1.3.69.
This was version 0.34 of the MD driver and only supported the RAID0 and linear (drive concatenation) personalities.
The version number increased to 0.35 in 1.3.93 with some substantial
changes to the interface between the MD driver and the block-device
subsystem (ll_rw_blk.c).
In release 2.1.63, and then 2.0.35 (for the stable series), RAID1 and
RAID5 personalities were added as part of an MD driver upgrade to
0.36. The modules provided basic support for these RAID arrays
including survival of a drive failure, but they did not include online
rebuilding of redundancy. To rebuild a drive or check parity for
RAID5, the array would have to be taken off-line, and a user-level
program run to do the work.
For some time after this, Ingo Molnar maintained a set of RAID patches
separate from the main kernels. This patch set gained many
improvements including on-line rebuild, a more robust super-block
format, and other structural improvements. However it was never
included in the 2.2 series kernels, partly because it made some
changes to the interface with user-level tools that were not backwards
compatible. Nevertheless, many people used this patch set and were
very happy with it. When discussing 2.2 RAID performance below, it
will always be with reference to 2.2 with this patch set applied.
This RAID code is version 0.90 of the MD driver.
Much of Ingo's work was incorporated into the development kernel in
2.3.43. However on-line rebuilding did not work for reasons which are
explained below. In 2.3.99pre8, new rebuild code that I had written
was incorporated, and a fairly mature RAID subsystem was finally
available in a kernel.org kernel.
Since then, numerous small improvements and bug fixes have gone in,
and it is very likely that we will have a fully functional and
efficient RAID subsystem in 2.4.0-final.
3 The buffer cache gives way to the page cache
One of the significant changes between 2.2 and 2.4 happened quite early
in the 2.3 development series. This involved the cache in which
filesystems stored data.
In 2.2, filesystem data was stored in the buffer cache. This cache is
indexed by device number and physical block on that device. To find a
particular block of a file in the buffer cache, you have to look up
the file's meta-data to find where on the device that block is, and
then look up that address in the buffer cache.
In 2.4, filesystem data has largely been moved into the page cache.
The page cache is indexed by an ``address_space'' and an offset within that
address_space. An address_space typically corresponds to a file.
Each inode contains a pointer to the associated address space.
This allows direct access to a particular block of a file, rather
than having to look up addresses in the file's meta-data.
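
The difference in indexing can be sketched as follows. This is a
minimal user-space illustration with made-up structures, not the
actual kernel interfaces:

    /* Minimal user-space illustration of the two indexing schemes; the
     * structures are made up, not the real 2.2/2.4 kernel interfaces. */
    #include <stdio.h>

    /* 2.2-style buffer cache key: device number plus physical block.
     * Finding a cached file block first requires a walk of the file's
     * metadata to learn which physical block holds it. */
    struct buffer_key { int dev; long phys_block; };

    /* 2.4-style page cache key: the file's address_space plus a page
     * index.  The file offset itself is the key, so no metadata walk
     * is needed to find cached data. */
    struct page_key { void *mapping; unsigned long index; };

    int main(void)
    {
        unsigned long offset = 40960;        /* byte 40960 of some file */

        /* Buffer cache: offset -> metadata walk -> physical block ->
         * lookup.  The physical block number here is invented. */
        struct buffer_key bk = { 0x0801, 123456 };

        /* Page cache: offset -> page index -> lookup, no metadata walk. */
        struct page_key pk = { NULL, offset / 4096 };

        printf("buffer cache key: dev %#x, block %ld\n", bk.dev, bk.phys_block);
        printf("page cache key: page %lu of the file's address_space\n", pk.index);
        return 0;
    }
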
Some filesystem data, particularly meta-data, does not fit neatly into
the address_space concept, and this is still kept in the buffer
cache. However meta-data is only a small fraction of the data that a
filesystem caches, so in 2.4 very little filesystem data is stored in
the physically addressable buffer cache.
This has two effects on the RAID code, one that relates to RAID5 only,
and one that relates to both RAID1 and RAID5 --- the two RAID
personalities that provide redundancy.
3.1 RAID5 snooping for write performance
In RAID5, blocks of data are logically gathered, one from each device,
into stripes. For each stripe, one block is deemed the ``parity''
block. All other blocks store real data. The parity block
stores a logical ``exclusive-or'' of the data blocks. Thus if any one
device is lost, the data that was on that device can be discovered
using the data on the working drives, and the parity block.
Thus, whenever updating any data block, the parity block in the same
stripe must be updated as well. Somewhat simplistically, this leads to
two physical updates for each logical update.
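
The arithmetic behind those two physical updates can be made concrete.
Below is a minimal, self-contained sketch (with an arbitrary toy block
size) showing that the new parity is simply the old parity XORed with
the old and new data, and that any single block can be reconstructed
from the rest:

    /* Minimal sketch of a RAID5 single-block update: new parity is
     * old_parity ^ old_data ^ new_data, so one logical write costs a
     * read of the old data and parity plus two physical writes. */
    #include <stdio.h>

    #define BLOCK 8                     /* toy block size in bytes */

    static void xor_block(unsigned char *dst, const unsigned char *src)
    {
        for (int i = 0; i < BLOCK; i++)
            dst[i] ^= src[i];
    }

    int main(void)
    {
        unsigned char d0[BLOCK] = "AAAAAAA", d1[BLOCK] = "BBBBBBB";
        unsigned char parity[BLOCK] = { 0 };

        xor_block(parity, d0);          /* parity = d0 ^ d1 */
        xor_block(parity, d1);

        unsigned char new_d0[BLOCK] = "CCCCCCC";

        /* Read-modify-write: remove the old data, add the new data. */
        xor_block(parity, d0);          /* parity ^= old d0 */
        xor_block(parity, new_d0);      /* parity ^= new d0 */

        /* Reconstruction check: d1 must equal parity ^ new_d0. */
        unsigned char rebuilt[BLOCK] = { 0 };
        xor_block(rebuilt, parity);
        xor_block(rebuilt, new_d0);
        printf("rebuilt d1 = %.7s (expected BBBBBBB)\n", (char *)rebuilt);
        return 0;
    }
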
If several logical updates within the same stripe can be gathered
together and performed at the same time, then only one parity update
is needed for the whole stripe. Though this is a bit of an
over-simplification, it is clear that gathering updates for a given
stripe together will improve write performance.
The RAID5 code for 2.2 did this by looking in (snooping on) the buffer
cache. As the buffer cache is physically addressed, when RAID5 needed
to write a block in a given stripe, it could first check the buffer
cache for any other blocks in that stripe which were marked dirty ---
but not yet scheduled for writing --- and proactively write them.
This led to substantial speed-ups in write throughput.
While the buffer cache is still present in 2.4, and the RAID5 code can
still look in it for dirty data, not much is actually stored in the
buffer cache, so RAID5 will not often find anything.
This means that an optimisation that was very effective in 2.2 has
become effectively useless in 2.4. A different approach to
clustering writes needs to be found.
3.2 Rebuilding redundancy through the buffer cache
The second way that the RAID code for 2.2 utilised the buffer cache
was when rebuilding redundancy, whether copying a mirror to a spare,
or re-calculating parity on a RAID5 array, or when reconstructing a
failed drive onto a spare for RAID5.
The basic approach that was taken for rebuilding redundancy was to
read and re-write the entire array. The read process would find the
correct data, either by reading from a valid drive or by reading all
other drives in a RAID5 stripe and calculating the missing block. The
write process would then update all drives appropriately.
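
In outline, the old rebuild loop amounted to something like the
following; this is a simplified sketch with invented helper names, not
the actual kernel code:

    /* Simplified sketch of the 2.2-era rebuild: read the correct data
     * for every block, then write it back to every drive that should
     * hold it.  The helpers are invented for illustration. */

    #define NBLOCKS 1000000L

    /* Read from a good drive, or, for RAID5 with a failed drive, read
     * the surviving blocks of the stripe and XOR them together. */
    void read_correct_data(long block, char *buf);

    /* Write the block (and any parity) back to every relevant drive,
     * including a spare being rebuilt. */
    void write_all_copies(long block, const char *buf);

    void rebuild_array(void)
    {
        char buf[4096];
        for (long block = 0; block < NBLOCKS; block++) {
            read_correct_data(block, buf);
            /* Window of danger: if the filesystem writes this block
             * now, the write below could overwrite new data with old. */
            write_all_copies(block, buf);
        }
    }
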
During this process there is obviously a time lag between reading the
data and writing it back out again. If, during that time lag, an
active file system that was using the RAID array chose to write out
new data, there is a potential for the old data written by the
rebuilding process to over-write the new data written by the
filesystem. That would obviously be a problem.
In 2.2, this potential problem was not realised because the buffer
cache provided the necessary synchronisation. If a filesystem chose
to write to a block that was being used in the rebuild process, it
would write to exactly the same buffer that the rebuild process was
using, so there would be no risk of over-writing new data.
At least, that was true for filesystems. For RAID arrays used for
swapping the same synchronisation was not available. Even in 2.2,
swap I/O did not go through the buffer cache, so swapping to a RAID
array that was in the process of rebuilding could cause data loss.
This led to a strong recommendation that RAID NOT be used for swap,
or if it was, swap should not be enabled on a RAID array that was
being rebuilt.
For 2.4, the rebuild process has been substantially re-written. Much
of the detail of the process has been pushed down into the individual
personalities (raid1.c and raid5.c), which are in a better position to
synchronise things properly.
For RAID1, the possibility of over-writing new data with old is
avoided by creating a small sliding window in which the re-build
happens. Normal I/O that is targeted within the window is blocked
until the window has moved on. The leading edge of the window does
not progress until it can be sure that there are no outstanding I/O
requests in the next section, and the trailing edge does not progress
until all rebuild requests in the tail section have completed.
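
A rough sketch of that window logic, with invented names and a
user-space flavour rather than the actual raid1.c code, is:

    /* Illustrative model of the RAID1 resync window.  Normal I/O aimed
     * inside the window must wait; the window's edges only advance
     * when it is safe.  Names and granularity are invented. */

    struct resync_window {
        unsigned long tail;        /* first sector still being rebuilt        */
        unsigned long head;        /* first sector beyond the rebuild window  */
        int io_pending_next;       /* normal I/O outstanding in next section  */
        int rebuild_pending_tail;  /* rebuild I/O outstanding in tail section */
    };

    /* A normal request for 'sector' must block while it lies in the window. */
    int must_wait(const struct resync_window *w, unsigned long sector)
    {
        return sector >= w->tail && sector < w->head;
    }

    /* The leading edge may only advance when no normal I/O is
     * outstanding in the section it is about to cover. */
    int may_advance_head(const struct resync_window *w)
    {
        return w->io_pending_next == 0;
    }

    /* The trailing edge may only advance when every rebuild request in
     * the tail section has completed. */
    int may_advance_tail(const struct resync_window *w)
    {
        return w->rebuild_pending_tail == 0;
    }
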
For RAID5, the synchronisation was somewhat easier to achieve, though
possibly not as easy to describe. RAID5 maintains a ``stripe cache''
which buffers some or all of the data in recently accessed stripes.
Each new I/O request is attached to the appropriate stripe, and that
stripe is handled as a whole. In the same way, rebuild requests are
attached to the appropriate stripe, and all reading, writing, and
rebuilding of a stripe are handled together, thus avoiding possible
conflict.
4 The Big Kernel Lock gives way to finer grains
The second big change that came during the 2.3 development series was
the reduction in using the Big Kernel Lock (BKL) to protect code against
concurrent execution on SMP machines.
In 2.2, all file I/O was, by default, executed under the BKL so that
there were no SMP related concurrency issues. All that changed in
2.3, and it became necessary for each filesystem and each device driver
to do its own locking.
This was not much of an issue for the linear, RAID0, and RAID1
personalities, as they treat each request independently. For the most
part, the only data-structures that are shared by different requests
are read-only data-structures that describe the whole array. Even
where shared write access is required, such as allocating buffer_heads
from a pool in RAID1.c, a simple spin lock around the code is
adequate.
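
For example, the kind of locking that suffices here is no more than
the following sketch; the pool structure and function are illustrative,
not the actual raid1.c code:

    /* Sketch of a shared buffer_head pool protected by a spinlock; the
     * structure and function are illustrative, not actual raid1.c code. */
    #include <linux/spinlock.h>
    #include <linux/fs.h>

    struct bh_pool {
        spinlock_t lock;
        struct buffer_head *free_list;    /* linked through b_next */
    };

    struct buffer_head *pool_get(struct bh_pool *pool)
    {
        struct buffer_head *bh;
        unsigned long flags;

        /* One short critical section protects the only writable shared
         * state; everything else touched per-request is read-only. */
        spin_lock_irqsave(&pool->lock, flags);
        bh = pool->free_list;
        if (bh)
            pool->free_list = bh->b_next;
        spin_unlock_irqrestore(&pool->lock, flags);
        return bh;
    }
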
However with RAID5, the situation is much more complex.
As mentioned earlier, RAID5 maintains a cache of recently accessed
stripes, and attaches new requests to the appropriate stripe. This
means that access to each stripe needs to be suitably protected with a
lock of some sort.
Further, each stripe may be in one of various states, possibly reading
old data to perform a parity calculation, or reading data to service
requests, or writing out new data and parity. Establishing the best
model to accommodate all of these states and all possible transitions
is not straightforward.
At the time when all these changes were happening in 2.3, Ingo Molnar,
the maintainer of the RAID patches for 2.2, was busy with other
interesting things, particularly the Tux HTTP server. Though he obviously
wanted to keep RAID5 working in the 2.3 kernels, he did not (as far as
I can tell) have the time to give it the attention that it needed.
While he managed to successfully port the RAID code into 2.3, the
locking scheme for the RAID5 stripe was imperfect, as we shall see in
the next section.
5 What went wrong with RAID5
Though the software RAID code was successfully ported to the 2.3
development kernels, and was included in the kernel.org releases from
about 2.3.50 onwards, performance for RAID5 was quite poor.
In this section I will look at RAID5 performance, why it was poor, and
my experience in trying to improve it.
But first, it will pay to look a bit at how I chose to measure
performance.
Ultimately, the only measure of performance that is important is ``How
quickly does it do the job that I want it to do?'' However that is
hard to measure or control reliably.
Instead, I followed the example of Gary Murakami [1] and ran a few
simple tests over a variety of configurations. This allowed me to see how
different aspects of the configuration affected different aspects of
performance.
The different configurations that I tested were arrays with between 2
and 7 drives, and using chunk sizes ranging from 4K to 256K in powers
of 2. For each of these configurations I measured sequential read and
write performance using bonnie and random filesystem performance
using dbench. I did this on both RAID5 configurations and RAID0
configurations. As RAID0 uses a similar layout to RAID5, it provided
a means to isolate performance issues that were specific to RAID5,
rather than general to the hardware.
To give an idea of the sort of data that resulted, figure 1 shows a
graph of RAID0 sequential read throughput for 2.4.0-test10 using
varying chunk sizes, and figure 2 shows RAID0 and RAID5 sequential write
throughput for 2.4.0-test10 and 2.2.18 plus RAID patches for two
different chunk sizes.
Figure 1: RAID0 Read Throughput in 2.4 with various chunk sizes
Figure 2: RAID0 and RAID5 Write throughput in various kernels with a 4K
and a 64K chunk size
5.1 Over-contentious locks
As can be seen from figure 2, while RAID0 write performance has gone
up for 2.4, RAID5 write performance has dropped down badly. Further,
while the performance for all other configurations tends to improve
with more drives, the performance for RAID5 in 2.4 is either stable or
drops off slightly for more drives.
A large part of the reason for this is the over-cautious stripe
locking that was introduced in the 2.3 port of the RAID5 code.
The effect of this locking on the RAID5 code was that as soon as one
I/O request was attached to a stripe, that stripe was locked until the
request had been completely dealt with. This substantially reduced
opportunities for parallelism between the multiple drives.
For write requests, the RAID code would still snoop on the buffer
cache to find more buffers to write in a single stripe, but for
reasons mentioned above, it would almost never find them. This
essentially meant that RAID5 would write a single chunk, and then wait
for that chunk to complete before writing the next chunk.
Read performance is affected in a similar way for similar reasons.
The first performance improvement I needed to make was to free up the
locking on the stripe cache.
Though it is hard to be certain, I believe that the reason that it had
been difficult to get locking right on the stripe cache was related to
the explicit state-machine nature of the code.
When writing to a RAID5 array, a number of steps are needed to
complete the write:
- begin-write: This is the initial state, when buffers with pending
write requests are attached to the stripe.
- read-old: This is an optional, though common, state where old data
or parity blocks are read from the devices to enable parity
calculations.
- write: In this state, data and parity are actually being written to
the devices.
- complete: This is the final state of writing, when the write
requests are returned to the callers.
These states for writing, and similar states for reading, were encoded
using two per-stripe variables: cmd, which could be STRIPE_NONE,
STRIPE_WRITE, or STRIPE_READ, and phase, which could be one of
PHASE_BEGIN,
PHASE_READ_OLD,
PHASE_WRITE,
PHASE_READ, or
PHASE_COMPLETE.
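
In outline, the explicit state looked something like the following; this
is a simplified rendering, not a verbatim excerpt of the old raid5.c:

    /* Simplified rendering of the old, explicit per-stripe state; this
     * is illustrative, not a verbatim excerpt of the 2.3-era raid5.c. */
    enum stripe_cmd   { STRIPE_NONE, STRIPE_WRITE, STRIPE_READ };
    enum stripe_phase { PHASE_BEGIN, PHASE_READ_OLD, PHASE_WRITE,
                        PHASE_READ, PHASE_COMPLETE };

    struct stripe_state {
        enum stripe_cmd cmd;      /* what the stripe is currently doing   */
        enum stripe_phase phase;  /* how far through that operation it is */
        /* buffers, locks and device pointers omitted */
    };
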
Having explicit state like this suggested that well defined state
transitions would be appropriate. This means that new requests could
not simply be added at any time, as that would require an irregular
state change. I believe that this fact made the model for
locking the stripes too complicated.
The approach that I took was to get rid of the explicit recording of
state altogether. There are still a number of states that need to be
handled, and a number of state transitions, but these are now less
explicit, and more natural.
The state of a stripe is implied by the state of all the cache
buffers, whether they are up-to-date, or dirty, or locked. Each time
a stripe is considered for action, the next action is determined by
the composite state of the various buffers. This means that new
requests can be accepted at any time, except during a very small window
that is protected by a spinlock.
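
The replacement logic can be sketched roughly as follows; each pass over
a stripe inspects only the per-buffer flags and decides what to do next.
The names and structure are invented for illustration, not the actual
raid5.c code:

    /* Illustrative sketch: a stripe's next action is derived purely
     * from the composite state of its buffers. */

    #define B_UPTODATE 0x1    /* buffer holds current data              */
    #define B_DIRTY    0x2    /* buffer has data waiting to be written  */
    #define B_LOCKED   0x4    /* I/O on this buffer is in flight        */

    struct toy_stripe {
        int nbuf;             /* one buffer per device in the stripe    */
        int flags[16];
    };

    enum next_action { WAIT, PREREAD_OLD, WRITE_OUT, IDLE };

    /* Called whenever anything about the stripe changes; because no
     * explicit phase is recorded, new requests can be merged in at
     * almost any time and simply alter what the next pass decides. */
    enum next_action handle_stripe(const struct toy_stripe *s)
    {
        int dirty = 0, missing = 0, in_flight = 0;

        for (int i = 0; i < s->nbuf; i++) {
            if (s->flags[i] & B_LOCKED)
                in_flight++;
            if (s->flags[i] & B_DIRTY)
                dirty++;
            if (!(s->flags[i] & B_UPTODATE))
                missing++;
        }

        if (in_flight)
            return WAIT;         /* let outstanding I/O complete           */
        if (dirty && missing)
            return PREREAD_OLD;  /* read old data/parity for the calc      */
        if (dirty)
            return WRITE_OUT;    /* compute parity, write data and parity  */
        return IDLE;             /* nothing to do until requests arrive    */
    }
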
The result of these changes is that read throughput is much improved,
and that write throughput is somewhat improved, as can be seen in
figure 3.
Figure 3: RAID5 Read and Write Throughput for 2.2, 2.4, and patched
2.4
5.2 No more snooping
Both the read and write throughput increased markedly with the fixing
of the locking of the stripe cache. This is not surprising as it is
now possible for multiple requests on a single stripe to be handled
concurrently. However the throughput still hasn't reached 2.2
levels.
The shortfall in read throughput for larger numbers of drives (>3)
will be dealt with in the next section.
The shortfall in write throughput --- the topic of this section --- is
due to the loss of buffer cache snooping.
As mentioned earlier, buffer cache snooping in 2.2 allowed RAID5 to
often collect a whole stripe worth of write requests together and to
process them as a whole. This greatly improves performance as no
pre-reading phase is needed. In 2.4, this is not possible, and a
different strategy is needed.
The need to collect several requests together to make a full stripe
before commencing I/O is similar to the need to collect multiple
contiguous requests together when accessing a normal rotating disc
drive, to allow full tracks (or at least, larger extents) to be read in
a single request. This is currently achieved with an approach known as
plugging.
For normal drives, a new request on an idle drive causes the drive
queue to be ``plugged''. This means that the request is queued
normally, but that all processing of the queue is temporarily
disabled. This allows multiple requests to be collected together and
merged so that fewer, larger requests get created.
At some point, the queue needs to be unplugged, and this is currently
done in Linux using a simple ``un-queue-all-devices'' call whenever a
kernel thread needs to wait for a specific buffer to be completed, or
when it is noticed that too much memory is being used by buffers with
pending I/O.
If a device is busy, then it is expected that newly added requests
will not be processed for a little while. This means that there will
be a natural delay in processing requests which will allow neighbouring
requests to be merged.
This concept can translate to RAID5, but not perfectly. As an initial
test of the idea, I implemented a simple plugging scheme whereby
stripes with write requests added to them were added to a separate
queue which would only be processed after an unplug call. As
figure 4 shows, this made a substantial difference for
write throughput to arrays with a 4K chunk size. It made some
difference for other configurations, but not as much, as
figure 5 shows. The key feature of the 4K-chunk
configuration is that consecutive requests are likely to get
attached to the same stripe (given a 4K filesystem and hence a 4K
stripe size). Other configurations need to collect requests on more
than two stripes at once, and this didn't seem to happen very often.
I believe this is because the queue was being unplugged too often.
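
The scheme itself can be sketched as follows; this is a simplified
model with invented names, whereas the real code works on the stripe
cache's own lists:

    /* Illustrative model of plugging write-bound stripes: instead of
     * being handled immediately, they are parked on a delayed list
     * until an unplug call releases them, giving later writes to the
     * same stripes a chance to arrive first. */

    struct toy_stripe;

    #define MAX_DELAYED 64

    struct delayed_queue {
        struct toy_stripe *delayed[MAX_DELAYED];
        int count;
        int plugged;
    };

    /* Called when a write request is attached to a stripe: park it
     * rather than starting the parity calculation straight away. */
    void delay_write_stripe(struct delayed_queue *q, struct toy_stripe *sh)
    {
        q->plugged = 1;
        if (q->count < MAX_DELAYED)
            q->delayed[q->count++] = sh;
    }

    /* Called from the generic unplug path: release everything that was
     * being held back so pre-reading and writing can begin. */
    void raid5_unplug(struct delayed_queue *q,
                      void (*handle)(struct toy_stripe *))
    {
        q->plugged = 0;
        while (q->count > 0)
            handle(q->delayed[--q->count]);
    }
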
Figure 4: RAID5 Write Throughput showing the effect of
plugging with a 4K chunk size
Figure 5: RAID5 Write Throughput showing the effect of
plugging with an 8K chunk size
One cause of frequent unplugging is the RAID5 code itself.
The RAID5 code maintains a fixed-size stripe cache. Whenever a request
arrives for a stripe that is not in the cache, it must wait for a
stripe to become available. Before waiting, it must unplug any
underlying devices. Because of the paucity of the interface for
unplugging devices, it must unplug all devices, including itself!
This means that for every request that goes to a new stripe, any
previous stripes will be unplugged.
There are strategies available to reduce the problem of frequent
unplugging.
One is to free up stripes in large groups, instead of just one at a
time. This means that a series of requests can arrive for several new
stripes, and they can all be served without forcing an unplug.
Another strategy, which I believe holds more promise, is to simulate
the delay of a busy drive.
Currently, as soon as a stripe has a request attached to it, it will
cause a matching request to be submitted to lower level devices
(unless it is a plugged write request). This is done without any
regard to how busy the underlying device is, as RAID5 has no way to
know. If, however, we limit the number of stripes which can have
outstanding requests on lower level devices to, say, half of the
cache, then we will get a natural delay in processing that is
independent of plugging (which, as we have seen, is not always
effective).
This strategy has not yet been implemented.
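
Since it is unimplemented, the following is no more than a rough sketch
of the idea, with invented names and an arbitrary cache size:

    /* Rough sketch of the proposed throttle: only allow a fraction of
     * the stripe cache to have I/O outstanding on the member devices,
     * so later stripes wait and their writes have a chance to cluster.
     * Names and numbers are invented for illustration. */

    #define STRIPE_CACHE_SIZE 256

    struct stripe_throttle {
        int active;                          /* stripes with I/O in flight */
        int limit;                           /* e.g. STRIPE_CACHE_SIZE / 2 */
    };

    /* Returns nonzero if this stripe may start lower-level I/O now;
     * otherwise it stays queued, giving a natural delay that does not
     * depend on plugging. */
    int may_start_io(struct stripe_throttle *t)
    {
        if (t->active >= t->limit)
            return 0;
        t->active++;
        return 1;
    }

    /* Called when all lower-level requests for a stripe have completed. */
    void stripe_io_done(struct stripe_throttle *t)
    {
        t->active--;
    }
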
5.3 Loss of zero copy reads
The shortfall in read throughput observed in
figure 3 only affects arrays with more than 3
drives, and seems to be a general lowering of the ceiling imposed by
some bottleneck in the whole system.
It will be helpful at this point to consider the relationship between
read throughput for RAID0 and RAID5.
The read throughput of a fully working RAID5 array should be
comparable to that of a RAID0 array, except for the effect of the
parity blocks. Though the parity blocks do not have to be read, they
do have to be skipped over by the drive head, and so they do affect
the throughput. In general, for an N-drive array, the RAID5 read
throughput should be about (N-1)/N of the RAID0 read
throughput.
Figure 6 shows that while this is true for small values of N, it
quickly ceases to be true as N grows.
The main difference between the handling of reading in RAID5 and in
RAID0 is the stripe cache in RAID5. The RAID5 read must both wait for
a free stripe, if there isn't one, and must copy the data out of the
cache and into the target buffer on completion. Both of these could
slow things down.
I implemented a simple cache-bypass scheme for read requests on
stripes that were not engaged in writing or synching and got
throughput much closer to the theoretical goal, as shown in
figure 6. This shows that memory-to-memory copying was
slowing things down for high throughputs.
Figure 6: Read throughput comparing RAID0 and RAID5 and particularly
showing the effect of bypassing the cache for reading
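
The bypass decision itself is simple, and can be sketched like this;
the names and structures are invented for illustration, not the actual
raid5.c code:

    /* Illustrative sketch of the read fast path: if the target stripe
     * is not busy writing or resyncing, hand the caller's buffer
     * straight to the member disk and skip the stripe cache and its
     * extra memory-to-memory copy. */

    struct toy_request { int is_read; };

    int stripe_is_active(long sector);        /* cached, writing or syncing? */
    void submit_to_member_disk(struct toy_request *req, long sector);
    void attach_to_stripe_cache(struct toy_request *req, long sector);

    void raid5_make_request(struct toy_request *req, long sector)
    {
        if (req->is_read && !stripe_is_active(sector)) {
            /* Fast path: no stripe allocation, no copy on completion. */
            submit_to_member_disk(req, sector);
            return;
        }
        /* Slow path: writes, resync, and reads overlapping an active stripe. */
        attach_to_stripe_cache(req, sector);
    }
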
The original RAID code already treated read requests this way, and
this functionality was lost when I freed up the locking on the stripe
cache, so it is not surprising that it needed to be put back.
However I am quite happy that I left it out and then put it back in
later as
- it meant I could have an objective measure of how important it
was, and
- I am fairly sure that the code was much neater than it would
have been had I tried to include it from the beginning. It is a
special-case fast path and is best treated as an optional extra that
is only used when convenient.
6 Future directions
With the changes described above, software RAID, and RAID5 in
particular, is quite usable in 2.4 kernels, and is as usable as, or
more usable than, the software RAID in patched 2.2 kernels. If the
scheme for clustering writes by limiting the number of concurrent
lower-level requests works, this will be especially true.
However there is more work that can be done to make soft RAID work
even better. Some directions are outlined below.
6.1 Partitions
Most storage devices can be partitioned into separate sub-devices to
allow convenient storage management. However RAID devices cannot
currently be partitioned. It is true that subdivision can be done using
LVM, the logical volume manager. However that approach is more
heavyweight than traditional partitioning, and may not always provide
what is required.
A particular use for partitioning is providing a mirrored pair of
devices as the boot device. A typical boot device contains a
partition for the root filesystem, a partition for swap, and another
partition for /usr or similar. It would be nice to be able to
mirror the whole boot device onto another device, and then partition the
whole in the same way that a single device is partitioned.
This would particularly make adding a mirror to an already configured
machine much easier.
I have initial patches to do this, but more work is needed.
6.2 Reorganise recovery
Currently the recovery process is a bit awkward:
- Recovery after an unclean shutdown is handled quite differently to
recovery after a failed drive.
- Only one RAID array in a system can recover from a failed drive at a
time.
- It is not possible to have a spare drive that can be attached as
needed to either of two RAID arrays.
None of these are particularly hard issues to solve, but they point to
an untidiness in the implementation that should be dealt with.
6.3 RAID5 with variable buffer sizes
Currently the stripe size of the RAID5 stripe cache is set to the
block size of the filesystem that is mounted on it. If the block size
changes (as it does, for example, when first mounting the filesystem)
the stripe cache is flushed and rebuilt.
This assumes that a filesystem has a fixed block size. However many
newer filesystems would benefit from not having a fixed block size, and
it is likely that Linux will drop this restriction soon. When it
does, RAID5 will be left out in the cold.
Something needs to be done to allow requests of various sizes to be
handled by RAID5. This means having a fixed stripe size for the stripe
cache, and then either mapping all requests to this size (pre-reading
for small writes and over-reading for small reads), or allowing each
buffer in a stripe to be only partially valid.
This is as yet completely unimplemented.
6.4 Random other stuff
- Currently, some status information is made available in /proc/md.
It would be nice to create a directory structure instead which made it
easier to access and parse information about different devices.
- Currently, writes to the RAID super block are NOT CHECKED, and a
write error there will not be noticed.
- There is a lot of unnecessary data stored in the super block. It
would be possible to choose a new superblock layout that is only 1K
(instead of 4K) and allows for hundreds of devices per array instead
of 29.
[1] See http://www.research.att.com/gjm/linux/ide-i75RAID.html