initramfs, udev, mdadm md arrays, and external bitmaps

MD arrays are great for data integrity, but if you ever actually need that redundancy, you may be in for a very long array resynchronization. Unless, that is, you have a write-intent bitmap configured. If you do, the system can quickly find the few regions that may be out of sync and fix only those. This matters most for RAID 5 or 6.

mdadm lets you store the write-intent bitmap either internally, on the array's own member devices, or externally, as a file on an ext filesystem. Performance tests show that external bitmaps can be up to 5x faster (random write IOPS), especially if written to SSD or NVMe storage.
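As a rough sketch of what moving to an external bitmap looks like, the function below wraps the two mdadm --grow calls involved. It assumes the array currently has an internal bitmap (removing a bitmap that isn't there is expected to fail), and the array name and bitmap path in the usage line are placeholders, not values from any real setup.

```shell
# Sketch: replace an array's internal bitmap with an external bitmap file.
# Assumption: the array currently has an internal bitmap.
switch_to_external_bitmap() {
    array="$1"   # e.g. /dev/md0 (placeholder)
    file="$2"    # bitmap file; mdadm wants a path containing a '/',
                 # and the file must not live on the array itself

    # Drop the existing bitmap first, then add the external one.
    mdadm --grow "$array" --bitmap=none || return 1
    mdadm --grow "$array" --bitmap="$file"
}

# Usage (as root, placeholder names):
#   switch_to_external_bitmap /dev/md0 /var/local/md0.bitmap
```

The ARRAY line for that array in /etc/mdadm/mdadm.conf then needs a matching bitmap= entry so that later assembly knows where the file lives.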

The Problem

To assemble an MD array with an external bitmap, the filesystem containing the bitmap must already be mounted read-write. Presumably, this is the root filesystem, but it needn't be. Back in the old days of carefully-ordered init scripts, this worked fine. Now, not so much.

Enter systemd, which makes the world more dynamic and asynchronous. Systemd forced us to have /usr mounted in order for user-space startup to proceed. So, now, we build a kernel with a companion initramfs (initrd) with all the bootstrap stuff from / and /usr and whatever else is needed to get the system going.

Things may vary, but on Debian bullseye, boot proceeds like this: The kernel loads the initramfs and mounts it at / before turning over control to the /init script. The initrd instance of systemd-udevd wants to activate all the block devices (including MD arrays, LVM logical volumes, USB sticks, etc.) before the init script tries to mount local filesystems. In the specific case of MD, there is a udev rule that runs mdadm --incremental on every component device.

This is the chicken-and-egg problem: MD needs to access the bitmap on a mounted filesystem, or else it will fail and leave the array 'inactive'. But no filesystem will be mounted until MD assembly has been attempted. Eventually, the initramfs sequence will mount the root filesystem, but it will be at /root, which makes the path to the bitmap different from what's specified in /etc/mdadm/mdadm.conf.
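If you suspect you've hit this, a quick way to confirm is to look for arrays stuck in the 'inactive' state. The helper below is only a sketch: it assumes the usual /proc/mdstat layout, where each array line reads "mdN : state ...", and the function name is made up for illustration.

```shell
# Sketch: print the names of MD arrays that the kernel reports as inactive.
# Reads /proc/mdstat by default; pass another file to inspect a saved copy.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }' "${1:-/proc/mdstat}"
}

# Usage:
#   inactive_arrays
```

Any array it prints was found by udev but could not be started, which is what you'd expect when the bitmap file was unreachable at assembly time.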

A Solution

In many cases, the MD array is not actually needed for early startup. A simple solution can be found at LabFruits. The idea is to get the initrd to completely ignore the MD array. It will be handled by systemd once / and /usr are mounted in their final places and the ramdisk environment is gone. Systemd provides many tools for managing dependencies and is tightly integrated with udev. Note that putting AUTO -all into /etc/mdadm/mdadm.conf did not have the desired effect.

Here is a more conservative solution than the LabFruits way. Install the following script as /etc/initramfs-tools/hooks/mdadm-nuke.
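A minimal sketch of such a hook is shown here. It assumes that all it has to do is delete the MD udev rules (files named like 64-md-raid-assembly.rules) from the staging tree that mkinitramfs exports as DESTDIR, and that the packaged mdadm hook has already copied them in by the time this one runs. The function name is invented for illustration; adjust the pattern if your mdadm package names its rules differently.

```shell
#!/bin/sh
# Sketch of /etc/initramfs-tools/hooks/mdadm-nuke (assumed contents).

# initramfs-tools first calls each hook with "prereqs" to work out ordering.
if [ "${1:-}" = prereqs ]; then
    echo ""
    exit 0
fi

# Remove every MD udev rule file found under the given staging tree.
# Assumption: the rules are named <NN>-md-raid-<something>.rules.
nuke_md_rules() {
    find "$1" -name '*-md-raid-*.rules' -exec rm -f {} +
}

# mkinitramfs exports DESTDIR while hooks run; do nothing if it is unset.
if [ -n "${DESTDIR:-}" ]; then
    nuke_md_rules "$DESTDIR"
fi
```

The hook has to be executable (chmod +x), otherwise initramfs-tools will skip it.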

Then run update-initramfs -u -k kernel_version in order to regenerate the ramdisk. You can verify the contents via lsinitramfs. It's best to keep at least one good kernel and/or initrd.img around for rescue booting.
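To make the verification step less of an eyeball exercise, something like the following can scan the lsinitramfs listing for leftover MD udev rules. It is a sketch under the assumption that the rules follow the <NN>-md-raid-<name>.rules naming; both function names are made up here.

```shell
# Sketch: succeed only if a file listing on stdin contains no MD udev rules.
# Any offending paths are printed so you can see what was left behind.
no_md_rules() {
    ! grep -E '[0-9]+-md-raid-[a-z]+\.rules$'
}

# Check one initrd image, e.g. /boot/initrd.img-<kernel_version>.
check_initrd() {
    lsinitramfs "$1" | no_md_rules
}

# Usage, after regenerating the ramdisk:
#   check_initrd "/boot/initrd.img-$(uname -r)" && echo "no MD rules left"
```

Note that this only looks for the udev rules; the mdadm binary and mdadm.conf may still be present in the image, which is harmless as long as nothing in the ramdisk invokes them.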

With this new setup, udev in the initramfs is oblivious to MD arrays and mdadm. Then, as soon as the real systemd on the real root takes over, the udev rules for MD arrays spring back to life. The MD arrays are assembled, LVs are discovered, and filesystems are mounted and checked. It's almost like magic.

Followup Details

What remains is to make sure that services such as mailservers, webservers, and databases don't start before the filesystems they need are mounted. An easy way to do this is to run, for example, systemctl edit apache2.service and then enter an override in the [Unit] section, for example a RequiresMountsFor= line listing the paths the service needs.

The end result is to create a directory and file like this: /etc/systemd/system/apache2.service.d/override.conf, but you're free to create the directory yourself and have .conf files with whatever names you like.
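If you'd rather script the drop-in than go through the editor, a sketch along these lines should do. It assumes RequiresMountsFor= is the dependency you want; the file name wait-for-mounts.conf and the /var/www path in the usage line are placeholders chosen for illustration.

```shell
# Sketch: write a drop-in that makes a unit wait for the mount behind a path.
#   $1  unit name, e.g. apache2.service
#   $2  path the service needs, e.g. /var/www (placeholder)
#   $3  drop-in root, defaults to /etc/systemd/system
write_mount_dropin() {
    unit="$1"
    path="$2"
    root="${3:-/etc/systemd/system}"
    dir="$root/$unit.d"

    mkdir -p "$dir" || return 1
    printf '[Unit]\nRequiresMountsFor=%s\n' "$path" > "$dir/wait-for-mounts.conf"
}

# Usage (as root):
#   write_mount_dropin apache2.service /var/www && systemctl daemon-reload
```

Unlike systemctl edit, writing the file by hand does not reload systemd for you, hence the daemon-reload in the usage line.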

The good news is that all of this can be achieved via changes in /etc. No package-provided files need to be changed.

Tue May 16 23:15:03 2023