
NAS setup logbook


After two weeks, my NAS is finally complete (almost: I still need to set up NUT). You might think "two weeks to build a NAS?!", but it was homemade with love, no TrueNAS/OMV hand-holding was used here. That, and I had to wait for hardware deliveries.

I wanted to write an autistic post about the long process and its minutiae, so here it is.

Hardware §

Inside the NAS

I already had a "home server" in the shape of an old repurposed workstation: AMD R5 2600, an ASUS motherboard with 6 SATA ports, 8 GB of ECC DDR4 in an old Fractal Design Define R4 (lots of drive bays). So I only had to buy new HDDs (5x 10 TB Seagate IronWolf Pro; not trusting WD after the SMR Red scandal) and SATA power cable extensions (to allow single-disk hot-unplugging). And a small Eaton UPS (3S 700VA).

First time using AliExpress: got the exact cables I wanted for 0,80€/unit instead of 10€ (!) on (Sc)Amazon. Will shop there for that kind of small potatoes from now on.

PSU failure

In the end, even after doing everything listed below, I still had constant ATA link failures and even a disk ejected from the RAID. On a whim, I swapped the PSU for a brand new one I had lying around (never paid for that 80+ Platinum unit, it was an order mistake, lucky bastard that I am) and everything got fixed! Safe to say this was probably the root issue, not my old drives dying, in defense of Toshiba's honour.

The old one (bottom right of the photograph) was a decent Seasonic Gold model, but well, it was 6 years old and had seen a few power surges… the replacement is still OEM'd by Seasonic, heh.

Topology and filesystem §

$ ssh user@server lsblk /dev/disk/by-id/ata-ST10000NT001-3LY101_WP027C'??'
NAME                      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                         8:0    0   9.1T  0 disk
└─sda1                      8:1    0   9.1T  0 part
  └─dm-integrity-WP027C1N 253:0    0   9.1T  0 crypt
    └─md0                   9:0    0  18.2T  0 raid6 /home/user/data
sdb                         8:16   0   9.1T  0 disk
└─sdb1                      8:17   0   9.1T  0 part
  └─dm-integrity-WP027C9L 253:3    0   9.1T  0 crypt
    └─md0                   9:0    0  18.2T  0 raid6 /home/user/data
sdc                         8:32   0   9.1T  0 disk
└─sdc1                      8:33   0   9.1T  0 part
  └─dm-integrity-WP027C98 253:2    0   9.1T  0 crypt
    └─md0                   9:0    0  18.2T  0 raid6 /home/user/data
sdd                         8:48   0   9.1T  0 disk
└─sdd1                      8:49   0   9.1T  0 part
  └─dm-integrity-WP027C38 253:1    0   9.1T  0 crypt
    └─md0                   9:0    0  18.2T  0 raid6 /home/user/data

Decided on RAID 6 because I don't play around and don't need that much space. Why the strange dm-integrity+mdraid+XFS choice? Simple: I don't care about compression (multimedia formats are already compressed) nor online dedup (I just want cp --reflink), only self-healing. I had two other available choices: Btrfs, known for its unstable RAID 5/6 support, and ZFS.

NB: only use dm-integrity with a kernel ≥5.4 (reason).

Why not ZFS?

Because looking at OpenZFS through the eyes of a C/C++ dev (who knows how tricky those languages are, unlike most sysadmins), unaffected by the hype, tells me they're losing the fight against complexity, historical cruft and portability friction. In detail:

  • They're still implementing complex and invasive features instead of focusing on stability.
  • Related, they don't seem to have any concept of LTS: the 2.2 series was abandoned as soon as 2.3 started being worked on (some brave soul is spearheading a 2.2.8 effort). Doubly important since Linux isn't backward compatible with external modules (so you can't "just use an old version").
  • It may be full of useful features (and by extension so many knobs to twiddle it'd give JVM tuning consultants a hard-on) but it's also a bug nest. And not small ones: critical bugs. Seriously, just search "OOM" or "crash" in their bug tracker and peruse a bit.
  • ZFS implements its own page cache (so no zero-copy sendfile(2)/splice(2)), scheduler (so no ionice(1)), mdadm/mdraid+LVM+dm+etc… it's basically the X.org of filesystems, for better and for worse.
  • Fragmentation isn't a solved problem (compounded by the lack of a working fallocate(mode=0), supposedly a byproduct of CoW, though XFS doesn't seem to suffer as much, probably because it uses huge, thus dangerous, writeback buffers), and no defragmentation is possible (not even offline!) without "simply" building a new pool and moving the old data over.
  • I've rarely seen numbers to justify the ARC hype over Linux's built-in LRU. Would love to see modern benchmarks.

So thanks but no thanks, my mother always told me not to play with radioactive fire.

Step by step §

First, I reused the OS that was on my server: a trusty Gentoo (like on my desktop).

Preliminary tasks

  1. Update the ancient BIOS just to be sure and inspect the settings (disabled PBO, stupid on an always-on server).
  2. Check the kernel config for:
    • CONFIG_CRYPTO_CRC32C_INTEL (CONFIG_CRC_OPTIMIZATIONS on kernels ≥6.14)
    • CONFIG_MD_RAID456
    • CONFIG_DM_INTEGRITY
    • CONFIG_NFSD and CONFIG_NFSD_V4
    • CONFIG_MD_AUTODETECT=n, useless since our RAID is on top of dm.
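
On Gentoo, a quick way to check them all at once (assuming /usr/src/linux points at the running kernel's tree, as is customary):

$ grep -E 'CONFIG_(CRYPTO_CRC32C_INTEL|MD_RAID456|DM_INTEGRITY|NFSD|NFSD_V4|MD_AUTODETECT)( |=)' /usr/src/linux/.config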

SMART testing

Always vet your new hard drives with an extended SMART test and RMA the ones with errors.

$ for dev in /dev/disk/by-id/ata-ST10000NT001*
do
    sudo smartctl -t long "$dev"
done
…
$ for dev in /dev/disk/by-id/ata-ST10000NT001*
do
    sudo smartctl -l selftest "$dev" | tail -n+6
done
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        13         -

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        13         -

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        13         -

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        13         -

Partitioning

You can use raw hard drives for RAID and dm, but drives in an array can only be replaced (or added) by ones of equal or greater size, and manufacturers can (or used to?) ship slightly different sector counts for the same advertised size, so we create partitions that leave just a little space out.

$ for dev in /dev/disk/by-id/ata-ST10000NT001*
do
    echo size=19532800000 | sudo sfdisk -Xgpt "$dev"
done

Here I chose 19532800000 sectors because it's a multiple of the physical/logical sector size ratio of modern drives that aren't 4K native (8) and leaves a reasonable ~35 MB out.
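
To check the slack on your own drives, something like this sketch works (blockdev --getsz reports the total size in 512-byte sectors):

$ for dev in /dev/disk/by-id/ata-ST10000NT001*
do
    case $dev in *-part*) continue ;; esac  # skip the partition symlinks
    total=$(sudo blockdev --getsz "$dev")   # size in 512-byte sectors
    printf '%s: %s MB spare\n' "${dev##*/}" $(( (total - 19532800000) * 512 / 1000000 ))
done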

Create dm-integrity devices

We create dm-integrity devices on our new partitions; these will sit under the RAID and report any corrupted block, which will then be corrected using the RAID's parity. That's the aforementioned self-healing.

$ tmux
# Split in 4
# Launch `sudo integritysetup format --integrity-bitmap-mode --sector-size 4096
#     --progress-frequency 30 /dev/sdX1` in each
# Detach
Das tmux blinkenlights

Any better idea for a less involved solution? Could probably write a simple script to spawn each, filter/label their progress and multiplex it in a single FIFO, then nohup sudo it, hmmm.
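
Something like this rough sketch, maybe (untested; --batch-mode skips the confirmation prompt since stdin isn't a terminal anymore):

#!/bin/sh
# Rough sketch: format every partition in parallel and prefix each
# progress line with the drive's name so a single stream stays readable.
for p in /dev/disk/by-id/ata-ST10000NT001*-part1
do
    integritysetup format --batch-mode --integrity-bitmap-mode \
        --sector-size 4096 --progress-frequency 30 "$p" 2>&1 |
        sed "s|^|${p##*/}: |" &
done
wait

Saved as, say, format-all.sh, it could then be nohup sudo sh format-all.sh'd as planned.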

Create the RAID array

We open (mount) our new dm-integrity devices:

$ for p in /dev/disk/by-id/ata-ST10000NT001*-part1
do
    tmp=${p#*_}
    name=dm-integrity-${tmp%-part1}
    sudo integritysetup open --integrity-bitmap-mode "$p" "$name"
done

Build the array and wait patiently for the initial sync to finish:

$ sudo mdadm --create --verbose --level=6 --raid-devices=4 /dev/md0 /dev/mapper/*
…
$ cat /proc/mdstat
Personalities : [linear] [raid6] [raid5] [raid4]
md0 : active raid6 dm-3[3] dm-2[2] dm-1[1] dm-0[0]
      19513348096 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=====>..............]  resync = 29.1% (2842051012/9756674048) finish=825.8min speed=139545K/sec
      bitmap: 52/73 pages [208KB], 65536KB chunk

unused devices: <none>

Create the filesystem

$ sudo mkfs.xfs /dev/md0
meta-data=/dev/md0               isize=512    agcount=32, agsize=152448128 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0
data     =                       bsize=4096   blocks=4878337024, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Nothing more to do. The sunit and swidth parameters (the underlying RAID stripe unit/width, important for performance) are fortunately autodetected these days when using software RAID.
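
For the record, had autodetection not kicked in, the manual equivalent would have looked like this (su = RAID chunk size, sw = number of data disks, i.e. 4 drives minus 2 parities here):

$ sudo mkfs.xfs -d su=512k,sw=2 /dev/md0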

Setup OpenRC

This complex layer cake isn't going to automagically mount/unmount itself in the right order at startup/shutdown, so we need to set up our service manager to do so, OpenRC in my case.

In the same order as during creation, we first need to open our dm-integrity devices. This is where it really wasn't fun, because OpenRC has no support for this! Whereas systemd has integritytab, we need to write a replacement ourselves…

First, we look at some existing services in /etc/init.d (e.g. the very similar dmcrypt), then we use our honed POSIX sh writing skills and the official guide to bake ourselves a beautiful service:

The service file (/etc/init.d/dm-integrity):
#!/sbin/openrc-run
# Devices configuration in /etc/conf.d/dm-integrity

echo() { printf '%s\n' "$*"; }
nth_arg_from_end() { local delta=$1; shift; eval echo "\${$(($#-delta))}"; }


depend() {
    use modules
    before checkfs fsck
    after dev-settle
}

start() {
    local status=0 i= dev= argline=

    # dev-settle is broken, we still need to manually wait for /dev/disk symlinks
    ewaitfile 10 $(echo "$integritysetup_open_args" | awk '! /^$/ {print $(NF-1)}')

    while IFS= read -r argline; do
        dev=$(nth_arg_from_end 1 $argline)
        ebegin "integritysetup open $argline"
        if ! [ -b "$dev" ]; then
            eend 1 "$dev: device not found even after waiting 10 s"
            status=1
            continue
        fi
        integritysetup open --batch-mode $argline
        eend $? || status=1
    done <<EOF
$(echo "$integritysetup_open_args" | sed '/^$/d')
EOF

    if [ $status -eq 1 ]; then
        eerror "Failed opening some devices"
        return 1
    fi
}

stop() {
    local name=
    echo "$integritysetup_open_args" | awk '! /^$/ {print $NF}' |
        while IFS= read -r name; do
            ebegin "integritysetup close $name"
            integritysetup close "$name"
            eend $?
        done
}

status() {
    local name=
    # NB: the while loop runs in a pipeline subshell, so signal failure
    # with exit; the pipeline's status becomes the function's return value
    echo "$integritysetup_open_args" | awk '! /^$/ {print $NF}' |
        while IFS= read -r name; do
            if ! [ -b /dev/mapper/"$name" ]; then
                eerror "$name not opened"
                exit 1
            fi
        done
}

Then we just have to configure it (via the /etc/conf.d/dm-integrity file):

# /etc/conf.d/dm-integrity: config file for /etc/init.d/dm-integrity

# Each line contains the arguments for an `integritysetup open` call; empty lines are ignored
integritysetup_open_args='
--integrity-bitmap-mode /dev/disk/by-id/ata-ST10000NT001-3LY101_WP027C1N-part1 dm-integrity-WP027C1N
--integrity-bitmap-mode /dev/disk/by-id/ata-ST10000NT001-3LY101_WP027C38-part1 dm-integrity-WP027C38
--integrity-bitmap-mode /dev/disk/by-id/ata-ST10000NT001-3LY101_WP027C98-part1 dm-integrity-WP027C98
--integrity-bitmap-mode /dev/disk/by-id/ata-ST10000NT001-3LY101_WP027C9L-part1 dm-integrity-WP027C9L'

NB: some of the complexity comes from the fact that I use those device symlinks set up by udev, since they're the only unambiguous way to refer to physical drives; one could use PARTUUIDs, but I don't like them very much.

Then enable it by adding it to the early boot runlevel:

$ sudo rc-update add boot dm-integrity

Now for the RAID, we need to register the array for mdadm to know what to re-assemble later:

$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf

Then declare the new dependency between our two services:

$ echo 'rc_need=dm-integrity' | sudo tee -a /etc/conf.d/mdraid

Before also enabling it:

$ sudo rc-update add boot mdraid

That's it!

Conclusion

Except for the startling absence of an integritytab equivalent, OpenRC's dev-settle being broken, and the Gentoo wiki lacking documentation on RAID and dm-integrity (always check the Arch wiki too), this wasn't too bad.

Still painful for the non-computer wizards compared to ZFS's all-in-one approach. This is something Red Hat's Stratis may fix in the future… crossing my fingers.

Omake 1: NFS §

I don't need encryption within my LAN, sshfs is slow and dead, and Samba brings too much Windows brain damage along (and can be a bit slow too). So NFS it is.

Never used it before, but it wasn't too hard. Well, it wasn't because I explicitly used the much saner NFSv4 everywhere (I even disabled v2/3 at the kernel and USE flag level), which does away with many moving parts (rpc.statd, rpc.idmapd, rpcbind, sm-notify) and needs only a single TCP port opened in firewalls.
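
For reference, a minimal sketch of what the exports file can look like with a v4 pseudo-root (the subnet is an assumption, adapt to your LAN):

# /etc/exports (sketch; fsid=0 marks the NFSv4 pseudo-root)
/exports       192.168.1.0/24(rw,fsid=0,no_subtree_check)
/exports/data  192.168.1.0/24(rw,no_subtree_check)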

Had a few problems with OpenRC being decrepit in that area too, but after a few hours I had myself a squeaky clean and performant NFSv4.2 setup with server-side copy working.

NFS and reflinking

Even more impressive, I got server-side reflinking (copy-on-write copies, for the cavemen) for free. Behold!

$ mkdir share; sudo mount -t nfs -o rw server:/exports/data/ share/
$ echo foo >share/a
$ cp --reflink=auto share/a share/b  # --reflink=auto is the default since coreutils 9.0
$ ssh user@server is_reflink /exports/data/a /exports/data/b && echo '200% rad!'
200% rad!

PS: is_reflink is a small Linux-only script that can be found here

Apparently, anything that uses copy_file_range(2) Just Werks™ (in FreeBSD 13 too).

Omake 2: lf §

Now that all my media files are accessed through NFS and a GbE link, using the poorly programmed ranger, with its UI blocking on I/O, quickly became torture. Despite its long years of loyal service, I switched to lf this weekend and am now thoroughly impressed by the gap in performance and design cleanliness.

One consequence of the UNIX-ier design is that I was forced to improve my xdg-open script to replace ranger's rifle/sxiv-rifle.
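
For the curious, the heart of such a script is just a MIME-type dispatch; a bare-bones sketch (the viewers on the right are my assumptions, swap in your own):

#!/bin/sh
# Bare-bones opener: pick a program based on the file's MIME type.
case $(file --brief --mime-type -- "$1") in
    video/*|audio/*) exec mpv -- "$1" ;;
    image/*)         exec nsxiv -- "$1" ;;
    application/pdf) exec zathura -- "$1" ;;
    *)               exec "${EDITOR:-vi}" -- "$1" ;;
esac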