World Playground Deceit.net

GNU parallel out, pararun in


Another third-party tool removed from my toolbox. And not just any small bauble: a 16k-line Perl script that often gave me headaches, GNU parallel.

I actually made three (3) replacements because I was incredibly bored:

  • pararun: a wrapper around xargs -P that adds features such as exiting on job error and progress reporting.
  • pararun -m: the same wrapper using make -j (POSIX-2024) as command runner.
  • pararun_portable: an exercise in futility, more precisely a portable and standalone (without my portability shim) version of the previous script, using only the shell job control and a FIFO.
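To give an idea of what the first variant builds on, here is a minimal sketch of the xargs -P approach with "exit on job error". It is not pararun itself: the function name run_parallel is made up for illustration, and it assumes input items contain no blanks, quotes or backslashes (xargs would split or mangle them). It relies on xargs stopping as soon as an invocation exits with status 255.

```shell
# run_parallel JOBS CMD: run CMD once per stdin item, JOBS at a time.
# Hypothetical helper, not the real pararun. Inside CMD, "$1" is the item.
run_parallel() {
    njobs=$1
    cmd=$2
    # Any failing job is mapped to exit status 255, which tells xargs to
    # stop processing the remaining input and exit non-zero.
    # -P is not in POSIX but is available in GNU, BSD and busybox xargs.
    xargs -P "$njobs" -n 1 sh -c '
        cmd=$1
        shift
        (eval "$cmd") || exit 255
    ' sh "$cmd"
}

# Usage: output order is not guaranteed.
printf '%s\n' a b c | run_parallel 2 'echo "item $1"'
```

Evaluating the command in a subshell means an explicit exit inside it is still caught and turned into 255.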

The resulting syntax is in my opinion much clearer, since it's just an explicit sh command. These days, I much prefer the now familiar "quoting hell" to "automagic, DSL and many-ways-to-do-stuff hell", especially when I have to debug it. Worse is truly better, in some cases.

$ bmps=$(find dir/ -type f -name '*.bmp')
$ echo "$bmps" | parallel magick {} {.}.png
$ echo "$bmps" | pararun 'magick "$1" "${1%.*}".png'
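For reference, a few of parallel's replacement strings have rough counterparts in plain sh parameter expansion. The sketch below uses a made-up function name and a made-up path. The match is only approximate: "${1%.*}" cuts at the last dot of the whole path, so a path with a dotted directory and an extensionless file would behave differently.

```shell
# Rough sh counterparts of some GNU parallel replacement strings.
# show_expansions is a hypothetical name used only for this example.
show_expansions() {
    base=${1##*/}              # {/}  : basename
    printf '%s\n' \
        "$1" \
        "${1%.*}" \
        "$base" \
        "${base%.*}"
    # In order: {} (the item), {.} (extension removed),
    # {/} (basename), {/.} (basename, extension removed)
}

show_expansions 'dir/sub/photo.bmp'
# prints:
#   dir/sub/photo.bmp
#   dir/sub/photo
#   photo.bmp
#   photo
```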

Parallel's --keep-order flag is easily emulated via the JOBNUM variable (equivalent to parallel's {#}), which is made available to the command:

$ parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3
$ printf '%s\n' 2 1 4 3 | pararun -j4 'sleep $1; echo $JOBNUM $1' | sort -k1n | cut -d' ' -f2-
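The same trick works without any wrapper, which shows why it is sound: number the inputs, let each job print its number first, then sort on it and cut it off. In this sketch awk stands in for whatever sets JOBNUM, the function name is invented, and the sleeps are scaled down to fractions of a second (fractional sleep is not POSIX, but GNU and BSD sleep accept it).

```shell
# keep_order_demo: hypothetical stand-alone version of the --keep-order trick.
keep_order_demo() {
    printf '%s\n' 2 1 4 3 |
        # Prefix every item with its line number (the "job number").
        awk '{ print NR, $0 }' |
        # -L 1 passes each "number item" line as two arguments: $1 and $2.
        xargs -P 4 -L 1 sh -c 'sleep "0.$2"; echo "$1 $2"' sh |
        # Jobs finish out of order; sort by job number, then drop it.
        sort -k1n | cut -d' ' -f2-
}

keep_order_demo
# prints 2, 1, 4, 3 (one per line), i.e. the input order
```

Note that the output only appears once every job is done, since sort needs its whole input.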

In fact, here's the final wrapper: paramap

The comparison stops here because, yes, parallel has many more features, especially its remote job distribution via ssh, which would be a lot more work to duplicate correctly.

Or does it? Using the make backend with GNU make (and bmake, if I understand their job token pool thing correctly) brings something I have always wanted: jobserver orchestration for nested instances, to never let a CPU core go unused.

Worth mentioning that this backend has a drawback: make - only starts working after stdin is closed. This "work" includes parsing and dependency resolution, which means a small startup overhead that grows with the number of jobs. Said overhead isn't massive, but it's there:

$ seq 1000 | time pararun ''
pararun ''  0.21s user 0.12s system 177% cpu 0.186 total
$ seq 1000 | time pararun -m ''
pararun -m ''  1.49s user 0.77s system 186% cpu 1.211 total
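To illustrate where that overhead comes from, here is one way a make-based runner could be sketched: turn every input line into a phony target, then hand the generated makefile to make -j on stdin. This is an assumption about the general technique, not the actual pararun -m code. The names gen_makefile and run_with_make are invented, and items are assumed to be free of whitespace, quotes, '$' and anything else special to make or sh.

```shell
# gen_makefile: emit one phony target per stdin line, plus an "all" target
# depending on every job. Hypothetical helper for illustration only.
gen_makefile() {
    awk '
        {
            # Each job is a target with a one-line recipe (tab-indented).
            printf "job%d:\n\t@echo \"processing %s\"\n", NR, $0
            targets = targets " job" NR
        }
        END {
            printf "all:%s\n.PHONY: all%s\n", targets, targets
        }
    '
}

# run_with_make JOBS: read items on stdin and run them JOBS at a time.
# make has to read the whole makefile before it can start the first job.
run_with_make() {
    gen_makefile | make -j "$1" -f - all
}
```

For example, printf '%s\n' a b c | run_with_make 2 would print the three "processing" lines in whatever order the jobs finish. The "all" target is named explicitly because make would otherwise only build the first target in the file.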