Bourne shell stdio trivia
My first post! Not a very groundbreaking one, but hey, we must all start somewhere. So here are a few gotchas related to the handling of standard streams in sh(1) scripts.
NB: all of the following scripts are POSIX.1-2024 compliant unless noted otherwise.
`while read` loops and interactive commands §
Let's start with a classic! Picture this: you're reading a list of files on stdin to replace their extension, but you know mistakes happen, so you use your trusty `mv -i` to do the job:
```
find . -type f -name '*.zip' -print0 |
while IFS= read -r -d '' f
do
    mv -i -- "$f" "${f%.zip}.cbz"
done
```
And like any diligent scripter, you test your new creation. Here's a timeline of the typical ensuing debugging session:
- “Huh? Why does my loop end at the first conflict?”
- Check what goes into the `while` with a judiciously placed `tee /dev/stderr`, but everything looks A-OK.
- Suppress the global `set -e` you always use (remember, I said diligent) with a `|| true` or `set +e; …; set -e` wrapper before a hopeful second try.
- No cigar! “Damn, what could it be?”
- The nuclear option, `set -x`, doesn't help much.
- 10 minutes, then 30 minutes, then 1 hour of frantic "echo debugging", running snippets in different contexts, hair-pulling and Stack Overflow perusing.
- Despair.
At this point, either you ragequit or discover the truth by miracle: the `while` expression and its body share stdin! And `mv -i` reads from stdin when seeking your input. In fact, it completely drains stdin, not even stopping at the first newline although it only needs a single getline (we'll see later why), a fact easily demonstrated by such a test: `touch a b; seq 10 | { mv -i a b; cat; }`.
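Spelled out as a terminal session (the prompt wording is GNU mv's; other implementations differ):

```
$ touch a b
$ seq 10 | { mv -i a b; cat; }
mv: overwrite 'b'? $
```

mv took the "1" from seq as its (negative) answer and buffered the other nine lines, leaving cat nothing to print.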
So, how do we solve this? Well, there are two ways I can cite off the top of my head:
```
# 1. Read the answer from the terminal itself (yep, /dev/tty is POSIX)
find . -type f -name '*.zip' -print0 |
while IFS= read -r -d '' f
do
    mv -i -- "$f" "${f%.zip}.cbz" </dev/tty
done

# 2. Don't use stdin in the first place!

# IFS split on newline
oldIFS=$IFS
IFS='
'
for f in $(find . -type f -name '*.zip')
do
    mv -i -- "$f" "${f%.zip}.cbz"
done
IFS=$oldIFS

# Or use find's -exec with a subshell (argv0 only fills $0;
# the found paths land in $1 and onwards)
find . -type f -name '*.zip' -exec sh -c '
    mv -i -- "$1" "${1%.zip}.cbz"' argv0 {} \;

# Same without forking for each item
find . -type f -name '*.zip' -exec sh -c '
    for f
    do
        mv -i -- "$f" "${f%.zip}.cbz"
    done' argv0 {} +
```
Here's one "fun" and very unexpected such case that cost me a few points of sanity some
time ago, what in France you could call "double effet Kiss Cool": ffmpeg
reads from stdin even without being told to, to provide keybinds during encoding. And from memory,
it reads unbuffered, so you only lose a few crucial characters here and there.
tl;dr: always use ffmpeg -nostdin
if you don't fancy pain.
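For instance, here's a minimal sketch of the kind of batch loop where forgetting the flag bites (the .mkv-to-.mp4 conversion is a made-up example):

```
find . -type f -name '*.mkv' -print0 |
while IFS= read -r -d '' f
do
    # Without -nostdin, ffmpeg would steal bytes from find's output,
    # skipping or mangling the remaining file names.
    ffmpeg -nostdin -i "$f" "${f%.mkv}.mp4"
done
```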
Default stream buffering §
This issue isn't very well-known, and for good reason: the heuristic involved is pretty sensible, so the corner cases are rarely encountered. First things first, let's look at two of them:
```
trickle() {
    while true
    do
        echo "$1"
        sleep 1
    done
}

slurp() {
    while IFS= read -r line
    do
        echo "slurped: $line"
    done
}

trickle foo | tr 'a-z' 'A-Z' | slurp
```
Running this (at least with GNU tools; busybox, for example, is fine) shouldn't show any output unless you wait for a reaaally long time.
The second example is in fact none other than what we saw earlier: `mv -i` consuming the full stdin for no apparent reason!
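You can even measure the gobbled chunk by feeding `mv` more than one buffer's worth and counting the leftovers (the exact figure depends on your libc; 4 KiB and 8 KiB buffers are common):

```
touch a b
seq 100000 | { mv -i a b; wc -c; }
# wc reports everything past mv's first buffer-sized read
```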
So, any idea why? Well, it's a pretty low-level reason really: the initial buffering mode for the standard streams (stdin, stdout, stderr) is specified in such a way that programs read (resp. write) their data in big chunks instead of line by line if they can determine that stdin (resp. stdout) isn't a terminal (e.g. it's a pipe), likely via the isatty(3) function.

Basically, such buffering trades latency for throughput, which is considered a good deal in the vast majority of non-interactive cases.
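You can watch the heuristic flip just by changing what `tr`'s stdout is connected to (reusing `trickle` from above):

```
# stdout is your terminal: tr line-buffers, one FOO per second
trickle foo | tr 'a-z' 'A-Z'

# stdout is a pipe: tr fully buffers, silence for a long while
trickle foo | tr 'a-z' 'A-Z' | cat
```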
Now for the fix! As far as I know, there is no POSIX way of defeating this "feature" when you need to, but at least two utilities exist for that purpose: stdbuf and unbuffer. I'll point to this in-depth post for more details about them and just summarize here:
- Our first issue can be fixed by using `stdbuf -oL tr` or `unbuffer -p tr`.
- And the second with `stdbuf -i0 mv -i` (stdin was being fully buffered).
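Concretely, applied to our two examples (stdbuf ships with GNU coreutils, unbuffer with expect):

```
# 1. Line-buffer tr's stdout so slurp sees each line as it comes
trickle foo | stdbuf -oL tr 'a-z' 'A-Z' | slurp

# 2. Make mv read stdin unbuffered, byte by byte, so it stops
#    at the newline ending your answer
find . -type f -name '*.zip' -print0 |
while IFS= read -r -d '' f
do
    stdbuf -i0 mv -i -- "$f" "${f%.zip}.cbz"
done
```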
You might wonder how one encounters this in the wild. Well, in my case, it's a pretty simple tale of scraping of this kind:
```
curl http://foo.bar/gallery.html | extract_page_urls |
while IFS= read -r page_url
do
    curl "$page_url"
done | extract_img_urls |
while IFS= read -r img_url
do
    curl -O "$img_url"
done
```
At one point I wondered why the final downloading took so long to start…
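In hindsight, the same medicine applies, assuming the extract_* filters are standalone stdio-using programs (stdbuf is powerless against shell functions and static binaries):

```
curl http://foo.bar/gallery.html | stdbuf -oL extract_page_urls |
while IFS= read -r page_url
do
    curl "$page_url"
done | stdbuf -oL extract_img_urls |
while IFS= read -r img_url
do
    curl -O "$img_url"
done
```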
Unrelated disappointing surprise I encountered that day: I thought `curl -K -` would let me download asynchronously while still benefiting from significant HTTP pipelining gains. Nope, it only starts after reading the whole config =(.
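For the curious, the attempted shape was roughly this (same hypothetical extractor as above; `url = value` is curl's config-file syntax, and --remote-name-all saves each URL under its remote name):

```
curl "$page_url" | extract_img_urls | sed 's/^/url = /' |
curl --remote-name-all -K -   # only fires once the whole config is read =(
```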
Thanks for reading and I hope some of this article was interesting! To be honest, I had to find something to start my blog before putting my website online, so I don't blame you if you find this pretty lacking.