World Playground Deceit.net

Bourne shell stdio trivia

Published on
Tags: sh, programming

My first post! Not a very groundbreaking one, but hey, we must all start somewhere. So here are a few gotchas related to the handling of standard streams in sh(1) scripts.

NB: all of the following scripts are POSIX.1-2024 compliant unless noted otherwise.

while read loops and interactive commands

Let's start with a classic! Picture this: you're reading a list of files on stdin to replace their extension, but you know mistakes happen, so you use your trusty mv -i to do the job:

find . -type f -name '*.zip' -print0 | while IFS= read -r -d '' f
do
    mv -i -- "$f" "${f%.zip}.cbz"
done

And like any diligent scripter, you test your new creation. Here's a timeline of the typical debugging session that ensues:

  1. “Huh? Why does my loop end at the first conflict?”
  2. Check what goes into the while with a judiciously placed tee /dev/stderr but everything looks A-OK.
  3. Suppress the global set -e you always use (remember, I said diligent) with a || true or set +e; …; set -e wrapper before a hopeful second try.
  4. No cigar! “Damn, what could it be?”
  5. The nuclear option, set -x, doesn't help much.
  6. 10 minutes, then 30 minutes then 1 hour of frantic "echo debugging", running snippets in different contexts, hair-pulling and Stack Overflow perusing.
  7. Despair.

At this point, either you ragequit or discover the truth by miracle: the while expression and its body share stdin! And mv -i reads from stdin when seeking your input.

In fact, it completely drains stdin, not even stopping at the first newline although it only needs a single getline (we'll see later why). This is easily demonstrated by a test such as: touch a b; seq 10 | { mv -i a b; cat; }. The cat prints nothing because mv has already eaten all ten lines.
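The shared-stdin effect can be reproduced without mv at all. In this minimal sketch, greedy is a hypothetical stand-in for any command that reads its stdin (here it simply swallows everything with cat), and count_iterations is an illustrative name of mine:

```shell
# Hypothetical stand-in for a command that reads stdin,
# like mv -i does when it prompts.
greedy()
{
    cat >/dev/null
}

# Count how many times the loop body runs for three input lines.
count_iterations()
{
    printf '%s\n' one two three | {
        n=0
        while IFS= read -r item
        do
            n=$((n + 1))
            greedy    # shares the loop's stdin and drains "two" and "three"
        done
        echo "$n"
    }
}

count_iterations    # prints 1, not 3
```

Three lines go in, but the body only runs once: by the time read is called again, there is nothing left in the pipe.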

So, how do we solve this? Well, there are two ways I can easily cite off the top of my head:

# 1. Read the answer from the terminal itself (yep, /dev/tty is POSIX)
find . -type f -name '*.zip' -print0 | while IFS= read -r -d '' f
do
    mv -i -- "$f" "${f%.zip}.cbz" </dev/tty
done

# 2. Don't use stdin in the first place!
# Split on newlines only and disable globbing
# (file names containing newlines still break this)
oldIFS=$IFS
IFS='
'
set -f
for f in $(find . -type f -name '*.zip')
do
    mv -i -- "$f" "${f%.zip}.cbz"
done
set +f
IFS=$oldIFS

# Or use find's -exec with a child shell
find . -type f -name '*.zip' -exec sh -c 'mv -i -- "$1" "${1%.zip}.cbz"' argv0 {} \;

# Same, without spawning a shell for each item
find . -type f -name '*.zip' -exec sh -c '
    for f
    do
        mv -i -- "$f" "${f%.zip}.cbz"
    done' argv0 {} +
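
A third option, sketched here under my own assumptions rather than taken from the list above: park the script's original stdin on a spare file descriptor before the pipe takes over fd 0, then hand that descriptor to the interactive command. The name confirm_each and the choice of fd 3 are arbitrary, and a plain read stands in for mv -i:

```shell
confirm_each()
{
    # 3<&0 applies to the whole group, so fd 3 is duplicated from the
    # caller's stdin *before* the pipe replaces fd 0 inside the loop.
    {
        printf '%s\n' "$@" | while IFS= read -r item
        do
            # The loop reads items from the pipe (fd 0),
            # the "prompt" reads its answer from the saved stdin (fd 3).
            IFS= read -r reply <&3
            printf '%s: %s\n' "$item" "$reply"
        done
    } 3<&0
}
```

In the real loop the body would be mv -i -- "$f" "${f%.zip}.cbz" <&3. Note that if the saved stdin is itself a pipe rather than a terminal, the draining behaviour described above still applies to mv.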

Here's a "fun" and very unexpected case of this that cost me a few points of sanity some time ago, what in France you could call "double effet Kiss Cool": ffmpeg reads from stdin even without being told to, in order to provide keybinds during encoding. And from memory, it reads unbuffered, so you only lose a few crucial characters here and there.
tl;dr: always use ffmpeg -nostdin if you don't fancy pain.
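
For commands that have no equivalent of -nostdin, pointing their stdin at /dev/null gives the same protection. A small sketch, with hungry as a hypothetical stand-in for the stdin-eating program and process_all as an illustrative name:

```shell
# Hypothetical stand-in for a program that reads stdin uninvited.
hungry()
{
    cat >/dev/null
}

process_all()
{
    printf '%s\n' one two three | while IFS= read -r item
    do
        # With stdin redirected, the command cannot touch the loop's input.
        hungry </dev/null
        echo "processed: $item"
    done
}
```

This only suits commands that don't actually need an answer from you; for mv -i, /dev/tty remains the right redirection.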

Default stream buffering

This issue isn't very well-known and for good reasons: the heuristic involved is pretty sensible so the corner cases are rarely encountered. Let's look at two of them first:

trickle()
{
    while true
    do
        echo "$1"
        sleep 1
    done
}

slurp()
{
    while IFS= read -r line
    do
        echo "slurped: $line"
    done
}

trickle foo | tr 'a-z' 'A-Z' | slurp

Running this (at least with the GNU tools; busybox is fine, for example) shouldn't show any output unless you wait for a reaaally long time.

The second example is in fact none other than what we saw earlier with mv -i consuming the full stdin without apparent reason!



So, any idea why? Well, it's a pretty low-level reason really: the initial buffering mode for the standard streams is specified in such a way that programs read (resp. write) their data in big chunks instead of line by line if they can determine that stdin (resp. stdout) isn't a terminal (e.g. it's a pipe), likely via the isatty(3) function. stderr is the exception: it is never fully buffered.

Basically, such buffering trades latency for throughput, which is considered a good deal in the vast majority of non-interactive cases.
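
A shell script can make the same terminal-or-not decision itself with test -t, the shell-level counterpart of isatty(3). A minimal sketch (describe_stdout is just an illustrative name):

```shell
describe_stdout()
{
    # -t 1 is true when file descriptor 1 (stdout) is open on a terminal
    if [ -t 1 ]
    then
        echo "terminal"
    else
        echo "not a terminal"
    fi
}
```

Called directly from an interactive shell, it reports a terminal; as soon as you run describe_stdout | cat, it reports the opposite, which is exactly the situation where stdio switches to big chunks.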

Now for the fix! As far as I know, there is no POSIX way of defeating this "feature" when you need to, but at least two utilities exist for that purpose: stdbuf (from GNU coreutils) and unbuffer (shipped with Expect). I'll point to this in-depth post for more details about these and just summarize:

  • Our first issue can be fixed by using stdbuf -oL tr or unbuffer -p tr.
  • And the second with stdbuf -i0 mv -i (stdin was being fully buffered).
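
Since stdbuf isn't POSIX, a script meant to run elsewhere may want to degrade gracefully when it's missing. A sketch under that assumption (linebuf is a made-up helper name, not an existing tool):

```shell
# Run a command with line-buffered stdout when stdbuf is available,
# and unchanged otherwise.
linebuf()
{
    if command -v stdbuf >/dev/null 2>&1
    then
        stdbuf -oL "$@"
    else
        "$@"
    fi
}

# Usage with the earlier example:
#   trickle foo | linebuf tr 'a-z' 'A-Z' | slurp
```

Keep in mind that stdbuf only influences programs that leave stdio's default buffering alone, so it's a best-effort fix.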

You might wonder how one encounters this in the wild? Well, in my case, it's a pretty simple tale of scraping of this kind:

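# extract_page_urls and extract_img_urls stand for filters
# that print one URL per line
curl http://foo.bar/gallery.html |
    extract_page_urls |
    while IFS= read -r page_url
    do
        curl "$page_url"
    done |
    extract_img_urls |
    while IFS= read -r img_url
    do
        curl -O "$img_url"
    done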

At one point I wondered why the final downloading took so long to start…

Unrelated disappointing surprise I encountered that day: I thought curl -K - would let me download asynchronously while still benefitting from significant HTTP pipelining gains. Nope, it only starts after reading the whole config =(.


Thanks for reading and I hope some of this article was interesting! To be honest, I had to find something to start my blog before putting my website online, so I don't blame you if you find this pretty lacking.