I have a bash script that applies different transformations (mappings) to the columns of a TSV file. I am trying to parallelize the transformations using GNU parallel, but my code hangs.
For simplicity, consider cat, the identity mapper (i.e. input -> output), and a TSV file of three columns (generated on the fly using paste and seq):
n=1000000
map=cat    # identity: inp -> out
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
    | tee >(cut -f1 | $map > tmp.col1.fifo) \
    | tee >(cut -f2 | $map > tmp.col2.fifo) \
    | cut -f3- \
    | paste tmp.col{1,2}.fifo - \
    | python -m tqdm > /dev/null
The above code works fine.
NOTE: python -m tqdm > /dev/null prints the speed.
Next, we can parallelize the mapping tasks using GNU parallel's --pipe --keep-order arguments. Here is a minimal parallel example that works:
seq 100 | parallel --pipe -k -j4 -N10 'cat && sleep 1'
Now, putting all these together, here is my code that maps the TSV columns in parallel:
n=1000000
map=cat   # identity map: inp -> out
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
  | tee >(cut -f1 | parallel --id jobA --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo) \
  | tee >(cut -f2 | parallel --id jobB --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo) \
  | cut -f3- \
  | paste tmp.col{1,2}.fifo - \
  | python -m tqdm > /dev/null
This code was supposed to work; however, it freezes. Why does it freeze, and how can I unfreeze it?
Environment: Linux 5.15.0-116-generic, Ubuntu 22.04.4 LTS on x86_64
It is a race condition with the FIFOs, not GNU Parallel.
Assume this:
| tee >(cut -f1 | $map1 > tmp.col1.fifo) \
| tee >(cut -f2 | $map2 > tmp.col2.fifo) \
| cut -f3- \
| paste tmp.col{1,2}.fifo - \
Assume that $map1 prints very little and $map2 prints a lot.
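paste tries to read a line from tmp.col1.fifo, but there is nothing to read, so it blocks. Meanwhile $map2 prints a lot to tmp.col2.fifo and fills the FIFO, so it blocks, too. Once $map2 stops reading, the tee feeding it stalls as well, so $map1 never receives the input it needs to print its next line, and the whole pipeline hangs.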
You have just been lucky that the race condition did not hit you earlier.
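The blocking half of that scenario is easy to reproduce in isolation. Here is a minimal sketch, assuming Linux with coreutils timeout and a pipe buffer well below 1 MB (64 KiB is the usual default). The names demo.fifo, reader and result are made up for the illustration. A reader opens the FIFO but never reads from it, and the writer gets stuck once the buffer is full:

```shell
#!/usr/bin/env bash
# Illustration only: a writer to a FIFO blocks once the kernel buffer is full.
dir=$(mktemp -d)
mkfifo "$dir/demo.fifo"

# Reader: opens the FIFO but never reads from it
# (like paste, which is busy waiting on another FIFO).
sleep 5 < "$dir/demo.fifo" &
reader=$!

# Writer: tries to push about 1 MB. timeout kills it after 1 s if it is stuck.
if timeout 1 head -c 1000000 /dev/zero > "$dir/demo.fifo"; then
    result="completed"
else
    result="blocked"
fi
echo "writer $result"

# Clean up the idle reader and the temporary directory.
kill "$reader" 2>/dev/null || true
wait "$reader" 2>/dev/null || true
rm -rf "$dir"
```

In the real pipeline nothing kills the stuck writer, so the block is permanent.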
You can of course use temporary files to solve this, but I have the feeling you are trying to avoid that.
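For reference, the temporary-file variant could look roughly like the sketch below. It is an illustration, not a drop-in replacement: n is kept small, the input is generated with awk instead of paste <(seq ...) so the sketch also runs in a plain POSIX shell, and the file names under tmpdir are invented. To parallelize, $map would be replaced by the parallel --pipe -k ... "$map" call from the question.

```shell
#!/usr/bin/env bash
# Sketch: decouple the column writers from paste by using regular files.
n=1000            # small n for the illustration
map=cat           # identity map, as in the question
tmpdir=$(mktemp -d)

# Three identical tab-separated columns, like paste <(seq $n) <(seq $n) <(seq $n).
seq "$n" | awk -v OFS='\t' '{print $1, $1, $1}' > "$tmpdir/in.tsv"

# Each column is mapped independently. A regular file does not fill up the way
# a FIFO does, so no writer has to wait for paste.
cut -f1  "$tmpdir/in.tsv" | $map > "$tmpdir/col1" &
cut -f2  "$tmpdir/in.tsv" | $map > "$tmpdir/col2" &
cut -f3- "$tmpdir/in.tsv"        > "$tmpdir/col3" &
wait    # all columns are complete before paste starts

paste "$tmpdir/col1" "$tmpdir/col2" "$tmpdir/col3" > "$tmpdir/out.tsv"

# With the identity map, the output should equal the input.
if cmp -s "$tmpdir/in.tsv" "$tmpdir/out.tsv"; then same=yes; else same=no; fi
echo "identical: $same"
rm -rf "$tmpdir"
```

The cost is disk space and the loss of streaming: paste only starts after every column has been written completely.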
Maybe you can "increase" the size of the FIFO with a tool like mbuffer:
  | tee >(cut -f1 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col1.fifo) \
  | tee >(cut -f2 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col2.fifo) \
  | cut -f3- | mbuffer -q -m6M -b5 \
  | paste tmp.col{1,2}.fifo - \
  | python -m tqdm > /dev/null
But unless you know that the nature of your data is not going to change, this is a fragile solution that just kicks the can a bit further down the road.
How about this instead?
n=1000000
map=cat   # identity map: inp -> out
rm -f tmp.col{1,2,3,4}.fifo
mkfifo tmp.col{1,2,3,4}.fifo
paste <(seq $n) <(seq $n) <(seq $n) | cut -f1 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo &
paste <(seq $n) <(seq $n) <(seq $n) | cut -f2 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo &
paste <(seq $n) <(seq $n) <(seq $n) | cut -f3 > tmp.col3.fifo &
paste <(seq $n) <(seq $n) <(seq $n) > tmp.col4.fifo &
paste tmp.col{1,2,3,4}.fifo | python -m tqdm > /dev/null
You will run a few more pastes, but if CPU is not a problem, then this should give you no race conditions.
(Also: --id (aka. --semaphore-name) is not used with --pipe but only with --semaphore. See https://www.gnu.org/software/parallel/parallel_options_map.pdf)
(Also also: If you do not need exactly 1000 entries (-N1000) then --block is faster).
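A sketch of the --block variant, assuming GNU Parallel is installed (the block size of 100k is an arbitrary choice for the illustration, and out and expected are made-up variable names). If GNU Parallel is not found, the sketch falls back to running the map serially so that it still runs:

```shell
#!/usr/bin/env bash
# Sketch: split stdin by size (--block) instead of by record count (-N).
map=cat   # identity map, as in the question

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Blocks of roughly 100 kB, split at newlines; -k keeps the output in input order.
    out=$(seq 100000 | parallel --pipe -k -j4 --block 100k "$map" | cksum)
else
    # GNU Parallel not available: run the map serially instead.
    out=$(seq 100000 | $map | cksum)
fi

# With the identity map and -k, the checksum should match the plain input.
expected=$(seq 100000 | cksum)
if [ "$out" = "$expected" ]; then echo "output matches input"; fi
```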