A colleague of mine wanted to run a FORTRAN program that takes file arguments and outputs their ordering (best first) against some biophysicochemical criterion. What he needed was the 10 best results.
While the files are not big, the problem is that he got bash: /home/progs/bin/ardock: Argument list too long, so I created 6-digit-long symlinks to the files and used those as arguments, which worked ;-)
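For reference, the trick looks roughly like this (the links directory and the *.dat glob are made up for illustration):

mkdir links
i=0
for f in *.dat; do
    ln -s "$PWD/$f" "links/$(printf '%06d' "$i")"
    i=$((i+1))
done
(cd links && /home/progs/bin/ardock *) | head -n 10

Shorter names mean fewer bytes on the command line, which is what the "Argument list too long" limit actually measures.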
Now, if the number of files is really too large for the above trick to work, what can you possibly do to get the 10 best out of all of them? Do you have to sort the files in chunks and compare the best against the best with something like this?
#!/bin/bash
best10() { ardock "$@" | head -n 10; }
export -f best10
find . -name '*.dat' -exec bash -c 'best10 "$@"' _ {} + |
xargs bash -c 'best10 "$@"' _ |
xargs bash -c 'best10 "$@"' _ |
xargs bash -c ... | ... | ...
The problem here is that the number of required xargs stages is not known in advance, so how can you turn this into a loop?
Note: as each stage passes a linefeed-delimited stream of file paths to the next, I know that xargs can potentially break on unusual filenames. Don't worry about that here; you can consider the filenames to be alphanumeric.
I would suggest solving this problem via an iterative tournament.
The idea is that in the first round, you arbitrarily divide all of your files into groups of N. The top 10 finishers of each group advance to the next round, where you again divide them into groups of N, and so on until no more than 10 remain.
This is guaranteed to give you the top 10, assuming that ardock is deterministic and that it provides a total order: any file in the true top 10 is outranked by at most 9 others, so it can never finish below 10th place in its group and therefore always advances.
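To get a sense of how quickly this converges, here is a quick back-of-the-envelope sketch (not part of the solution; it just mirrors the grouping arithmetic of the script below) that counts the rounds needed for 1000 candidates with groups of 20 keeping 10:

MAX_ARGS=20 KEEP=10
n=1000 rounds=0
while (( n > KEEP )); do
    full=$(( n / MAX_ARGS ))         # groups of exactly $MAX_ARGS files
    rest=$(( n % MAX_ARGS ))         # one final, smaller group (if any)
    (( rest > KEEP )) && rest=$KEEP  # every group keeps at most $KEEP
    n=$(( full * KEEP + rest ))
    rounds=$(( rounds + 1 ))
done
echo "$rounds rounds"                # prints "7 rounds" for these numbers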
Here's the code. I started by creating a test version of your ardock program. It sorts the arguments given to it by their MD5 hash and prints them out, one per line. This is just so I have something to test against.
import sys
import hashlib

def md5(s):
    # Hash each argument, giving an arbitrary but deterministic ranking
    m = hashlib.md5(s.encode('utf8'))
    return m.hexdigest()

# Sort "best first" by MD5 digest and print one filename per line,
# mimicking ardock's output format
args = sys.argv[1:]
args = sorted(args, key=md5)
print('\n'.join(args))
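Saved as test462_ardock_substitute.py, the stand-in can be sanity-checked by hand (the filenames here are invented); the ordering it prints is arbitrary but stable across runs:

python3 test462_ardock_substitute.py mol_a.dat mol_b.dat mol_c.dat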
Next, here is the Bash script which runs the tournament.
#!/bin/bash
# Maximum number of arguments ardock can accept at once
export MAX_ARGS=20
# How many of the top candidates should be kept?
export KEEP=10
# How many parallel copies of ardock to run.
# Use 0 to run one for every core you have.
CORES=1
ardock() {
    python3 test462_ardock_substitute.py "$@"
}
ardock_wrapper() {
    # Run ardock, outputting best $KEEP lines
    ardock "$@" | head -n "$KEEP"
}
export -f ardock
export -f ardock_wrapper
# Create temp dir
dir="$(mktemp -d)"
echo "Created temp dir $dir"
level=0
# Make list of all candidates
seq 1 1000 > "$dir/$level.candidates"
while true; do
    # 1) Read in $level candidates
    # 2) Split into groups of $MAX_ARGS and run ardock on each group
    # 3) Output to $level + 1 candidates file
    < "$dir/$level.candidates" \
        xargs -P "$CORES" -n "$MAX_ARGS" bash -c 'ardock_wrapper "$@"' _ > \
        "$dir/$((level + 1)).candidates"
    ((level+=1))
    # Count lines in output
    linecount="$(wc -l < "$dir/$level.candidates")"
    echo "There are $linecount molecules remaining"
    if [[ "$linecount" -le "$KEEP" ]]; then
        break
    fi
done
echo "Final winners:"
cat "$dir/$level.candidates"
Explanation:
- First, 0.candidates is created. This contains the filenames of every possible file you want to test, separated by newlines. In my case, it's just the first 1000 integers. Since this is a file, it can be as large as you want.
- xargs reads the current round's candidates and passes $MAX_ARGS of them to each invocation of ardock. (Choosing $MAX_ARGS: this must be larger than $KEEP in order to make progress, but it doesn't need to be much larger. For example, if they're 20 and 10, then each tournament round the number of candidates shrinks by a factor of 2. Raising $MAX_ARGS makes the algorithm faster.)
- ardock_wrapper is responsible for taking the top $KEEP lines of each output from ardock.
- The winners of each round are written to 1.candidates, then 2.candidates, and so on, until no more than $KEEP candidates remain.

This code was tested on Linux using Bash 5.0.17.
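To point the script at the real program instead of the stand-in, only two lines need to change; a minimal sketch, reusing the binary path and the *.dat pattern from the question (adjust both to your setup):

ardock() {
    /home/progs/bin/ardock "$@"
}
# ...and build 0.candidates from the real files instead of seq:
find . -name '*.dat' > "$dir/0.candidates"

You may also want to raise $CORES (or set it to 0) so xargs runs several copies of ardock in parallel.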