A colleague of mine wanted to run a FORTRAN program that takes file arguments and outputs their ordering (best first) against some biophysicochemical criterion. What he needed was the 10 best results.
While the files are not big, the problem is that he got bash: /home/progs/bin/ardock: Argument list too long, so I created 6-digit-long symlinks to the files and used those as arguments, which worked ;-)
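For reference, the trick looks roughly like this (the links directory and the *.dat glob are made up for illustration):

mkdir links
i=0
for f in *.dat; do
    ln -s "$PWD/$f" "links/$(printf '%06d' "$i")"
    i=$((i+1))
done
(cd links && /home/progs/bin/ardock *) | head -n 10

Shorter names mean fewer bytes on the command line, which is what the "Argument list too long" limit actually measures.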
Now, if the number of files is really too large for the above trick to work, what can you possibly do to get the 10 best out of all of them? Do you have to sort the files in chunks and compare the best against the best with something like this?
#!/bin/bash
best10() { ardock "$@" | head -n 10; }
export -f best10
find . -name '*.dat' -exec bash -c 'best10 "$@"' _ {} + |
xargs bash -c 'best10 "$@"' _ |
xargs bash -c 'best10 "$@"' _ |
xargs bash -c ... | ... | ...
The problem here is that the number of required xargs stages is not known in advance, so how can you turn this into a loop?
Note: as each stage passes a linefeed-delimited stream of file paths to the next, I know that xargs can potentially break on unusual filenames. Don't worry about that here; you can consider the filenames to be alphanumeric.
I would suggest solving this problem via an iterative tournament.
The idea is that in the first round, you arbitrarily divide all of your files into groups of N. The top 10 finishers of each group advance to the next round, where you again divide them into groups of N, and so on until no more than 10 remain.
This is guaranteed to give you the top 10, assuming that ardock is deterministic and that it provides a total order: any file in the true top 10 is outranked by at most 9 others, so it can never finish below 10th place in its group and therefore always advances.
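To get a sense of how quickly this converges, here is a quick back-of-the-envelope sketch (not part of the solution; it just mirrors the grouping arithmetic of the script below) that counts the rounds needed for 1000 candidates with groups of 20 keeping 10:

MAX_ARGS=20 KEEP=10
n=1000 rounds=0
while (( n > KEEP )); do
    full=$(( n / MAX_ARGS ))         # groups of exactly $MAX_ARGS files
    rest=$(( n % MAX_ARGS ))         # one final, smaller group (if any)
    (( rest > KEEP )) && rest=$KEEP  # every group keeps at most $KEEP
    n=$(( full * KEEP + rest ))
    rounds=$(( rounds + 1 ))
done
echo "$rounds rounds"                # prints "7 rounds" for these numbers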
Here's the code. I started by creating a test version of your ardock program. It sorts the arguments given to it by their MD5 hash and prints them out, one per line. This is just so I have something to test against.
import sys
import hashlib

def md5(s):
    # Hash each argument, giving an arbitrary but deterministic ranking
    m = hashlib.md5(s.encode('utf8'))
    return m.hexdigest()

# Sort "best first" by MD5 digest and print one filename per line,
# mimicking ardock's output format
args = sys.argv[1:]
args = sorted(args, key=md5)
print('\n'.join(args))
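Saved as test462_ardock_substitute.py, the stand-in can be sanity-checked by hand (the filenames here are invented); the ordering it prints is arbitrary but stable across runs:

python3 test462_ardock_substitute.py mol_a.dat mol_b.dat mol_c.dat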
Next, here is the Bash script which runs the tournament.
#!/bin/bash
# Maximum number of arguments ardock can accept at once
export MAX_ARGS=20
# How many of the top candidates should be kept?
export KEEP=10
# How many parallel copies of ardock to run.
# Use 0 to run one for every core you have.
CORES=1
ardock() {
    python3 test462_ardock_substitute.py "$@"
}
ardock_wrapper() {
    # Run ardock, outputting best $KEEP lines
    ardock "$@" | head -n "$KEEP"
}
export -f ardock
export -f ardock_wrapper
# Create temp dir
dir="$(mktemp -d)"
echo "Created temp dir $dir"
level=0
# Make list of all candidates
seq 1 1000 > "$dir/$level.candidates"
while true; do
    # 1) Read in $level candidates
    # 2) Split into groups of $MAX_ARGS and run ardock on each group
    # 3) Output to $level + 1 candidates file
    < "$dir/$level.candidates" \
        xargs -P "$CORES" -n "$MAX_ARGS" bash -c 'ardock_wrapper "$@"' _ > \
        "$dir/$((level + 1)).candidates"
    ((level+=1))
    # Count lines in output
    linecount="$(wc -l < "$dir/$level.candidates")"
    echo "There are $linecount molecules remaining"
    if [[ "$linecount" -le "$KEEP" ]]; then
        break
    fi
done
echo "Final winners:"
cat "$dir/$level.candidates"
Explanation:
- First, 0.candidates is created. This contains the filenames of every possible file you want to test, separated by newlines. In my case, it's just the first 1000 integers. Since this is a file, it can be as large as you want.
- xargs reads the current round's candidates and passes $MAX_ARGS of them to each invocation of ardock. (Choosing $MAX_ARGS: this must be larger than $KEEP in order to make progress, but it doesn't need to be much larger. For example, if they're 20 and 10, then each tournament round the number of candidates shrinks by a factor of 2. Raising $MAX_ARGS makes the algorithm faster.)
- ardock_wrapper is responsible for taking the top $KEEP lines of each output from ardock.
- The winners of each round are written to 1.candidates, then 2.candidates, and so on, until no more than $KEEP candidates remain.

This code was tested on Linux using Bash 5.0.17.
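To point the script at the real program instead of the stand-in, only two lines need to change; a minimal sketch, reusing the binary path and the *.dat pattern from the question (adjust both to your setup):

ardock() {
    /home/progs/bin/ardock "$@"
}
# ...and build 0.candidates from the real files instead of seq:
find . -name '*.dat' > "$dir/0.candidates"

You may also want to raise $CORES (or set it to 0) so xargs runs several copies of ardock in parallel.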