
parallel download of 7000 files

Could you advise on an effective method to download a large number of files from the EBI: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/tree/master/tabix

We can use wget sequentially on each file. I have also seen some information about using a Python script: How to parallelize file downloads?

although there might be complementary approaches using a bash script or R?
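
For reference, the sequential approach we have in mind would be something like this (a minimal sketch; urllist would be a plain-text file with one URL per line):

$ # download every URL in the list, one after another
$ wget -q -i urllist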

Bogdan asked Sep 07 '25


1 Answer

If you don't require R here, the xargs command-line utility allows parallel execution. (I'm using the Linux version from the findutils set of utilities. I believe parallel execution is also supported by the xargs that ships with git-bash. I don't know whether the macOS binary is installed by default or whether it includes this option, YMMV.)
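
If you are unsure whether your local xargs has this feature, a quick check is to look for the -P / --max-procs option (a small sketch; GNU xargs accepts --help, but on BSD/macOS you may need man xargs instead):

$ # GNU xargs documents -P as --max-procs in its help text
$ xargs --help 2>&1 | grep -i 'max-procs'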

For proof, I'll create a mywget script that prints the start time (and args) and then passes all arguments to wget.

(mywget)

#!/bin/sh
# print the start time and the arguments, then pass everything through to wget
echo "$(date) :: ${@}"
wget "${@}"
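
Make the script executable so that xargs can run it directly:

$ chmod +x mywget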

I also have a text file urllist with one URL per line (it's crafted so that I don't have to encode anything or worry about spaces, etc.). (Because I'm benchmarking this against a personal remote server and I don't want the slashdot effect, I'll obfuscate the URLs here ...)

(urllist)

https://somedomain.com/quux0
https://somedomain.com/quux1
https://somedomain.com/quux2
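
If the remote file names follow a predictable pattern, the list itself can also be built in the shell (a hypothetical sketch; the base URL and filenames.txt are made up and would need to be adapted to the real paths):

$ # prepend a base URL to each file name, writing the result to urllist
$ base="https://somedomain.com/files"
$ while read -r f; do echo "$base/$f"; done < filenames.txt > urllist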

First, no parallelization, simply consecutive (the default). (The -a urllist tells xargs to read items from the file urllist instead of stdin. The -q makes wget quiet; it's not required, but it is very helpful when doing things in parallel, since the usual verbose output has progress bars that would overlap each other.)

$ time xargs -a urllist ./mywget -q
Tue Feb  1 17:27:01 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb  1 17:27:10 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb  1 17:27:12 EST 2022 :: -q https://somedomain.com/quux2

real    0m13.375s
user    0m0.210s
sys     0m0.958s

Second, adding -P3 so that I run up to 3 simultaneous processes. The -n1 is required so that each call to ./mywget gets only one URL. You can adjust this if you want a single call to download multiple files consecutively.

$ time xargs -n1 -P3 -a urllist ./mywget -q
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux2

real    0m13.088s
user    0m0.272s
sys     0m1.664s

In this case, as BenBolker suggested in a comment, parallel downloading saved me nothing; it still took 13 seconds. However, you can see that in the first block the downloads started sequentially, with gaps of 9 seconds and 2 seconds between the three starts. (We can infer that the first file is much larger, taking about 9 seconds, and the second file took about 2 seconds.) In the second block, all three started at the same time.
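
For the original question's roughly 7000 files, the same pattern scales by raising -P (and, if you want fewer process launches, -n); here wget is called directly, since mywget was only for the timing demonstration (a sketch, not a tuned recommendation, and keep -P modest so you don't hammer the remote server):

$ # up to 8 wget processes at a time, each handed 10 URLs per invocation
$ xargs -n10 -P8 -a urllist wget -q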

(Side note: this doesn't require a shell script at all; you can use R's system() or the processx::run() function to call xargs -n1 -P3 wget -q with a text file of URLs that you create in R. So you can still do this comfortably from the warmth of your R console.)

r2evans answered Sep 08 '25