Downloading a Large Number of Files from S3

Question

What's the fastest way to get a large number of relatively small files (10-50 kB each) from Amazon S3 using Python? The total is on the order of 200,000 to a million files.

At the moment I am using boto to generate signed URLs, and PyCURL to get the files one by one.

Would some type of concurrency help? For example, PyCURL's CurlMulti object?

I am open to all suggestions. Thanks!

gburgoon · Accepted Answer

I don't know anything about Python, but in general you would want to break the task down into smaller chunks so that they can run concurrently. You could split the files by type, alphabetically, or by some other criterion, and then run a separate script for each portion.
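As a rough illustration of the splitting idea, the sketch below groups S3 keys by their leading characters so that each group can be handed to a separate script or worker. The function name `group_by_prefix` and the sample keys are made up for this example; any other criterion (file type, date, hash of the key) would work the same way.

```python
from collections import defaultdict


def group_by_prefix(keys, prefix_len=1):
    """Group S3 keys by their first `prefix_len` characters.

    Hypothetical helper: each resulting group can be written to its own
    manifest file and processed by a separate script.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[key[:prefix_len]].append(key)
    return dict(groups)


if __name__ == "__main__":
    # Sample keys, invented for illustration only.
    sample = ["alpha/1.json", "alpha/2.json", "beta/1.json"]
    for prefix, members in sorted(group_by_prefix(sample).items()):
        print(prefix, len(members))
```

If the keys are unevenly distributed across prefixes, the groups will be unevenly sized, so a longer prefix or a hash-based split may balance the work better.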

Kathy Van Stone · Answer

In the case of Python, as this is IO bound, multiple threads will make good use of the CPU, but they will probably use only one core. If you have multiple cores, you might want to consider the new multiprocessing module. Even then you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.
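One possible way to combine processes and threads is sketched below with the standard `concurrent.futures` module (Python 3). This is an assumption-laden outline, not a tuned solution: `fetch_one` uses `urllib` as a stand-in for the PyCURL call in the question, and the function names and the default counts of 4 processes and 16 threads are placeholders to be adjusted by measurement.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from urllib.request import urlopen


def fetch_one(url):
    """Download one signed URL and return its body (stand-in for PyCURL)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def split_into_chunks(items, n):
    """Deal items round-robin into n lists of nearly equal size."""
    return [items[i::n] for i in range(n)]


def download_chunk(urls, fetch=fetch_one, threads=16):
    """Runs inside one process: fetch its share of URLs with a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as `urls`.
        return list(zip(urls, pool.map(fetch, urls)))


def download_with_processes(urls, processes=4):
    """Spread the URLs over several processes, each running its own threads."""
    chunks = split_into_chunks(urls, processes)
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for chunk_result in pool.map(download_chunk, chunks):
            results.extend(chunk_result)
    return results
```

On platforms that start worker processes by spawning a fresh interpreter, `download_with_processes` should be called from under an `if __name__ == "__main__":` guard.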

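If you do use multiple threads, this is a good candidate for the Queue class (`Queue.Queue` in Python 2, `queue.Queue` in Python 3): put every URL on the queue and let a fixed set of worker threads pull from it.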
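A minimal sketch of the queue-based approach follows, written for Python 3. The `fetch` argument is a hypothetical callable that takes a URL and returns the downloaded bytes (for instance a wrapper around PyCURL); the function name and the default of 8 threads are likewise assumptions for illustration.

```python
import queue
import threading


def download_with_queue(urls, fetch, num_threads=8):
    """Fetch every URL using a fixed pool of worker threads fed by a queue.

    Returns a dict mapping each URL to its data, or to the exception
    raised while fetching it.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this thread
                break
            try:
                data = fetch(url)
            except Exception as exc:  # record the failure and keep going
                data = exc
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Storing exceptions in the result dict keeps one failed download from stopping the whole run; the failed URLs can be collected afterwards and retried.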

Tags:

python

curl

amazon-web-services

amazon-s3

boto

The Unknown
