I am trying to download a large archive (~1 TB) from Glacier using the Python package Boto. The method I am currently using looks like this:
import os
import boto.glacier
import boto
import time
ACCESS_KEY_ID = 'XXXXX'
SECRET_ACCESS_KEY = 'XXXXX'
VAULT_NAME = 'XXXXX'
ARCHIVE_ID = 'XXXXX'
OUTPUT = 'XXXXX'
layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                              aws_secret_access_key = SECRET_ACCESS_KEY)
gv = layer2.get_vault(VAULT_NAME)
job = gv.retrieve_archive(ARCHIVE_ID)
job_id = job.id
while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)
if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT)
The problem is that the job ID expires after 24 hours, which is not enough time to retrieve the entire archive. I will need to break the download into at least 4 pieces. How can I do this and write the output to a single file?
It seems that you can simply pass the chunk_size parameter when calling job.download_to_file, like so:
if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT, chunk_size=1024*1024)
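To make the effect of chunk_size concrete, here is a minimal sketch of how an archive of a given size splits into inclusive byte ranges, one per request. The helper name chunk_ranges is my own and not part of boto; it only illustrates the arithmetic, with the last range clamped to the final byte.

```python
def chunk_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte ranges covering total_size bytes."""
    start = 0
    while start < total_size:
        # Clamp the last range so it never runs past the final byte.
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)
        start = end + 1


# A 10-byte archive read in 4-byte chunks needs three requests.
ranges = list(chunk_ranges(10, 4))
```

The same arithmetic applies to a 1 TB archive; only the numbers get bigger.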
However, if you can't download all the chunks within the 24 hours, I don't think layer2 lets you download only the ones you missed.
Using layer1, you can call the get_job_output method and specify the byte range you want to download.
It would look like this:
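CHUNK = 1024 * 1024
# Resume from the number of bytes already on disk (0 on the first run).
start = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0
if job.completed:
    print "Downloading archive"
    # Append mode: 'wb' would truncate the partial file on every rerun.
    with open(OUTPUT, 'ab') as output_file:
        while start < job.archive_size:
            # Byte ranges are inclusive, so the last byte of a chunk is start + CHUNK - 1.
            end = min(start + CHUNK, job.archive_size) - 1
            # get_job_output is a layer1 method; layer2 exposes the layer1 connection.
            response = layer2.layer1.get_job_output(VAULT_NAME, job_id, (start, end))
            output_file.write(response.read())
            start = end + 1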
With this script, you should be able to rerun it after a failure and continue downloading the archive where you left off.
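The resume logic can be tried out without touching Glacier at all. Below is a minimal sketch in which fetch_range is a hypothetical callable returning the bytes for an inclusive range (with boto it would wrap the layer1 get_job_output call), and an in-memory byte string stands in for the archive. The names resume_download and fake_fetch are mine, not boto's.

```python
import os
import tempfile


def resume_download(path, total_size, fetch_range, chunk_size=1024 * 1024):
    """Append the missing bytes of an archive to path, one chunk at a time.

    fetch_range(start, end) is a hypothetical callable that returns the
    bytes for the inclusive range [start, end].
    """
    # Whatever is already on disk was downloaded by an earlier run.
    start = os.path.getsize(path) if os.path.exists(path) else 0
    # Append mode keeps the existing bytes instead of truncating them.
    with open(path, "ab") as output_file:
        while start < total_size:
            end = min(start + chunk_size, total_size) - 1
            output_file.write(fetch_range(start, end))
            start = end + 1
    return start


# Demo: an in-memory "archive" standing in for Glacier.
archive = b"0123456789" * 100  # 1000 bytes


def fake_fetch(start, end):
    return archive[start:end + 1]


path = os.path.join(tempfile.mkdtemp(), "archive.bin")
# Simulate an interrupted earlier run that saved the first 250 bytes.
with open(path, "wb") as f:
    f.write(archive[:250])

resume_download(path, len(archive), fake_fetch, chunk_size=64)
with open(path, "rb") as f:
    restored = f.read()
```

Note that this trusts the partial file on disk: if an earlier run died in the middle of a write, the file is still a valid prefix of the archive, so resuming from its current size is safe.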
Digging through the boto code, I found a "private" method in the Job class that you might also use: _download_byte_range. With this method you can still use layer2.
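import socket
CHUNK = 1024 * 1024
# Resume from the number of bytes already on disk (0 on the first run).
start = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0
if job.completed:
    print "Downloading archive"
    # Append mode: 'wb' would truncate the partial file on every rerun.
    with open(OUTPUT, 'ab') as output_file:
        while start < job.archive_size:
            end = min(start + CHUNK, job.archive_size) - 1
            # Private API, so check the signature in your boto version. In the source I read,
            # it takes an inclusive (start, end) tuple plus the exceptions to retry on,
            # and returns a (data, tree_hash) pair.
            data, _ = job._download_byte_range((start, end), (socket.error,))
            output_file.write(data)
            start = end + 1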
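As far as I can tell from the boto source, the main thing the private method adds is that it retries a failed chunk request a few times. If you would rather not depend on a private method, a similar retry loop is easy to sketch around any range-fetching call. The names fetch_with_retries, fetch_range and flaky_fetch below are hypothetical, and the defaults (5 attempts, retry on socket.error) are my assumption about sensible values, not a boto API.

```python
import socket
import time


def fetch_with_retries(fetch_range, start, end, attempts=5, delay=0.0,
                       retry_exceptions=(socket.error,)):
    """Call fetch_range(start, end), retrying on the given exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_range(start, end)
        except retry_exceptions as error:
            last_error = error
            time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Demo: a flaky fetcher that fails twice before succeeding.
state = {"calls": 0}


def flaky_fetch(start, end):
    state["calls"] += 1
    if state["calls"] < 3:
        raise socket.error("simulated connection reset")
    return b"x" * (end - start + 1)


data = fetch_with_retries(flaky_fetch, 0, 9)
```

In a real download you would probably pass a non-zero delay so that a brief network hiccup has time to clear before the next attempt.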