I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically a data integrity check). Currently, after uploading the file, I re-download it, compute the MD5 of the content string, and compare it with the MD5 of the local file. So something like:
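import math
import os
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from filechunkio import FileChunkIO

conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size
mp = bucket.initiate_multipart_upload(os.path.basename(source_path))
k = Key(bucket)  # assumed definition of k; used only for its compute_md5 helper
chunk_size = 52428800  # 50 MiB per part
# float division so the final partial chunk is counted on Python 2 as well
chunk_count = int(math.ceil(source_size / float(chunk_size)))
for i in range(chunk_count):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1, md5=k.compute_md5(fp, bytes))
mp.complete_upload()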
    
obj_key = bucket.get_key('file_name')
print(obj_key.md5) #prints None
print(obj_key.base64md5) #prints None
content = bucket.get_key('file_name').get_contents_as_string()
# compute the md5 on content
This approach is wasteful as it doubles the bandwidth usage. I tried
bucket.get_key('file_name').md5 
bucket.get_key('file_name').base64md5 
but both return None.
Is there any other way to get the MD5 without downloading the whole thing?
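Yes.
Use bucket.get_key('file_name').etag[1:-1]
This returns the key's ETag with the surrounding quotes stripped, without downloading its contents. For an object uploaded with a single PUT, the ETag is typically the MD5 of the content. For a multipart upload, as in the question, it is not the MD5 of the whole file: it is a digest derived from the per-part MD5s, followed by a dash and the part count, so it cannot be compared directly with the local file's MD5.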
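For multipart uploads, the ETag is commonly observed to be the MD5 of the concatenated binary MD5 digests of the parts, followed by "-" and the number of parts. Below is a minimal sketch that computes this value locally so it can be compared with the ETag. The function name multipart_etag is hypothetical, and the sketch assumes the same part size that was used for the upload and an object that is not encrypted with SSE-KMS or SSE-C (those ETags are not MD5-based).

```python
import hashlib


def multipart_etag(path, part_size):
    """Compute the ETag value expected for a multipart upload of `path`.

    Assumes every part except the last is exactly `part_size` bytes.
    """
    digests = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            # Keep the raw 16-byte digest of each part, not the hex string
            digests.append(hashlib.md5(chunk).digest())
    if not digests:
        raise ValueError('empty file: no parts to hash')
    # MD5 over the concatenated part digests, then "-<part count>"
    combined = hashlib.md5(b''.join(digests)).hexdigest()
    return '{}-{}'.format(combined, len(digests))
```

With the upload from the question, multipart_etag(source_path, 52428800) would be compared against the key's etag[1:-1]. A mismatch can also mean the part size differs from the one used at upload time, not only that the data is corrupt.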
With boto3, I use head_object to retrieve the ETag.
import boto3
import botocore
def s3_md5sum(bucket_name, resource_name):
    try:
        md5sum = boto3.client('s3').head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except botocore.exceptions.ClientError:
        # Missing key or access error: report no checksum instead of raising
        md5sum = None
    return md5sum
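The value returned above is only a plain MD5 when the object was uploaded in one piece. As a rough sketch, the ETag can be inspected to see which case applies before comparing it with a local checksum. The helper name classify_etag is hypothetical, and the pattern assumes the usual 32-hex-digit form with an optional "-N" part-count suffix.

```python
import re


def classify_etag(etag):
    """Return (hex_digest, part_count); part_count is None for a plain MD5 ETag."""
    value = etag.strip('"')  # tolerate ETags that still carry their quotes
    match = re.fullmatch(r'([0-9a-f]{32})(?:-(\d+))?', value)
    if match is None:
        raise ValueError('unrecognised ETag format: %r' % etag)
    digest, parts = match.groups()
    return digest, (int(parts) if parts else None)
```

If part_count is None, the digest can be compared with the local file's MD5 directly. Otherwise the object came from a multipart upload and the digest is not the MD5 of the whole file.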