
Boto3 - Download file only if modified since specified timestamp

I'm trying to download a file from AWS S3 based on the file's modified attribute. Currently I cannot use any method other than making repeated calls for the file and downloading it if it appears to have been modified.

This is what I have:

import boto3
import botocore
from datetime import datetime
from time import sleep

session = boto3.Session(profile_name='test', region_name='us-west-2')
client = session.client('s3')

bucket_name = 'my_bucket'
my_file = 'testfile.txt'

def poll_s3(timestamp):
    response = client.get_object(
        Bucket=bucket_name,
        Key=my_file,
        IfModifiedSince=timestamp
    )
    print(response)

m_timestamp = datetime.now()

while True:
    sleep(5)
    try:
        poll_s3(m_timestamp)
        m_timestamp = datetime.now()
        print('modified at ', m_timestamp)
    except botocore.exceptions.ClientError as e:
        print('Not modified at', m_timestamp)

The idea is to start with a timestamp and check whether the file has been modified since then. If it has, download it and update the timestamp to the time of that download; if not, ignore it and retry in 5 seconds.

However, my script keeps printing

modified at  2019-09-03 7:37:46.102198
modified at  2019-09-03 7:37:51.262606
modified at  2019-09-03 7:37:56.455355
modified at  2019-09-03 7:38:01.608554

even though the file hasn't been modified in days...

letsc asked Oct 23 '25 23:10



2 Answers

IfModifiedSince has to be specified in GMT (UTC).

Use datetime.utcnow() instead of datetime.now().
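On Python 3 the same idea can be written with timezone-aware datetimes, since datetime.utcnow() returns a naive value and is deprecated as of Python 3.12. The sketch below is only an illustration: the helper names utc_now and as_utc and the UTC+05:30 offset are made up for the example, and it assumes that an aware UTC datetime is acceptable as the IfModifiedSince argument.

```python
from datetime import datetime, timedelta, timezone


def utc_now():
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


def as_utc(dt):
    """Convert an aware datetime to UTC; reject naive ones instead of guessing."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a tzinfo before converting")
    return dt.astimezone(timezone.utc)


# Illustration with a fixed, made-up offset of UTC+05:30.
local_zone = timezone(timedelta(hours=5, minutes=30))
local_time = datetime(2019, 9, 3, 7, 37, 46, tzinfo=local_zone)
print(as_utc(local_time))  # 2019-09-03 02:07:46+00:00
```

In the polling script, m_timestamp = utc_now() would then take the place of datetime.now().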

ckedar answered Oct 26 '25 12:10



I think you should initialize m_timestamp with a time that is certainly earlier than the object's timestamp and then, to be safe, read the new value from the response instead of using the time at which you made the request (otherwise you would miss a modification that happens during the polling interval).

def poll_s3(timestamp):
    response = client.get_object(
        Bucket=bucket_name,
        Key=my_file,
        IfModifiedSince=timestamp
    )
    return response

# Start well before the object's timestamp (like the example request in the docs)
m_timestamp = datetime(2015, 1, 1)

while True:
    sleep(5)
    try:
        response = poll_s3(m_timestamp)
        # Use the timestamp reported by S3, not the local clock
        m_timestamp = response['LastModified']
        print('Modified at', m_timestamp)
    except botocore.exceptions.ClientError:
        print('Not modified since', m_timestamp)

I'm not sure about your test setup, but even if you updated the S3 object after your program started, you may not have been able to retrieve it because of time zone problems. datetime.now() returns a datetime without time zone information, which the datetime documentation calls a “naive” datetime, as opposed to an “aware” one. Starting from a timestamp far enough back in time, and from then on using the timestamps exactly as returned by the client, should make the behavior of your program independent of how time zones are handled.
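One caveat about the polling loop: its except clause treats every ClientError as "not modified", so a missing key or a permissions problem would be reported the same way. A rough sketch of a stricter check follows. The helper name is_not_modified and the FakeClientError stand-in are made up for illustration, and the check assumes that the exception's response dict carries the error code "304" or the HTTP status 304 when the object is unchanged.

```python
def is_not_modified(error):
    """Return True if a botocore ClientError looks like an HTTP 304 response."""
    response = getattr(error, "response", None) or {}
    code = str(response.get("Error", {}).get("Code", ""))
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code == "304" or status == 304


# Intended use inside the polling loop (client and names as in the answer):
#
#     try:
#         response = poll_s3(m_timestamp)
#         m_timestamp = response['LastModified']
#     except botocore.exceptions.ClientError as e:
#         if not is_not_modified(e):
#             raise  # a real failure, e.g. missing key or denied access
#         print('Not modified since', m_timestamp)


class FakeClientError(Exception):
    """Hypothetical stand-in for ClientError so the helper runs without AWS."""

    def __init__(self, response):
        super().__init__("fake")
        self.response = response


not_modified = FakeClientError(
    {"Error": {"Code": "304", "Message": "Not Modified"},
     "ResponseMetadata": {"HTTPStatusCode": 304}}
)
missing = FakeClientError(
    {"Error": {"Code": "NoSuchKey", "Message": "made-up message"},
     "ResponseMetadata": {"HTTPStatusCode": 404}}
)
print(is_not_modified(not_modified))  # True
print(is_not_modified(missing))       # False
```

Re-raising anything that is not a 304 keeps real failures visible instead of letting the loop run forever on a broken request.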

Walter Tross answered Oct 26 '25 14:10
