
Check S3 bucket for new files in last two hours

I need to create a monitoring tool that checks buckets (each with 1000+ files) for new objects created in the last two hours and sends a message if no objects were created. My first idea was a Lambda function that runs every 20 minutes, so I wrote the following Python 3 + boto3 code:

import boto3
from datetime import datetime, timedelta
import pytz

s3 = boto3.resource('s3')
sns = boto3.client('sns')

buckets = ['bucket1', 'bucket2', 'bucket3']

def check_bucket(event, context):
    # Reset on every invocation: Lambda can reuse the execution
    # environment, so a module-level list would keep growing.
    check_fail = []
    time_now_utc = datetime.utcnow().replace(tzinfo=pytz.UTC)
    delta_hours = time_now_utc - timedelta(hours=2)

    for bucket_name in buckets:
        bucket = s3.Bucket(bucket_name)
        for key in bucket.objects.all():
            if key.last_modified >= delta_hours:
                print("There are new files in the bucket %s" % bucket_name)
                break
        else:
            # for/else: runs only when no object was new enough
            check_fail.append(bucket_name)

    if check_fail:
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:xxxxxxxxxxxxxx:xxxxxx',
            Message="The following buckets didn't receive new files for longer than 2 hours: %s" % check_fail,
            Subject='AWS Notification Message')
    else:
        print("All buckets have new files")

This approach is not working due to the high number of objects inside every bucket: iterating over every key and checking its last_modified takes too long.

Does anyone have an idea on how I can achieve this?

Thank you!

Andrey asked Sep 15 '25 04:09


1 Answer

As you've seen, S3 is optimised for fetching an object whose path you already know, rather than for listing and querying objects. In fact, the ListObjects API is not guaranteed to be stable while you iterate, and with large sets you're likely to miss files added while the query is in progress.

Depending on the number of buckets you have, a way round this would be to use Lambda triggers on S3 events:

  • S3 automatically raises an s3:ObjectCreated event and invokes your Lambda
  • The Lambda sets a "LastUpdate" attribute on that bucket's entry in DynamoDB
  • Every 20 minutes (or so), you query/scan the DynamoDB table to see when the latest update was
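The steps above could be sketched roughly like this. The table name "bucket-activity" and its "bucket" partition key are assumptions for illustration, not anything from your setup:

```python
from datetime import datetime, timedelta, timezone

def record_activity(event, context):
    """Lambda handler for s3:ObjectCreated:* events.

    Stamps the source bucket's last-update time in a DynamoDB table
    (the table name "bucket-activity" here is hypothetical).
    """
    import boto3  # imported lazily so the pure helper below has no AWS dependency
    table = boto3.resource("dynamodb").Table("bucket-activity")
    for record in event["Records"]:
        table.put_item(Item={
            "bucket": record["s3"]["bucket"]["name"],
            # eventTime is ISO 8601 UTC, e.g. "2025-09-15T04:09:00.000Z"
            "last_update": record["eventTime"],
        })

def stale_buckets(last_updates, now, max_age=timedelta(hours=2)):
    """Pure check for the 20-minute poller: given a dict of
    bucket name -> last-update datetime, return the buckets whose
    newest object is older than max_age."""
    return sorted(b for b, ts in last_updates.items() if now - ts > max_age)
```

The scheduled Lambda then only reads one small DynamoDB item per bucket instead of listing thousands of S3 keys, and can feed the result of stale_buckets straight into the existing sns.publish call.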

Another solution would be to enable CloudWatch monitoring on the bucket: https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudwatch-monitoring.html

You could then sum the PutRequests and PostRequests metrics over the last two hours (you can fetch CloudWatch metrics programmatically using boto3) to get an indication of updates (although the count is only likely to be accurate if files are written once and never edited).
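A rough sketch of that query, assuming request metrics are enabled on the bucket; the FilterId value "EntireBucket" is an assumption and should be whatever metrics-filter id you configured:

```python
from datetime import datetime, timedelta, timezone

def count_put_requests(bucket_name, hours=2):
    """Sum the S3 PutRequests metric for a bucket over the last `hours` hours.

    Requires S3 request metrics to be enabled; the FilterId below
    ("EntireBucket") is an assumed configuration name.
    """
    import boto3  # lazy import: the summing helper below is AWS-free
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="PutRequests",
        Dimensions=[
            {"Name": "BucketName", "Value": bucket_name},
            {"Name": "FilterId", "Value": "EntireBucket"},  # assumed filter id
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600,          # one datapoint per hour
        Statistics=["Sum"],
    )
    return sum_datapoints(resp["Datapoints"])

def sum_datapoints(datapoints):
    """Pure helper: total the Sum statistic across returned datapoints."""
    return sum(dp["Sum"] for dp in datapoints)
```

If count_put_requests returns 0 for a bucket, nothing was uploaded in the window and you can fire the SNS notification, without ever listing the bucket's objects.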

thomasmichaelwallace answered Sep 16 '25 18:09