Is there a way to merge multiple CSV files uploaded to AWS S3 bucket using Python?

I need to set up an AWS Lambda function that triggers when new CSV files are uploaded to an S3 bucket, merges them into one Master file (they will all have the same number of columns and the same column names), and then uploads that Master file to another S3 bucket.

I'm using Python for the Lambda function. I created a zip archive containing my Lambda function and the dependencies I used (Pandas and NumPy) and uploaded that.

Currently I have to include the CSV files I want merged in the zip archive itself; the function merges those files, and the output (the Master file) only shows up in the logs when I check CloudWatch.

I don't know how to link my code to the S3 buckets for input and output.

This is for an app I'm working on.

Here's the Python code I'm using:

    import glob
    import pandas as pd

    def handler(event, context):
        # Find all CSV files bundled alongside the function code
        # (glob pattern matching on the 'csv' extension).
        extension = 'csv'
        all_filenames = glob.glob('*.{}'.format(extension))

        # Combine all the files in the list into one DataFrame.
        combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

        # Export to CSV; /tmp is the only writable path in Lambda.
        combined_csv.to_csv("/tmp/combined_csv.csv", index=False, encoding='utf-8-sig')
        with open("/tmp/combined_csv.csv", "r") as f:
            print(f.read())

I would like to stop manually packaging the CSV files in the same zip archive as my Python script every time, and to have the output Master CSV file land in a separate S3 bucket.
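For reference, here is a minimal sketch of a handler that reads the CSVs straight from S3 and writes the merged result to another bucket with boto3 (preinstalled in the Lambda Python runtime). The bucket names and output key are placeholders, and it assumes the function is only invoked once all input files are in place:

```python
import io
import pandas as pd

# Hypothetical bucket names -- replace with your own.
SOURCE_BUCKET = "my-input-bucket"
DEST_BUCKET = "my-master-bucket"

def merge_csv_frames(frames):
    """Concatenate DataFrames that share the same columns."""
    return pd.concat(frames, ignore_index=True)

def handler(event, context):
    # boto3 ships with the Lambda Python runtime.
    import boto3
    s3 = boto3.client("s3")

    # List every CSV object currently in the source bucket.
    listing = s3.list_objects_v2(Bucket=SOURCE_BUCKET)
    keys = [o["Key"] for o in listing.get("Contents", [])
            if o["Key"].endswith(".csv")]

    # Stream each object into pandas without touching local disk.
    frames = [pd.read_csv(io.BytesIO(
                  s3.get_object(Bucket=SOURCE_BUCKET, Key=k)["Body"].read()))
              for k in keys]

    # Upload the merged Master file to the destination bucket.
    body = merge_csv_frames(frames).to_csv(index=False).encode("utf-8-sig")
    s3.put_object(Bucket=DEST_BUCKET, Key="combined_csv.csv", Body=body)
```

Note that `list_objects_v2` returns at most 1,000 keys per call, so buckets with more input files would need pagination.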

asked Nov 15 '25 by a-n1

1 Answer

I would recommend that you do this using Amazon Athena.

  • CREATE EXTERNAL TABLE to define the input location in Amazon S3 and its format
  • CREATE TABLE AS to define the output location in Amazon S3 and its format (zipped CSV), with a query (e.g. SELECT * FROM input_table)

This way, there is no need to download, process, and upload the files; it is all done by Amazon Athena. Plus, if the input files are compressed, the cost is lower, because Athena charges based upon the amount of data scanned.

You could call Amazon Athena from the AWS Lambda function. Just make sure it only calls Athena after all the input files are in place.
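As a sketch of what that could look like, the two statements above can be submitted from Lambda with boto3's Athena client. Every name here (database, tables, columns, S3 paths) is a placeholder to adapt to your data:

```python
# Define the input CSVs already sitting in S3 as a queryable table
# (placeholder columns -- list your real column names and types).
CREATE_INPUT_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS input_table (
    col_a string,
    col_b string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-input-bucket/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# CTAS: write the merged result as delimited text in the output bucket.
CREATE_MASTER_TABLE = """
CREATE TABLE master_table
WITH (format = 'TEXTFILE',
      field_delimiter = ',',
      external_location = 's3://my-master-bucket/master/')
AS SELECT * FROM input_table
"""

# Placeholder names -- substitute your own database and results path.
DATABASE = "csv_merge_db"
RESULTS_LOCATION = "s3://my-master-bucket/athena-results/"

def run(query):
    """Submit one statement to Athena and return its execution id."""
    # boto3 ships with the Lambda Python runtime.
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULTS_LOCATION},
    )
    return response["QueryExecutionId"]
```

`start_query_execution` is asynchronous, so the Lambda function would poll `get_query_execution` (or use a second trigger) before treating the output as complete.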

answered Nov 18 '25 by John Rotenstein

