Is there a way to merge multiple CSV files uploaded to AWS S3 bucket using Python?

I need to set up an AWS Lambda function that triggers when new CSV files are uploaded to an S3 bucket, merges them into one Master file (they will all have the same number of columns and the same column names), and then uploads that Master file to another S3 bucket.

I'm using Python for the Lambda function. I created a zip archive containing my Lambda function and the dependencies I used (Pandas and NumPy) and uploaded that.

Currently I have to include the CSV files I want merged in the zip archive itself; the function merges those files, and the output (the Master file) only shows up in the logs when I check CloudWatch.

I don't know how to link my code to the S3 buckets for input and output.

This is for an app I'm working on.

Here's the Python code I'm using:

    import glob
    import pandas as pd

    def handler(event, context):
        # Find all CSV files bundled alongside the function code
        # (glob pattern matching on the 'csv' extension).
        extension = 'csv'
        all_filenames = glob.glob('*.{}'.format(extension))

        # Combine all the files in the list into one DataFrame.
        combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

        # Export to CSV; /tmp is the only writable path in Lambda.
        combined_csv.to_csv("/tmp/combined_csv.csv", index=False, encoding='utf-8-sig')
        with open("/tmp/combined_csv.csv", "r") as f:
            print(f.read())

I would like to stop manually packaging the CSV files in the same zip archive as my Python script every time, and to have the output Master CSV file land in a separate S3 bucket.
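For reference, here is a minimal sketch of a handler that reads the CSVs straight from S3 and writes the merged result to another bucket with boto3 (preinstalled in the Lambda Python runtime). The bucket names and output key are placeholders, and it assumes the function is only invoked once all input files are in place:

```python
import io
import pandas as pd

# Hypothetical bucket names -- replace with your own.
SOURCE_BUCKET = "my-input-bucket"
DEST_BUCKET = "my-master-bucket"

def merge_csv_frames(frames):
    """Concatenate DataFrames that share the same columns."""
    return pd.concat(frames, ignore_index=True)

def handler(event, context):
    # boto3 ships with the Lambda Python runtime.
    import boto3
    s3 = boto3.client("s3")

    # List every CSV object currently in the source bucket.
    listing = s3.list_objects_v2(Bucket=SOURCE_BUCKET)
    keys = [o["Key"] for o in listing.get("Contents", [])
            if o["Key"].endswith(".csv")]

    # Stream each object into pandas without touching local disk.
    frames = [pd.read_csv(io.BytesIO(
                  s3.get_object(Bucket=SOURCE_BUCKET, Key=k)["Body"].read()))
              for k in keys]

    # Upload the merged Master file to the destination bucket.
    body = merge_csv_frames(frames).to_csv(index=False).encode("utf-8-sig")
    s3.put_object(Bucket=DEST_BUCKET, Key="combined_csv.csv", Body=body)
```

Note that `list_objects_v2` returns at most 1,000 keys per call, so buckets with more input files would need pagination.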

asked Nov 15 '25 by a-n1

1 Answer

I would recommend that you do this using Amazon Athena.

  • CREATE EXTERNAL TABLE to define the input location in Amazon S3 and its format
  • CREATE TABLE AS to define the output location in Amazon S3 and its format (zipped CSV), with a query (e.g. SELECT * FROM input_table)

This way, there is no need to download, process, and upload the files; it is all done by Amazon Athena. Plus, if the input files are compressed, the cost is lower, because Athena charges based upon the amount of data scanned.

You could call Amazon Athena from the AWS Lambda function. Just make sure it only calls Athena after all the input files are in place.
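As a sketch of what that could look like, the two statements above can be submitted from Lambda with boto3's Athena client. Every name here (database, tables, columns, S3 paths) is a placeholder to adapt to your data:

```python
# Define the input CSVs already sitting in S3 as a queryable table
# (placeholder columns -- list your real column names and types).
CREATE_INPUT_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS input_table (
    col_a string,
    col_b string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-input-bucket/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# CTAS: write the merged result as delimited text in the output bucket.
CREATE_MASTER_TABLE = """
CREATE TABLE master_table
WITH (format = 'TEXTFILE',
      field_delimiter = ',',
      external_location = 's3://my-master-bucket/master/')
AS SELECT * FROM input_table
"""

# Placeholder names -- substitute your own database and results path.
DATABASE = "csv_merge_db"
RESULTS_LOCATION = "s3://my-master-bucket/athena-results/"

def run(query):
    """Submit one statement to Athena and return its execution id."""
    # boto3 ships with the Lambda Python runtime.
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULTS_LOCATION},
    )
    return response["QueryExecutionId"]
```

`start_query_execution` is asynchronous, so the Lambda function would poll `get_query_execution` (or use a second trigger) before treating the output as complete.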

answered Nov 18 '25 by John Rotenstein

