Load Pandas Dataframe to S3 passing s3_additional_kwargs

Please excuse my ignorance in this area!

I'm looking to upload a dataframe to S3, but I need to pass 'ACL':'bucket-owner-full-control'.

import pandas as pd
import s3fs

# This filesystem carries the ACL, but to_parquet below never actually uses it;
# pandas builds its own S3 connection internally for the 's3://' path
fs = s3fs.S3FileSystem(anon=False, s3_additional_kwargs={'ACL': 'bucket-owner-full-control'})

df = pd.DataFrame()
df['test'] = [1, 2, 3]

df.to_parquet('s3://path/to/file/df.parquet', compression='gzip')

I have managed to get around this by loading the DataFrame into a PyArrow table and then writing it like:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)

# write_to_dataset accepts the s3fs filesystem, so the ACL is applied
pq.write_to_dataset(table=table,
                    root_path='s3://path/to/file/',
                    filesystem=fs)

But this feels hacky, and there must be a way to pass the ACL in the first example.

asked Sep 06 '25 by George Cooper-Pearson

1 Answer

With Pandas 1.2.0 and later, there is a storage_options argument on to_parquet, as mentioned here.
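For example (a minimal sketch, assuming pandas >= 1.2.0 with the pyarrow engine and s3fs installed; the bucket path is a placeholder):

import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3]})

# storage_options is passed through fsspec to the s3fs filesystem,
# so the ACL kwarg reaches S3FileSystem directly
df.to_parquet(
    's3://path/to/file/df.parquet',
    compression='gzip',
    storage_options={'anon': False,
                     's3_additional_kwargs': {'ACL': 'bucket-owner-full-control'}},
)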

If you are stuck with Pandas < 1.2.0 (1.1.3 in my case), this trick helped:

import s3fs

# The same options that storage_options would take on pandas >= 1.2.0
storage_options = dict(anon=False, s3_additional_kwargs=dict(ACL="bucket-owner-full-control"))

fs = s3fs.S3FileSystem(**storage_options)
# 'filesystem' is forwarded by the pyarrow engine, so the write goes through fs
df.to_parquet('s3://foo/bar.parquet', filesystem=fs)
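If the filesystem keyword isn't picked up by your engine version, another workaround is to hand to_parquet an open s3fs handle so the ACL set on the filesystem applies (a sketch, assuming your pandas version accepts a file-like object, which recent releases do):

import s3fs

fs = s3fs.S3FileSystem(anon=False,
                       s3_additional_kwargs={'ACL': 'bucket-owner-full-control'})

# Anything written through this handle inherits the filesystem's ACL kwargs
with fs.open('s3://foo/bar.parquet', 'wb') as f:
    df.to_parquet(f, compression='gzip')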
answered Sep 09 '25 by Sergey Vasilyev