
Julia load dataframe from s3 csv file

I'm having trouble finding an example to follow online for this simple use-case:

Load a CSV file from an s3 object location to julia DataFrame.

Here is what I tried that didn't work:

using AWSS3, DataFrames, CSV

filepath = S3Path("s3://muh-bucket/path/data.csv")

CSV.File(filepath) |> DataFrame            # fails

# but I am able to stat the file
stat(filepath)

#=
Status(  mode = -rw-rw-rw-,
  ...etc  
  size = 2141032 (2.0M),
  blksize = 4096 (4.0K),
  blocks = 523,
  mtime = 2021-09-01T23:55:26,
  ...etc
=#

I can also read the file to a string object locally:

data_as_string = String(AWSS3.read(filepath))
#"column_1\tcolumn_2\tcolumn_3\t...etc..."

My AWS config is in order; I can access the object from Julia locally.

How do I get this into a DataFrame?

asked Sep 06 '25 21:09 by Merlin

1 Answer

Thanks to help from the nice people on the Julia Slack channel (#data).

using AWSS3, CSV, DataFrames

# Read the S3 object into a raw byte vector
bytes = AWSS3.read(S3Path("s3://muh-bucket/path/data.csv"))

typeof(bytes)
# Vector{UInt8} (alias for Array{UInt8, 1})

# CSV.read accepts the byte vector directly
df = CSV.read(bytes, DataFrame)

Bingo, I'm in business. The CSV.jl maintainer mentions that S3Path types used to work when passed to CSV.read, so perhaps this will be even simpler in the future.
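For a quick local check of the key idea, note that CSV.read accepts a raw byte vector, which is exactly what AWSS3.read returns. A minimal sketch, simulating the S3 bytes with an in-memory tab-delimited string (the sample data here is made up; the question's file appears tab-delimited too):

```julia
using CSV, DataFrames

# Simulate the Vector{UInt8} that AWSS3.read would return
bytes = Vector{UInt8}("column_1\tcolumn_2\na\t1\nb\t2\n")

# CSV.jl auto-detects the tab delimiter; pass delim='\t' to be explicit
df = CSV.read(bytes, DataFrame)

size(df)  # (2, 2)
```

The same `CSV.read(bytes, DataFrame)` call works unchanged on the bytes fetched from S3.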

Helpful SO post for getting AWS configs in order

answered Sep 11 '25 01:09 by Merlin