Repartitioning pyarrow tables by size by use of pyarrow and writing into several parquet files?

Question

As the title states, I would like to repartition a pyarrow table by size (or row group size) using pyarrow, and write it into several parquet files.

I have looked at the pyarrow documentation and found the chapter on partitioned datasets, which seemed like a promising direction. Unfortunately, it shows that partitioning by column content is possible, but not by size (or row group size).

So, starting from one table, how can I control the writing step so that several files are written, each with a controlled size of x MB (or a controlled row group size)?

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

file = 'example.parquet'
file_res = 'example_res'

# Generate a random df
df = pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
table = pa.Table.from_pandas(df)

# With this command, I can write a single parquet file that contains 2 row groups.
pq.write_table(table, file, version='2.0', row_group_size=50000)

# I can read it back and try to write it as a partitioned dataset, but a single parquet file is then written.
table_new = pq.ParquetFile(file).read()
pq.write_to_dataset(table_new, file_res)

Thanks for any help! Bests,

0x26res · Accepted Answer

Looking at the doc for write_to_dataset and ParquetWriter, I can't think of anything obvious.

But you could assign a bucket to each row and partition your data based on the bucket, for example:

df = (
    pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
    .assign(bucket=lambda x: x.index // 5000)
)
table = pa.Table.from_pandas(df)
# Write the table that carries the 'bucket' column (file_res as defined in the question)
pq.write_to_dataset(table, file_res, partition_cols=['bucket'])

And you'll get the following file structure:

bucket=0
bucket=1
bucket=10
bucket=11
bucket=12
bucket=13
bucket=14
bucket=15
bucket=16
bucket=17
bucket=18
bucket=19
bucket=2
bucket=3
bucket=4
bucket=5
bucket=6
bucket=7
bucket=8
bucket=9

This assumes your df.index starts at zero and increases by one (0, 1, 2, 3, ...).

Tags: python, parquet, pyarrow, partition

pierre_j
