
Write to parquet row by row in Python

I receive messages in an async loop, and from each message I parse a row, which is a dictionary. I would like to write these rows to a Parquet file. To do this, I do the following:

import pyarrow as pa
import pyarrow.parquet as pq

fields = [('A', pa.float64()), ('B', pa.float64()), ('C', pa.float64()), ('D', pa.float64())]
schema = pa.schema(fields)
pqwriter = pq.ParquetWriter('sample.parquet', schema=schema, compression='gzip')

# async cycle starts here
async for message in messages:
    # from_pydict expects each column as a list; this assumes message[1..4] are scalar floats
    row = {'A': [message[1]], 'B': [message[2]], 'C': [message[3]], 'D': [message[4]]}
    table = pa.Table.from_pydict(row, schema=schema)
    pqwriter.write_table(table)  # one call per row
# end of async cycle
pqwriter.close()

Everything works perfectly, but the resulting Parquet file is about 5 MB, whereas writing the same data to a CSV file gives a file of about 200 KB. I have checked that the data types are the same (the CSV columns are floats and the Parquet columns are floats).

Why is my Parquet file so much larger than the CSV with the same data?

Artem Alexandrov asked Oct 15 '25 23:10



2 Answers

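Parquet is a columnar format that is optimized for writing batches of data. It is not meant for writing data row by row: every write_table call produces a separate row group, and each row group carries its own column chunks, page headers, and metadata, so a one-row row group is mostly overhead and compresses poorly.

That makes it a poor fit for your use case as written. You may want to write intermediate rows in a more suitable format (say Avro or CSV) and then convert the data to Parquet in batches.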

0x26res answered Oct 18 '25 13:10



I have achieved the desired results as follows:

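import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 1_000_000  # number of rows to buffer before each write
data = []
fields = ...  # list of tuples
schema = pa.schema(fields)

with pq.ParquetWriter('my_parquet', schema=schema) as writer:
    # async cycle starts here
    async for message in messages:
        row = ...  # dict with structure as in fields
        data.append(row)  # append, not extend: extend would add only the dict's keys

        if len(data) > chunksize:
            table = pa.Table.from_pandas(pd.DataFrame(data), schema=schema)
            writer.write_table(table)
            data = []
    # end of async cycle
    if data:  # flush the rows left over after the loop
        table = pa.Table.from_pandas(pd.DataFrame(data), schema=schema)
        writer.write_table(table)
# the with block closes the writer, so no explicit close() is needed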

This code snippet does exactly what I need.

Artem Alexandrov answered Oct 18 '25 11:10



