I understand that pandas can read and write Parquet files using different backends: pyarrow and fastparquet.
I have a Conda installation with the Intel distribution and "it works": I can use pandas.DataFrame.to_parquet. However, I do not have pyarrow installed, so I guess that fastparquet is used (which I cannot find either).
Is there a way to identify which backend is used?
pandas provides a convenient Parquet interface. By default it uses the PyArrow library to write Parquet files, falling back to fastparquet if PyArrow is not available, but you can also write Parquet files directly from PyArrow.
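To answer the "which backend is used?" part, one option is to check which of the two libraries can actually be imported in the active environment. The sketch below uses only the standard library and mirrors the order pandas tries with engine="auto" (pyarrow first, then fastparquet). The function names are made up for illustration, and this is only an approximation: pandas also checks minimum versions, so a library that is installed but too old may still be skipped.

```python
import importlib.util

# Order pandas tries when engine="auto" (the default): pyarrow, then fastparquet.
AUTO_ORDER = ("pyarrow", "fastparquet")


def installed_parquet_engines():
    """Return the Parquet engines importable in this environment, in pandas' order."""
    return [name for name in AUTO_ORDER if importlib.util.find_spec(name) is not None]


def likely_auto_engine():
    """Return the engine pandas would most likely pick with engine="auto", or None.

    Hypothetical helper: it does not replicate pandas' minimum-version checks.
    """
    engines = installed_parquet_engines()
    return engines[0] if engines else None


if __name__ == "__main__":
    print("Installed engines:", installed_parquet_engines())
    print("Likely 'auto' choice:", likely_auto_engine())
```

If this reports pyarrow even though you never installed it yourself, it was probably pulled in as a dependency of another package in the environment.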
DataFrame.to_parquet() function
The to_parquet() function writes a DataFrame to a file in the binary Parquet format. Its path parameter is a file path or a root directory path; it is used as the root directory when writing a partitioned dataset.
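If you would rather not guess, you can pass the engine explicitly so the choice is never implicit. Here is a minimal sketch, assuming pandas is installed; write_parquet_explicit and demo are illustrative names, not part of pandas. The demo writes a tiny frame once per installed engine and skips engines that are missing or that pandas refuses to load.

```python
import importlib.util
import os
import tempfile


def write_parquet_explicit(df, path, engine):
    """Write df to path with a named engine ("pyarrow" or "fastparquet")."""
    # pandas raises ImportError if the requested engine cannot be used.
    df.to_parquet(path, engine=engine)
    return path


def demo():
    """Write a small frame with each installed engine; return the engines that worked."""
    if importlib.util.find_spec("pandas") is None:
        return []  # nothing to demonstrate without pandas
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
    used = []
    with tempfile.TemporaryDirectory() as tmp:
        for engine in ("pyarrow", "fastparquet"):
            if importlib.util.find_spec(engine) is None:
                continue  # engine not installed
            path = os.path.join(tmp, f"demo_{engine}.parquet")
            try:
                write_parquet_explicit(df, path, engine)
            except ImportError:
                continue  # installed, but pandas cannot use it (e.g. too old)
            if os.path.getsize(path) > 0:
                used.append(engine)
    return used


if __name__ == "__main__":
    print("Engines that wrote a file:", demo())
```

An empty list here means neither backend is usable, in which case a plain to_parquet() call would fail as well.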
Reportedly, pyarrow is faster than fastparquet, so it is little wonder that it is the default engine used in Dask.
To install both backends, run these two commands in a Linux shell (bash):
pip install pyarrow
pip install fastparquet