I'm trying to write a large pandas DataFrame (shape 4247x10) to Parquet.
Nothing special, just the following code:
df_base = read_from_google_storage()
df_base.to_parquet(courses.CORE_PATH,
                   engine='pyarrow',
                   compression='gzip',
                   partition_cols=None)
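For reference, here is a self-contained version of what I'm running, with the Google Storage read swapped for a synthetic dataframe of the same shape and a local output path (both are just placeholders; I can't say whether random data alone reproduces the crash, since my real columns may have different dtypes):

import numpy as np
import pandas as pd

# Placeholder for read_from_google_storage(): a dataframe with the same shape (4247x10)
df_base = pd.DataFrame(np.random.rand(4247, 10),
                       columns=['col_{}'.format(i) for i in range(10)])

# Same to_parquet call as above; 'core.parquet' stands in for courses.CORE_PATH
df_base.to_parquet('core.parquet',
                   engine='pyarrow',
                   compression='gzip',
                   partition_cols=None)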
I tried different compression codecs and different partition_cols values, but it fails regardless.
I noticed it works fine with small dataframes (fewer than about 1000 rows x 10 columns), and it also completes when I'm debugging and give it enough time, but otherwise I get this error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Libs I'm using:
pandas==0.25.3
pyarrow==0.15.1
The issue might be related to this: https://issues.apache.org/jira/browse/PARQUET-1345 but I'm not sure.
Here is the workaround I found:
import pandas as pd
from pyarrow import Table
from pyarrow import parquet as pq

df_base = pd.read_csv('big_df.csv')
# Convert to an Arrow table on a single thread instead of letting pandas do it
table = Table.from_pandas(df_base, nthreads=1)
print(table.columns)
print(table.num_rows)
pq.write_table(table, courses.CORE_PATH, compression='GZIP')
I'm not sure exactly why it fails, but setting nthreads=1 helps avoid the SIGSEGV (segmentation fault).
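As a quick sanity check, the written file can be read back with pandas to confirm it round-trips (this assumes the same courses.CORE_PATH used above and the 4247x10 shape from the question):

import pandas as pd

# Read the Parquet file back and verify the shape survived the write
df_check = pd.read_parquet(courses.CORE_PATH, engine='pyarrow')
print(df_check.shape)  # expected: (4247, 10)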