
Save date column with NAT(null) from pandas to parquet

I need to read nullable date values stored in integer format ('YYYYMMDD') into pandas and then save the dataframe to Parquet with the column typed as Date32[Day], so that the Athena Glue Crawler classifier recognizes it as a date. The code below fails when saving the column to Parquet:

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
data_df.to_parquet(r'my_path', engine='pyarrow')

I receive the following error:

Traceback (most recent call last):
  File "", line 123, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow\array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type datetime.date)

If I move the None value to the end of the date list, this works without any issue and pyarrow infers the date column as Date32[Day]. My guess is that because the pandas dtype of a dt.date column is object and the first value in the column is NaT (not a time), pyarrow cannot infer Date32[Day] from the dataframe or from a sample value, and infers the column as integer instead. What is a good way to save this column to Parquet as Date32[Day] without sorting the column values? Thanks.
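To make the guess above concrete, here is a minimal sketch that inspects what the column holds after `.dt.date`. The variable name `date_col` and the explicit `format="%Y%m%d"` argument are illustrative additions, not part of the original code.

```python
import datetime

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]

# Parse the strings, then keep only the date part, as in the code above.
date_col = pd.to_datetime(pd.Series(dates), format="%Y%m%d").dt.date

# The column is a generic object column, so pyarrow has to guess its type.
print(date_col.dtype)

# The first element is a missing value; the rest are datetime.date objects.
print(repr(date_col.iloc[0]))
print(isinstance(date_col.iloc[1], datetime.date))
```

Since the dtype carries no date information, the Arrow type has to be inferred from the values themselves, which fits the behaviour described above.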

Yun Ling asked Oct 16 '25 05:10



1 Answer

You are right: because the first value is NaT, pyarrow fails to infer the column type. One workaround is to format the dates as strings and replace NaT with an empty string before writing. I used the code below.

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date

# In addition, add this line to replace NaT with an empty string
# Change the strftime format as needed (this uses YYYY-MM-DD)
# Note: the column is then written to Parquet as strings, not Date32
data_df['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in data_df['date']]

data_df.to_parquet(r'my_path', engine='pyarrow')

I hope this works for you and the error is solved.

Abhay answered Oct 18 '25 03:10



