
Safest way to read missing dates with pandas read_csv, given the bug that turns a blank space into today's date

Tags: python, pandas

Python 2.7, pandas 0.13

What is the safest way to read a CSV and convert a column to dates? I noticed that in my case, a whitespace-only value in the date column was converted to today's date. Why?

Here's my CSV data:

fake_file = StringIO.StringIO("""case,opdate,
7,10/18/2006,
7,10/18/2008,
621, ,""")

Here's my code:

df = pd.read_csv('path.csv', parse_dates=['opdate'])

This tragically fills in the whitespace with today's date!

df = pd.read_csv('path.csv', parse_dates=['opdate'], na_values=' ')

This works, but do I really have to know that the blank is always ' ', rather than, say, '' or 'null'?

What is the safest way to convert dates and keep the nulls (especially when the null isn't a consistent value)?

Chet Meinzer asked Dec 07 '25 02:12


2 Answers

One way is to pass a different date parser to read_csv (I threw in a null too):

fake_file = StringIO.StringIO("""case,opdate,
7,null,
7,10/18/2008,
621, ,""")

In [11]: parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y', coerce=True)

In [12]: pd.read_csv(fake_file, parse_dates=['opdate'], date_parser=parser)
Out[12]:
   case     opdate  Unnamed: 2
0     7        NaT         NaN
1     7 2008-10-18         NaN
2   621        NaT         NaN

[3 rows x 3 columns]

Another option is to convert to dates after the fact using to_datetime:

In [21]: df = pd.read_csv(fake_file)

In [22]: pd.to_datetime(df.opdate, format='%m/%d/%Y')
ValueError: time data 'null' does not match format '%m/%d/%Y'

In [23]: pd.to_datetime(df.opdate, format='%m/%d/%Y', coerce=True)
Out[23]:
0          NaT
1   2008-10-18
2          NaT
Name: opdate, dtype: datetime64[ns]

In [24]: df['opdate'] = pd.to_datetime(df.opdate, format='%m/%d/%Y', coerce=True)
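
For readers on newer versions, here is a minimal sketch of the same after-the-fact conversion. It assumes Python 3 and a recent pandas, where the coercion keyword of to_datetime is errors='coerce'; neither assumption comes from the thread itself, which targets Python 2.7 and pandas 0.13. Reading the column as strings first is my own choice, so that nothing is guessed at parse time.

```python
import io

import pandas as pd

# Same sample data as in the answer; io.StringIO is the Python 3
# counterpart of StringIO.StringIO.
fake_file = io.StringIO("""case,opdate,
7,null,
7,10/18/2008,
621, ,""")

# Read the date column as plain strings so read_csv does no date guessing.
df = pd.read_csv(fake_file, dtype={"opdate": str})

# errors="coerce" turns every value that does not match the format
# (the 'null' marker and the single space here) into NaT.
df["opdate"] = pd.to_datetime(df["opdate"], format="%m/%d/%Y", errors="coerce")

print(df["opdate"])
```

Giving an explicit format means a blank or malformed value can only become NaT, never a guessed date. On pandas 0.13, as used in this thread, the equivalent keyword is coerce=True.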

I think the fact that both to_datetime and read_csv convert blanks/spaces to today's date is definitely a bug...

Andy Hayden answered Dec 08 '25 14:12


You can specify NA values using the na_values argument to read_csv:

fake_file = StringIO.StringIO("""case,opdate,
7,10/18/2006,
7,10/18/2008,
621, ,""")

df = pd.read_csv(fake_file, parse_dates=[1], na_values={'opdate': ' '})

Output:

   case     opdate  Unnamed: 2
0     7 2006-10-18         NaN
1     7 2008-10-18         NaN
2   621        NaT         NaN
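
If the blank marker is not the same in every file, na_values also accepts a list of strings per column. The sketch below assumes Python 3 and a recent pandas, and the list of markers is a hypothetical example rather than anything taken from the question's data.

```python
import io

import pandas as pd

# Same sample data as above, written with the Python 3 io.StringIO.
fake_file = io.StringIO("""case,opdate,
7,10/18/2006,
7,10/18/2008,
621, ,""")

# Hypothetical set of markers that should all count as "missing".
# Adjust this list to whatever actually shows up in your files.
null_markers = [" ", "", "null", "NULL", "n/a"]

# The dict form limits the extra markers to the opdate column only.
df = pd.read_csv(
    fake_file,
    parse_dates=["opdate"],
    na_values={"opdate": null_markers},
)

print(df)
```

This still requires knowing the possible markers in advance, so it suits files whose blanks come from a small known set.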

Marius answered Dec 08 '25 14:12


