When using read_csv with pandas, if I ask for a given column to be converted to a specific type, a single malformed value aborts the whole operation, with no indication of which value caused the failure.
For example, running something like:
import pandas as pd
import numpy as np
df = pd.read_csv('my.csv', dtype={ 'my_column': np.int64 })
Will lead to a stack trace ending with the error:
ValueError: cannot safely convert passed user dtype of <i8 for object dtyped data in column ...
If the error message included the row number or the offending value, I could add that value to the list of known NaN values, but as it stands there is nothing I can act on.
Is there a way to tell the parser to ignore conversion failures and return np.nan in those cases?
P.S.: Funnily enough, after parsing without any type hint (no dtype argument), df['my_column'].value_counts() seems to infer the types correctly and handle np.nan automatically, even though the actual dtype of the series is a generic object, which fails on almost every plotting and statistical operation.
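One way to find the offending values without relying on the error message is to read the column as plain strings and then check which entries fail a numeric conversion. The following is only a sketch: the inline CSV text and the 'unknown' marker are made-up stand-ins for my.csv and its malformed value, and keep_default_na=False is used so that nothing is silently turned into NaN before the check.

```python
import io

import pandas as pd

# Hypothetical sample standing in for my.csv; 'unknown' plays the malformed value.
csv_text = "my_column,other\n1,a\n2,b\nunknown,c\n4,d\n"

# Read the column as strings so parsing cannot fail on a bad value.
raw = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"my_column": str},
    keep_default_na=False,
)

# errors="coerce" turns unparseable strings into NaN instead of raising.
converted = pd.to_numeric(raw["my_column"], errors="coerce")

# Entries that became NaN are the ones that could not be converted.
bad = raw.loc[converted.isna(), "my_column"]
print(bad)  # index label = row position in the data, value = offending string
```

The values listed in bad are candidates for the na_values argument of read_csv.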
Thanks to the comments, I realised that there is no NaN for integers, which was very surprising to me. So I switched to converting to float instead:
import pandas as pd
import numpy as np
df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 })
This gave me an understandable error message that included the value that failed to convert, so I could add that value to na_values:
df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 }, na_values=['n/a'])
This way I could finally import the CSV in a form that works with visualisation and statistical functions:
>>> df['session_planned_os'].dtype
dtype('float64')
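If the marker string is also a legitimate value in other columns, na_values can be given as a dict so that it only applies to one column. This is a sketch under assumptions: the inline CSV, the 'comment' column and the 'missing' marker are invented for illustration and are not from my actual file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: 'missing' is a bad value in my_column but valid text in comment.
csv_text = "my_column,comment\n1.5,ok\nmissing,missing\n3,fine\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"my_column": np.float64},
    # Per-column NA markers: only my_column treats 'missing' as NaN.
    na_values={"my_column": ["missing"]},
)

print(df["my_column"].dtype)
print(df)
```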
Once you are able to spot the right na_values, you can remove the dtype argument from read_csv. Type inference will now happen correctly:
df = pd.read_csv('my.csv', na_values=['n/a'])
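As for the original question of ignoring failures and getting np.nan: one possible approach, not tried on my real file, is to let read_csv load the column as-is and convert it afterwards with pd.to_numeric(errors='coerce'). If integers are really needed, recent pandas versions also offer the nullable 'Int64' dtype, which can hold missing values. The inline CSV and the 'unknown' marker below are assumptions for the sake of a runnable sketch.

```python
import io

import pandas as pd

# Hypothetical sample standing in for my.csv.
csv_text = "my_column\n1\n2\nunknown\n4\n"

# No dtype argument: the column is loaded as generic text because of 'unknown'.
df = pd.read_csv(io.StringIO(csv_text))

# Convert after the fact; anything unparseable becomes NaN (result is float64).
df["my_column"] = pd.to_numeric(df["my_column"], errors="coerce")

# Optional: nullable integer dtype, where the missing entry is shown as <NA>.
as_int = df["my_column"].astype("Int64")

print(df["my_column"].dtype)
print(as_int)
```

The drawback compared with na_values is that every bad value is silently dropped, so typos in the data go unnoticed unless the NaN entries are inspected.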