I have found some strange behavior when using the Pandas read_excel function. I have a contrived example of a spreadsheet that looks like:
Name    age weight
Bob     35   70
Alice   15   ...
Sam     ...  65
Susan   92   63
Harold  42   ...
Missing values are indicated by '...' and I would like them read as NaN. I would also like to convert the weights to grams. My first attempt was:
df=pd.read_excel('test.xls',na_values='...',converters={'weight':lambda y:y*1000})
Unfortunately, this applies the converter function first, so that Susan and Harold's missing weights become not three dots but 3000 dots, which does not match the given "na_value" entry and are thus not turned into NaN. This is highly unexpected.
Is there a reason this is the best behavior for this function?
Did I do something obviously wrong in my implementation of it?
What's the best way to get the behavior I want?
The converters are run before na_values are applied, which is why you're seeing this behavior.
Just add a condition in your converter to catch the ... instances in weights -
 like adding a hand-rolled na_values:  
df = pd.read_excel("test.xls", 
                   na_values="...", 
                   converters={"weight":lambda w: w*1000 if isinstance(w, int) else None})
print(df)
     Name   age   weight
0     Bob  35.0  70000.0
1   Alice  15.0      NaN
2     Sam   NaN  65000.0
3   Susan  92.0  63000.0
4  Harold  42.0      NaN
Keep the actual na_values argument to ensure ... values in the age field are also converted.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With