
How to make pandas.Series.str.get_dummies() report NaN?

I have data in a CSV-like file in which a single field may hold multiple values. I use get_dummies() to get an overview of a column: which values occur and how often, like a histogram for nominal data. I also want to see the missing (NaN) values, but my code hides them.

I am using: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html

I can't use: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html even though its dummy_na parameter would solve the problem.

Reason: I need the sep parameter.

To illustrate the difference:

import pandas
data = pandas.read_csv("testdata.csv", sep=";")
data["a"].str.get_dummies(",").sum()  # splits on ",", but reports no NaN values
pandas.get_dummies(data["a"], dummy_na=True).sum()  # reports NaN, but does not split on ","

Data:

a;b
Test,Tes;
;a
Tes;a
T;b

I would expect:

T           1
Tes         2
Test        1
NaN         1

But the output is:

T       1
Tes     2
Test    1
dtype: int64

or

T           1
Tes         1
Test,Tes    1
NaN         1
dtype: int64

I am happy to use another function instead. Maybe the .str part is the problem; I have not quite figured out what it does.

Sindbad asked Nov 30 '25 07:11



1 Answer

First replace the missing values with a placeholder string using Series.fillna, then rename that placeholder to NaN in the index of the result using rename:

import numpy as np
print(data["a"].fillna('Missing').str.get_dummies(",").sum().rename({'Missing': np.nan}))
NaN     1
T       1
Tes     2
Test    1
dtype: int64
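
For anyone who wants to try this without creating testdata.csv, here is a minimal self-contained sketch. It reads the sample data from the question out of an in-memory string and applies the fillna/rename approach. As the question invited other functions, it also shows a possible alternative based on split, explode and value_counts(dropna=False). The variable names (csv_text, placeholder, counts, alt_counts) are made up for illustration, and the placeholder is assumed not to occur as a real value in the column and not to contain the separator.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, held in memory instead of testdata.csv.
csv_text = "a;b\nTest,Tes;\n;a\nTes;a\nT;b\n"
data = pd.read_csv(io.StringIO(csv_text), sep=";")

# Assumption: this placeholder never appears as a genuine value in column "a".
placeholder = "Missing"

# Approach from the answer: fill NaN with a placeholder, build the indicator
# columns, sum them, then rename the placeholder label back to NaN.
counts = (
    data["a"]
    .fillna(placeholder)
    .str.get_dummies(sep=",")
    .sum()
    .rename({placeholder: np.nan})
)
print(counts)

# Possible alternative: split each field into a list, put every item on its
# own row, and count the values while keeping NaN.
alt_counts = data["a"].str.split(",").explode().value_counts(dropna=False)
print(alt_counts)
```

The two results agree on this sample, but they are not equivalent in general: get_dummies produces 0/1 indicators, so a value repeated inside one field is counted once per row, whereas the explode variant counts every occurrence.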
jezrael answered Dec 02 '25 20:12



