I'd like to transform this:
In [4]: df
Out[4]:
    label
0  (a, e)
1  (a, d)
2    (b,)
3  (d, e)
to this:
   a  b  c  d  e
0  1  0  0  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1
As you can see, the columns 'a', 'b', 'c', 'd', 'e' are predefined: c is empty but must still exist.
I tried several things, such as df['label'].str.join('|').str.get_dummies(), at first without the full column list, just to get dummies from cells that hold multiple values. Now I want to add the predefined columns to the result.
Thank you for your help!
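A good approach is MultiLabelBinarizer from sklearn:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per label seen in the data
print(pd.DataFrame(mlb.fit_transform(df['label']), columns=mlb.classes_, index=df.index))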
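As written, the snippet above only produces columns for labels that actually occur, so 'c' would be missing. A minimal sketch of one way to get the predefined columns, assuming the sample frame from the question (the construction of df here is my reconstruction of it), is to pass the full label list through the classes parameter of MultiLabelBinarizer:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed reconstruction of the question's sample data
df = pd.DataFrame({'label': [('a', 'e'), ('a', 'd'), ('b',), ('d', 'e')]})

# classes fixes both the set and the order of the output columns,
# so 'c' gets an all-zero column even though it never appears
mlb = MultiLabelBinarizer(classes=list('abcde'))
out = pd.DataFrame(mlb.fit_transform(df['label']),
                   columns=mlb.classes_, index=df.index)
print(out)
```

This keeps the column order under your control instead of relying on the sorted order of whatever labels happen to be present.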
Build a new DataFrame from the tuples, then stack it and call get_dummies. Aggregating with any over the original index level collapses the dummies back to one row per original row.
pd.get_dummies(pd.DataFrame([*df.label], index=df.index).stack()).groupby(level=0).any().astype(int)
   a  b  d  e
0  1  0  0  1
1  1  0  1  0
2  0  1  0  0
3  0  0  1  1
Because the columns are predefined, we can reindex to the full list and fill the missing column with 0.
res = pd.get_dummies(pd.DataFrame([*df.label], index=df.index).stack()).groupby(level=0).any()
res = res.reindex(columns=list('abcde'), fill_value=0).astype(int)
# a b c d e
#0 1 0 0 0 1
#1 1 0 0 1 0
#2 0 1 0 0 0
#3 0 0 0 1 1
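For completeness, here is a self-contained sketch of the same idea that swaps the intermediate DataFrame and stack for Series.explode. This is an alternative I am suggesting, not part of the answer above, and the df construction is an assumed reconstruction of the question's data:

```python
import pandas as pd

# Assumed reconstruction of the question's sample data
df = pd.DataFrame({'label': [('a', 'e'), ('a', 'd'), ('b',), ('d', 'e')]})

# explode gives one row per label, repeating the original index value
exploded = df['label'].explode()

# One-hot encode each label, then collapse back to one row per original index
dummies = pd.get_dummies(exploded).groupby(level=0).max()

# Reindex to the predefined columns; the absent label 'c' is filled with 0
res = dummies.reindex(columns=list('abcde'), fill_value=0).astype(int)
print(res)
```

The groupby(level=0).max() step plays the same role as the any aggregation: a label column is 1 for a row if any of its exploded entries had that label.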