How should I convert NaN values into a categorical value based on a condition? I am getting an error while trying to convert the NaN values.
category gender sub-category title
health&beauty NaN makeup lipbalm
health&beauty women makeup lipstick
NaN NaN NaN lipgloss
My DataFrame looks like this. My function to convert NaN values in gender to a categorical value looks like this:
def impute_gender(cols):
    category=cols[0]
    sub_category=cols[2]
    gender=cols[1]
    title=cols[3]
    if title.str.contains('Lip') and gender.isnull==True:
        return 'women'
df[['category','gender','sub_category','title']].apply(impute_gender,axis=1)
When I run the code I get this error:
----> 7 if title.str.contains('Lip') and gender.isnull()==True:
8 print(gender)
9
AttributeError: ("'str' object has no attribute 'str'", 'occurred at index category')
Complete dataset: https://github.com/lakshmipriya04/py-sample
Or simply use loc, as an option 3 to @COLDSPEED's answer:
cond = (df['gender'].isnull()) & (df['title'].str.contains('lip'))
df.loc[cond, 'gender'] = 'women'
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
If we are dealing with NaN values, fillna can be one of the methods :-)
df.gender=df.gender.fillna(df.title.str.contains('lip').replace(True,'women'))
df
Out[63]:
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
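One caveat with the fillna line, as far as I can tell: replace(True, 'women') leaves False untouched, so a row with a missing gender and a title that does not contain 'lip' would be filled with False. Below is a minimal sketch of a variant that leaves such rows as NaN. The frame is a cut-down, hypothetical version of the sample data, and the 'shampoo' row is invented purely to show the non-matching case.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last row is invented to show a non-matching title.
df = pd.DataFrame({
    "gender": [np.nan, "women", np.nan, np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss", "shampoo"],
})

# map() turns True into 'women' and anything not in the dict (False) into NaN,
# so fillna only fills rows whose title contains 'lip'.
fill_values = df["title"].str.contains("lip").map({True: "women"})
df["gender"] = df["gender"].fillna(fill_values)
print(df)
```

With this variant the 'shampoo' row keeps its missing gender instead of picking up a boolean.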
Some things to note here -
1. If you only need two columns, calling apply over 4 columns is wasteful.
2. apply is wasteful and inefficient in general, because it is slow, uses a lot of memory, and offers no vectorisation benefits to you.
3. Inside apply you are dealing with scalars, so you do not use the .str accessor as you would on a pd.Series object. title is a plain string there, so "lip" in title is enough (a str has no contains method).
4. gender.isnull sounds completely wrong to the interpreter because gender is a scalar; it has no isnull attribute. pd.isnull(gender) works on scalars.
Option 1
np.where
import numpy as np
m = df.gender.isnull() & df.title.str.contains('lip')
df['gender'] = np.where(m, 'women', df.gender)
df
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
Which is not only fast, but simpler as well. If you're worried about case sensitivity, you can make your contains check case insensitive -
import re
m = df.gender.isnull() & df.title.str.contains('lip', flags=re.IGNORECASE)
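str.contains also accepts case=False, which gives a case-insensitive match without needing the re module, and na=False, which treats a missing title as a non-match instead of putting NaN into the mask. A small sketch on hypothetical data (the titles and the 'men' value are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: mixed-case title, an existing gender, and a missing title.
df = pd.DataFrame({
    "gender": [np.nan, "men", np.nan, np.nan],
    "title": ["Lip Balm", "lipstick", np.nan, "shampoo"],
})

# case=False ignores case; na=False makes a missing title count as "no match".
m = df["gender"].isnull() & df["title"].str.contains("lip", case=False, na=False)

# Fill only the rows selected by the mask; all other values are kept as they are.
df["gender"] = np.where(m, "women", df["gender"])
print(df)
```

Only the first row is filled here: the second already has a gender, the third has no title, and the fourth does not match.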
Option 2
Another alternative is using pd.Series.mask/pd.Series.where -
df['gender'] = df.gender.mask(m, 'women')
Or,
df['gender'] = df.gender.where(~m, 'women')
df
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
mask replaces the values where the condition m is True, while where replaces the values where its condition is False, which is why the second form passes ~m.
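If a row-wise function is still wanted, for instance because the rule grows beyond what a boolean mask expresses comfortably, the impute_gender function from the question could be repaired roughly as follows. This is a sketch rather than a recommendation: it treats each cell as a scalar, uses pd.isnull and the in operator, and returns the existing gender when the condition does not hold (the original fell through and returned None). The frame is reconstructed from the table in the question, and the lower-casing of the title is my own assumption about the intended match.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table shown in the question.
df = pd.DataFrame({
    "category": ["health&beauty", "health&beauty", np.nan],
    "gender": [np.nan, "women", np.nan],
    "sub-category": ["makeup", "makeup", np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss"],
})

def impute_gender(row):
    # With apply(axis=1), row["title"] and row["gender"] are scalars, not Series.
    title, gender = row["title"], row["gender"]
    # isinstance guards against a missing (NaN) title, which is a float.
    if pd.isnull(gender) and isinstance(title, str) and "lip" in title.lower():
        return "women"
    return gender  # keep whatever was already there

# Pass only the two columns the rule needs.
df["gender"] = df[["gender", "title"]].apply(impute_gender, axis=1)
print(df)
```

On larger frames the vectorised options above should still be considerably faster.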