Explicitly set dummy variables in Python

Tags: python, pandas

Suppose I have data that looks like:

import pandas as pd

data = {'Name': ['Tom', 'Bob', 'Dan', 'Jack'],
        'Color1': ['red', 'red', 'black', 'blue'],
        'Color2': ['blue', 'green', 'green', 'white'],
        'Color3': ['orange', 'purple', 'white', 'red'],
        'Color4': ['', 'yellow', 'purple', '']
}
df = pd.DataFrame(data)

I want to set dummy variables for each person: if a specific color is listed for a person in any of Color1, Color2, Color3, or Color4, that person receives a 1; otherwise that person receives a 0. However, I don't want a dummy variable for every color that appears. I only want variables for the colors red, black, and yellow.

Thus the expected output would be:

result = {'Name': ['Tom', 'Bob', 'Dan', 'Jack'],
        'hasRed': [1, 1, 0, 1],
        'hasBlack': [0, 0, 1, 0],
        'hasYellow': [0, 1, 0, 0]
}
result_df = pd.DataFrame(result)

I know pandas has a get_dummies function, but I don't think it can be applied across multiple columns for only a specific set of values, which is what I need here. Any suggestions on how to do this?

asked Nov 24 '25 15:11 by lvnwrth


1 Answer

One option is to melt the dataframe into long format, keep only the rows whose value is one of the colors of interest, and then cross-tabulate names against colors:

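colors = ['red', 'black', 'yellow']  # the three colors the question asks about

# Reshape to long format (one row per Name/column pair), then keep only matching colors
tmp = (df.melt('Name')
    .loc[lambda x: x['value'].isin(colors)]
)

# Count occurrences of each color per name; a name with no matching color is absent from the result
pd.crosstab(tmp['Name'], tmp['value']).add_prefix('has_').reset_index()

Output:

value  Name  has_black  has_red  has_yellow
0       Bob          0        1           1
1       Dan          1        0           0
2      Jack          0        1           0
3       Tom          0        1           0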
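
A crosstab only contains rows for names that survive the filter, sorts them alphabetically, and reports counts rather than strict 0/1 flags. If every person should appear in the original order, one alternative (a minimal sketch, not part of the answer above) is to compare the color columns directly. It assumes every color column has 'Color' in its name, and it borrows the hasRed-style column names from the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Bob', 'Dan', 'Jack'],
    'Color1': ['red', 'red', 'black', 'blue'],
    'Color2': ['blue', 'green', 'green', 'white'],
    'Color3': ['orange', 'purple', 'white', 'red'],
    'Color4': ['', 'yellow', 'purple', ''],
})

colors = ['red', 'black', 'yellow']    # colors of interest, taken from the question
color_cols = df.filter(like='Color')   # assumption: color columns all contain 'Color'

result_df = df[['Name']].copy()
for color in colors:
    # True where any color column equals this color, then cast to 0/1
    result_df['has' + color.capitalize()] = (
        color_cols.eq(color).any(axis=1).astype(int)
    )

print(result_df)
```

Because the flags are computed row by row on the original frame, a person with none of the listed colors still gets a row of zeros, and a color that appears twice for the same person still yields 1.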

answered Nov 27 '25 04:11 by Quang Hoang


