
Interactions between dummy variables in Python

I'm trying to understand how to refer to columns after using get_dummies. For example, say I have three categorical variables: the first has 2 levels, the second has 5 levels, and the third has 2 levels.

import pandas as pd

df = pd.DataFrame({"a":["Yes","Yes","No","No","No","Yes","Yes"], "b":["a","b","c","d","e","a","c"],"c":["1","2","2","1","2","1","1"]})

I created dummies for all three variables in order to use them in an sklearn regression in Python.

df1 = pd.get_dummies(df,drop_first=True)
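
For reference, here is a minimal sketch of what that call produces on the sample data. The dtype=int argument is an assumption added for illustration: recent pandas versions return boolean dummies by default, and forcing integers keeps the 0/1 values that the rest of this page shows.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes"],
    "b": ["a", "b", "c", "d", "e", "a", "c"],
    "c": ["1", "2", "2", "1", "2", "1", "1"],
})

# dtype=int is an illustrative assumption: it forces 0/1 integers on pandas
# versions whose default dummy dtype is bool.
df1 = pd.get_dummies(df, drop_first=True, dtype=int)

# Each dummy is named <original column>_<level>; drop_first=True removes
# the first level of every variable.
print(list(df1.columns))
```

Because every dummy keeps its source column name as a prefix, the groups can later be selected by prefix instead of by listing each column.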

Now I want to create two sets of interactions (products): b*c and b*a.

How can I multiply each dummy variable by the dummies of another variable without writing out their specific names, like this:

df1['a_yes_b'] = df1['a_Yes']*df1['b_b']
df1['a_yes_c'] = df1['a_Yes']*df1['b_c']
df1['a_yes_d'] = df1['a_Yes']*df1['b_d']
df1['a_yes_e'] = df1['a_Yes']*df1['b_e']

df1['c_2_b'] = df1['c_2']*df1['b_b']
df1['c_2_c'] = df1['c_2']*df1['b_c']
df1['c_2_d'] = df1['c_2']*df1['b_d']
df1['c_2_e'] = df1['c_2']*df1['b_e']

Thanks.

Adi Milrad asked Jul 02 '26 08:07


1 Answer

You can create the new columns in loops. To select the column names for each variable, filter df1.columns by boolean indexing with str.startswith:

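# Match on the prefix plus the underscore so that only the dummies of that
# variable are selected
a = df1.columns[df1.columns.str.startswith('a_')]
b = df1.columns[df1.columns.str.startswith('b_')]
c = df1.columns[df1.columns.str.startswith('c_')]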

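# a x b interactions: multiply every b dummy by every a dummy
for col1 in b:
    for col2 in a:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])

# c x b interactions: multiply every b dummy by every c dummy
for col1 in b:
    for col2 in c:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])
print(df1)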

   a_Yes  b_b  b_c  b_d  b_e  c_2  a_Yes_b  a_Yes_c  a_Yes_d  a_Yes_e  c_2_b  \
0      1    0    0    0    0    0        0        0        0        0      0   
1      1    1    0    0    0    1        1        0        0        0      1   
2      0    0    1    0    0    1        0        0        0        0      0   
3      0    0    0    1    0    0        0        0        0        0      0   
4      0    0    0    0    1    1        0        0        0        0      0   
5      1    0    0    0    0    0        0        0        0        0      0   
6      1    0    1    0    0    0        0        1        0        0      0   

   c_2_c  c_2_d  c_2_e  
0      0      0      0  
1      0      0      0  
2      1      0      0  
3      0      0      0  
4      0      0      1  
5      0      0      0  
6      0      0      0  
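
The loops above can also be wrapped in a small reusable helper. This is only a sketch under stated assumptions: the function name add_interactions, its pairs argument and the dtype=int setting are hypothetical choices for illustration, not part of pandas or of the original answer. It reads the level name by stripping the prefix rather than by split('_')[1], so level names that themselves contain an underscore are kept whole.

```python
from itertools import product

import pandas as pd


def add_interactions(frame, pairs, sep="_"):
    """Return frame plus one product column per pair of dummies.

    pairs is a list of (left, right) original variable names, i.e. the
    prefixes that get_dummies put in front of each dummy column.
    (Hypothetical helper written for illustration.)
    """
    new_cols = {}
    for left, right in pairs:
        left_cols = [col for col in frame.columns if col.startswith(left + sep)]
        right_cols = [col for col in frame.columns if col.startswith(right + sep)]
        for lcol, rcol in product(left_cols, right_cols):
            # Strip "<right><sep>" to recover the level name of the right dummy.
            level = rcol[len(right) + len(sep):]
            new_cols[lcol + sep + level] = frame[lcol] * frame[rcol]
    # Build all new columns first, then attach them in a single concat.
    return pd.concat([frame, pd.DataFrame(new_cols, index=frame.index)], axis=1)


df = pd.DataFrame({
    "a": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes"],
    "b": ["a", "b", "c", "d", "e", "a", "c"],
    "c": ["1", "2", "2", "1", "2", "1", "1"],
})
# dtype=int is an assumption made here to keep 0/1 integers.
df1 = pd.get_dummies(df, drop_first=True, dtype=int)

out = add_interactions(df1, [("a", "b"), ("c", "b")])
print(out.columns.tolist())
```

All column groups are read from the frame that was passed in, before any interaction column is added, so a name such as a_Yes_b created for the first pair cannot be picked up by mistake when a later pair is processed.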

If a and c each have only one dummy column (true in the sample, possibly not in real data), you can instead use filter, mul, squeeze and concat, starting again from the df1 returned by get_dummies:

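# One sub-DataFrame per original variable, selected by column-name prefix
a = df1.filter(regex='^a')
b = df1.filter(regex='^b')
c = df1.filter(regex='^c')

# squeeze turns the one-column frames a and c into Series; mul with axis=0
# multiplies every b column by that Series row by row.
# rename swaps the leading 'b' of each name for the a or c column name.
dfa = b.mul(a.squeeze(), axis=0).rename(columns=lambda x: a.columns[0] + x[1:])
dfc = b.mul(c.squeeze(), axis=0).rename(columns=lambda x: c.columns[0] + x[1:])

df1 = pd.concat([df1, dfa, dfc], axis=1)
print(df1)
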
   a_Yes  b_b  b_c  b_d  b_e  c_2  a_Yes_b  a_Yes_c  a_Yes_d  a_Yes_e  c_2_b  \
0      1    0    0    0    0    0        0        0        0        0      0   
1      1    1    0    0    0    1        1        0        0        0      1   
2      0    0    1    0    0    1        0        0        0        0      0   
3      0    0    0    1    0    0        0        0        0        0      0   
4      0    0    0    0    1    1        0        0        0        0      0   
5      1    0    0    0    0    0        0        0        0        0      0   
6      1    0    1    0    0    0        0        1        0        0      0   

   c_2_c  c_2_d  c_2_e  
0      0      0      0  
1      0      0      0  
2      1      0      0  
3      0      0      0  
4      0      0      1  
5      0      0      0  
6      0      0      0  
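
To see what squeeze and mul do in isolation, here is a small sketch on made-up three-row data (the values are hypothetical and not taken from the question). It passes axis=1 to squeeze, an adjustment to the snippet above, so that a one-column frame becomes a Series even when it has a single row.

```python
import pandas as pd

# Hypothetical toy data: two b dummies and a single a dummy.
b = pd.DataFrame({"b_b": [0, 1, 1], "b_c": [1, 0, 0]})
a = pd.DataFrame({"a_Yes": [1, 1, 0]})

# A one-column DataFrame becomes a Series.
a_series = a.squeeze(axis=1)

# axis=0 aligns the Series with the row index, so each column of b is
# multiplied element-wise by the same Series.
dfa = b.mul(a_series, axis=0)

# name[1:] drops the first character only, so this renaming assumes the
# prefix being replaced ('b') is a single character.
dfa = dfa.rename(columns=lambda name: a.columns[0] + name[1:])
print(dfa)
```

If a had more than one column, squeeze would leave it as a DataFrame and the row-wise broadcast would no longer apply, which is why this variant is limited to single-column groups.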

jezrael answered Jul 03 '26 22:07