So I have two sets of features that I wish to bin (classify) and then combine to create a new feature. It is not unlike classifying coordinates into grids on a map.
The issue is that the features are not evenly distributed and I would like to use quantiles when binning (like with pandas.qcut()) on both features/coordinates.
Is there a better way than doing qcut() on both features and then concatenating the result labels?
Create a cartesian product categorical.
Consider the dataframe df
df = pd.DataFrame(dict(A=np.random.rand(20), B=np.random.rand(20)))
           A         B
0   0.538186  0.038985
1   0.185523  0.438329
2   0.652151  0.067359
3   0.746060  0.774688
4   0.373741  0.009526
5   0.603536  0.149733
6   0.775801  0.585309
7   0.091238  0.811828
8   0.504035  0.639003
9   0.671320  0.132974
10  0.619939  0.883372
11  0.301644  0.882258
12  0.956463  0.391942
13  0.702457  0.099619
14  0.367810  0.071612
15  0.454935  0.651631
16  0.882029  0.015642
17  0.880251  0.348386
18  0.496250  0.606346
19  0.805688  0.401578
We can create new categoricals with pd.qcut
d1 = df.assign(
    A_cut=pd.qcut(df.A, 2, labels=[1, 2]),
    B_cut=pd.qcut(df.B, 2, labels=list('ab'))
)
           A         B A_cut B_cut
0   0.538186  0.038985     1     a
1   0.185523  0.438329     1     b
2   0.652151  0.067359     2     a
3   0.746060  0.774688     2     b
4   0.373741  0.009526     1     a
5   0.603536  0.149733     1     a
6   0.775801  0.585309     2     b
7   0.091238  0.811828     1     b
8   0.504035  0.639003     1     b
9   0.671320  0.132974     2     a
10  0.619939  0.883372     2     b
11  0.301644  0.882258     1     b
12  0.956463  0.391942     2     a
13  0.702457  0.099619     2     a
14  0.367810  0.071612     1     a
15  0.454935  0.651631     1     b
16  0.882029  0.015642     2     a
17  0.880251  0.348386     2     a
18  0.496250  0.606346     1     b
19  0.805688  0.401578     2     b
You can create the cartesian product categorical with tuples
d2 = d1.assign(cartesian=pd.Categorical(d1.filter(regex='_cut').apply(tuple, 1)))
print(d2)
           A         B A_cut B_cut cartesian
0   0.538186  0.038985     1     a    (1, a)
1   0.185523  0.438329     1     b    (1, b)
2   0.652151  0.067359     2     a    (2, a)
3   0.746060  0.774688     2     b    (2, b)
4   0.373741  0.009526     1     a    (1, a)
5   0.603536  0.149733     1     a    (1, a)
6   0.775801  0.585309     2     b    (2, b)
7   0.091238  0.811828     1     b    (1, b)
8   0.504035  0.639003     1     b    (1, b)
9   0.671320  0.132974     2     a    (2, a)
10  0.619939  0.883372     2     b    (2, b)
11  0.301644  0.882258     1     b    (1, b)
12  0.956463  0.391942     2     a    (2, a)
13  0.702457  0.099619     2     a    (2, a)
14  0.367810  0.071612     1     a    (1, a)
15  0.454935  0.651631     1     b    (1, b)
16  0.882029  0.015642     2     a    (2, a)
17  0.880251  0.348386     2     a    (2, a)
18  0.496250  0.606346     1     b    (1, b)
19  0.805688  0.401578     2     b    (2, b)
If you were so inclined, you could even declare an ordering for them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With