I have a dataset with n observations and, say, two variables X1 and X2. I am trying to classify each observation based on a set of conditions on its (X1, X2) values. For example, the dataset looks like
df:
Index  X1   X2
1      0.2  0.8
2      0.6  0.2
3      0.2  0.1
4      0.9  0.3
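and the groups are defined by
Group 1: X1 < 0.5 and X2 >= 0.5
Group 2: X1 >= 0.5 and X2 >= 0.5
Group 3: X1 < 0.5 and X2 < 0.5
Group 4: X1 >= 0.5 and X2 < 0.5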
I'd like to generate the following dataframe.
expected result:
Index  X1   X2   Group
1      0.2  0.8  1
2      0.6  0.2  4
3      0.2  0.1  3
4      0.9  0.3  4
Also, would it be better/faster to work with numpy arrays for this type of problem?
In answer to your last question, I definitely think pandas is a good tool for this; it could be done in numpy, but pandas is arguably more intuitive when working with dataframes, and fast enough for most applications. pandas and numpy also play really nicely together. For instance, in your case, you can use numpy.select to build your pandas column:
import numpy as np
import pandas as pd
# Lay out your conditions
conditions = [((df.X1 < 0.5) & (df.X2 >= 0.5)),
              ((df.X1 >= 0.5) & (df.X2 >= 0.5)),
              ((df.X1 < 0.5) & (df.X2 < 0.5)),
              ((df.X1 >= 0.5) & (df.X2 < 0.5))]
# Name the resulting groups (in the same order as the conditions)
choicelist = [1, 2, 3, 4]
df['group'] = np.select(conditions, choicelist, default=-1)
# Above, I've set the default to -1, but change it as you see fit.
# If none of your conditions are met, that row is classified as -1.
>>> df
   Index   X1   X2  group
0      1  0.2  0.8      1
1      2  0.6  0.2      4
2      3  0.2  0.1      3
3      4  0.9  0.3      4
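If you do want to try the plain numpy route mentioned above, here is a minimal, self-contained sketch. It rebuilds the sample data from the question and adds one hypothetical fifth row with a missing X2 (not part of the original data) purely to show the default value in action. The names x1, x2, choices and groups are just illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one hypothetical row (Index 5)
# with a missing X2 so that none of the conditions match it.
df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5],
    "X1": [0.2, 0.6, 0.2, 0.9, 0.7],
    "X2": [0.8, 0.2, 0.1, 0.3, np.nan],
})

# Pull the columns out as plain NumPy arrays
x1 = df["X1"].to_numpy()
x2 = df["X2"].to_numpy()

# Same four conditions as above, evaluated on arrays instead of Series
conditions = [
    (x1 < 0.5) & (x2 >= 0.5),
    (x1 >= 0.5) & (x2 >= 0.5),
    (x1 < 0.5) & (x2 < 0.5),
    (x1 >= 0.5) & (x2 < 0.5),
]
choices = [1, 2, 3, 4]

# Rows matching no condition (here, the NaN row) get the default
groups = np.select(conditions, choices, default=-1)

df["group"] = groups
print(df)
```

Comparisons against NaN evaluate to False, so the extra row falls through every condition and ends up with -1. Whether working on arrays is noticeably faster than working on the Series directly will depend on your data size, so it is worth timing both on your own dataset.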