Given a dataframe my goal is to sample rows such that values in one column are as balanced as possible.
Say I have a dataframe below, the sample size is 3
and target column is c
a | b | c
1 | 2 | 0
3 | 4 | 0
5 | 6 | 1
7 | 8 | 2
9 | 10| 2
11| 12| 2
One of possible samples would be
a | b | c
1 | 2 | 0
5 | 6 | 1
7 | 8 | 2
In case of sample size is not a multiple of the number of unique classes, it is fine to have difference in 1 item or so.
How would I approach this in pandas?
EDIT: provided solution that worked for me in answers
I first generated sample sizes for each unique value of column c so that it is balanced. The remainders are distributed over the first few elements
unique_values = df['c'].unique()
sample_sizes = [(k//len(df.columns))] * len(unique_values)
i = 0
while i < k%len(df.columns):
sample_sizes[i]+= 1
i= I+1
This bit generates the samples based on the generated sample sizes
df2= pd.concat([df.loc[df['c'] == unique_values[i]].sample() for i in range(len(sample_sizes)) for j in range(sample_sizes[i])])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With