Balanced row sample from dataframe with pandas given categorical target column

Tags:

python

pandas

Given a dataframe, my goal is to sample rows such that the values in one column are as balanced as possible. Say I have the dataframe below, the sample size is 3, and the target column is c:

a  | b  | c

1  | 2  | 0
3  | 4  | 0
5  | 6  | 1
7  | 8  | 2
9  | 10 | 2
11 | 12 | 2

One possible sample would be:

a | b | c

1 | 2 | 0
5 | 6 | 1
7 | 8 | 2

If the sample size is not a multiple of the number of unique classes, it is fine for the class counts to differ by one item or so.

How would I approach this in pandas?

EDIT: the solution that worked for me is provided in the answers.

YohanRoth asked Sep 01 '25 03:09



1 Answer

I first generated a sample size for each unique value of column c so that the classes are balanced, where k is the desired total sample size. Any remainder is distributed over the first few classes:

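import pandas as pd

k = 3  # desired total sample size (3 in the question's example)
unique_values = df['c'].unique()
# Base size per class: divide by the number of classes, not the number of columns
sample_sizes = [k // len(unique_values)] * len(unique_values)
# Spread the remainder over the first few classes, one extra row each
for i in range(k % len(unique_values)):
    sample_sizes[i] += 1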

This bit draws the samples based on the generated sample sizes:

# Draw sample_sizes[i] distinct rows per class; raises ValueError if a class has fewer rows than requested
df2 = pd.concat([df.loc[df['c'] == value].sample(n=size) for value, size in zip(unique_values, sample_sizes)])
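
The two steps above can also be wrapped into one function. The following is only a sketch, not part of the original answer: the name `balanced_sample` and its parameters are made up for illustration, and it assumes that when a class has too few rows, the leftover slots should go to the other classes. Slots are handed out one at a time, round-robin, so class counts differ by at most one unless a class runs out of rows.

```python
import pandas as pd


def balanced_sample(df, target, k, random_state=None):
    """Sample k rows so counts per value of `target` are as even as possible.

    Hypothetical helper for illustration; rows whose target is NaN are ignored.
    """
    # Number of available rows per class (value_counts drops NaN by default)
    capacity = df[target].value_counts().to_dict()
    if not 0 <= k <= sum(capacity.values()):
        raise ValueError("k must be between 0 and the number of usable rows")
    if k == 0:
        return df.iloc[:0]

    # Hand out one slot per class per round, skipping exhausted classes
    sizes = dict.fromkeys(capacity, 0)
    remaining = k
    while remaining > 0:
        for value in capacity:
            if remaining == 0:
                break
            if sizes[value] < capacity[value]:
                sizes[value] += 1
                remaining -= 1

    # Draw the allotted number of distinct rows from each class
    parts = [
        df[df[target] == value].sample(n=size, random_state=random_state)
        for value, size in sizes.items()
        if size > 0
    ]
    return pd.concat(parts)


# The dataframe from the question
df = pd.DataFrame({
    "a": [1, 3, 5, 7, 9, 11],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [0, 0, 1, 2, 2, 2],
})
print(balanced_sample(df, "c", 3, random_state=0))
```

Because `value_counts` orders classes by frequency, any extra slots in this sketch go to the largest classes first, which are also the least likely to run out of rows.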
Mayowa Ayodele answered Sep 04 '25 04:09
