Balanced row sample from dataframe with pandas given categorical target column

Question

Given a dataframe, my goal is to sample rows so that the values in one column are as balanced as possible. Say I have the dataframe below, the sample size is 3, and the target column is c:

a | b | c

1 | 2 | 0
3 | 4 | 0
5 | 6 | 1
7 | 8 | 2
9 | 10 | 2
11 | 12 | 2

One possible sample would be:

a | b | c

1 | 2 | 0
5 | 6 | 1
7 | 8 | 2

If the sample size is not a multiple of the number of unique classes, it is fine for the class counts to differ by one item or so.

How would I approach this in pandas?

EDIT: provided solution that worked for me in answers

Mayowa Ayodele · Accepted Answer

I first generated a sample size for each unique value of column c so that the result is balanced. Any remainder is distributed over the first few values, one extra row each.

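import pandas as pd

k = 3  # desired total sample size (3 in the question's example)
unique_values = df['c'].unique()
n_classes = len(unique_values)
# Every class gets k // n_classes rows; divide by the number of classes, not columns
sample_sizes = [k // n_classes] * n_classes
# The first k % n_classes classes each get one extra row
for i in range(k % n_classes):
    sample_sizes[i] += 1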

This bit generates the samples based on the generated sample sizes

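# Draw sample_sizes[i] distinct rows per class in one call, so no row is picked twice.
# Note: sample(n=size) raises ValueError if a class has fewer than size rows.
df2 = pd.concat([df.loc[df['c'] == value].sample(n=size) for value, size in zip(unique_values, sample_sizes)])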
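The approach above can run into trouble when a class has fewer rows than its share, as class 1 would in the question's dataframe for a sample size of 5. The following is a minimal sketch of one way to handle that, not part of the original answer: it hands out slots one per class per pass and skips classes that have run out of rows. The function name balanced_sample and its parameters (target, k, random_state) are hypothetical, chosen only for illustration.

```python
import pandas as pd


def balanced_sample(df, target, k, random_state=None):
    """Sample k rows so that counts per value of `target` are as even as possible."""
    # Rows available per class (rows with a missing target are not counted).
    available = df[target].value_counts(sort=False).to_dict()
    if k > sum(available.values()):
        raise ValueError("k exceeds the number of rows with a target value")

    # Hand out one slot per class per pass, skipping exhausted classes.
    quota = dict.fromkeys(available, 0)
    remaining = k
    while remaining > 0:
        for value in quota:
            if remaining > 0 and quota[value] < available[value]:
                quota[value] += 1
                remaining -= 1

    # Draw each class's quota without replacement and stack the pieces.
    parts = [
        df[df[target] == value].sample(n=n, random_state=random_state)
        for value, n in quota.items()
        if n > 0
    ]
    return pd.concat(parts)


# The dataframe from the question.
df = pd.DataFrame({
    "a": [1, 3, 5, 7, 9, 11],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [0, 0, 1, 2, 2, 2],
})
sample = balanced_sample(df, "c", 3, random_state=0)
```

With a sample size of 3 this returns one row per class. With a sample size of 5, class 1 contributes its only row and classes 0 and 2 contribute two rows each, so the counts differ by at most one wherever enough rows exist.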

Tags:

python

pandas

YohanRoth
