Make DataFrame balanced with respect to a specific column

Question

I have a DataFrame in Python that looks as follows.

  Text  Label
0  abc      0
1  def      1
2  ghi      1
3   .       .
4   .       .
5   .       .

There are 100 rows with label '1' and only 50 with label '0'. I would like a balanced set with 50 rows of label '0' and 50 rows of label '1'. It does not matter which rows with label '1' get thrown away.

Is there any concise way of writing this in Python?

cs95 · Accepted Answer

Use groupby and head:

df = df.groupby('Label').head(50)

This keeps the first 50 rows of each Label group. For Label 1, the first 50 rows are kept and the remaining 50 are discarded; all 50 rows with Label 0 are kept.

To pick the last 50, replace head(50) with tail(50).
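As a quick check of the head/tail approach, here is a minimal sketch on made-up data. The DataFrame contents and the names balanced_head and balanced_tail are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Hypothetical data matching the question: 50 rows labelled 0, 100 rows labelled 1.
df = pd.DataFrame({
    "Text": [f"text_{i}" for i in range(150)],
    "Label": [0] * 50 + [1] * 100,
})

# Keep the first 50 rows of each label; the original index values are preserved.
balanced_head = df.groupby("Label").head(50)

# Keep the last 50 rows of each label instead.
balanced_tail = df.groupby("Label").tail(50)

# Both results should contain 50 rows per label.
print(balanced_head["Label"].value_counts())
print(balanced_tail["Label"].value_counts())
```

Neither call renumbers the rows, so add .reset_index(drop=True) if a fresh 0..N-1 index is wanted.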

To pick 50 rows at random, use apply + sample:

df = (df.groupby('Label', as_index=False)
        .apply(lambda x: x.sample(n=50))
        .reset_index(drop=True))

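Note that if any group has fewer than N (= 50) rows, this will not work: sample draws without replacement by default and raises a ValueError when asked for more rows than the group contains.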
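One way around the group-size limitation is to sample down to the size of the smallest group rather than hard-coding 50. The sketch below assumes pandas 1.1 or later, where GroupBy.sample is available; the data and the names n and balanced are made up for illustration, and this is an alternative to the apply + sample version above rather than the answer's own code.

```python
import pandas as pd

# Hypothetical unbalanced data: 50 rows labelled 0, 100 rows labelled 1.
df = pd.DataFrame({
    "Text": [f"text_{i}" for i in range(150)],
    "Label": [0] * 50 + [1] * 100,
})

# Size of the smallest group; every group has at least this many rows.
n = int(df.groupby("Label").size().min())

# Draw n rows per group without replacement (requires pandas >= 1.1).
# random_state is fixed here only to make the sketch reproducible.
balanced = (df.groupby("Label")
              .sample(n=n, random_state=0)
              .reset_index(drop=True))

print(balanced["Label"].value_counts())
```

Because n is derived from the data, the same code should also work when the minority class has fewer than 50 rows.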

Tags: python, pandas, dataframe

ryekos
