Make DataFrame balanced with respect to a specific column

I have a DataFrame in Python that looks as follows.

  Text  Label
0  abc      0
1  def      1
2  ghi      1
3   .       .
4   .       .
5   .       .

There are 100 rows with label '1' and only 50 with label '0'. I would like a balanced set with 50 rows of label '0' and 50 rows of label '1'. It does not matter which rows with label '1' get thrown away.

Is there any concise way of writing this in Python?

ryekos asked Oct 26 '25 08:10


1 Answer

Use groupby and head:

df = df.groupby('Label').head(50)

This keeps the first 50 rows of each Label group (0 and 1), in their original order. For Label 1, the first 50 rows are kept and the remaining 50 are discarded.

To pick the last 50, replace head(50) with tail(50).
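
As a quick check of the head approach, here is a minimal sketch on toy data. The frame below is made up to match the counts in the question (50 rows with Label 0, 100 with Label 1); the Text values are placeholders.

```python
import pandas as pd

# Toy stand-in for the question's frame: 50 rows of Label 0, then 100 rows of Label 1.
df = pd.DataFrame({
    "Text": [f"t{i}" for i in range(150)],
    "Label": [0] * 50 + [1] * 100,
})

# Keep at most the first 50 rows of each Label group.
balanced = df.groupby("Label").head(50)

# Per-label row counts after balancing.
counts = balanced["Label"].value_counts().to_dict()
print(counts)
```

Since head only filters rows, the result keeps the original index and row order; call reset_index(drop=True) afterwards if a fresh 0..N-1 index is wanted.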

To pick 50 rows at random, use apply + sample:

df = (df.groupby('Label', as_index=False)
        .apply(lambda x: x.sample(n=50))
        .reset_index(drop=True))

Note that if any group has fewer than N (= 50) rows, sample(n=50) raises a ValueError, because by default it samples without replacement.
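
One possible way around that limit, assuming it is acceptable to trim every group down to the size of the smallest one, is to compute that size first and sample that many rows per group. This sketch uses DataFrameGroupBy.sample, which needs pandas 1.1 or later; the toy data and the random_state value are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Toy stand-in for the question's frame: 50 rows of Label 0, 100 rows of Label 1.
df = pd.DataFrame({
    "Text": [f"t{i}" for i in range(150)],
    "Label": [0] * 50 + [1] * 100,
})

# Size of the smallest group, so no group is asked for more rows than it has.
n = int(df["Label"].value_counts().min())

# Sample n rows per group without replacement (requires pandas >= 1.1).
# random_state is set only to make the sketch reproducible.
balanced = (df.groupby("Label")
              .sample(n=n, random_state=0)
              .reset_index(drop=True))

# Per-label row counts after balancing.
counts = balanced["Label"].value_counts().to_dict()
print(counts)
```

Passing replace=True to sample instead would allow drawing more rows than a group holds, at the cost of duplicated rows.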

cs95 answered Oct 28 '25 23:10