I have a DataFrame in Python that looks as follows.
  Text  Label
0  abc      0
1  def      1
2  ghi      1
3    .      .
4    .      .
5    .      .
There are 100 rows with label '1' and only 50 with label '0'. I would like a balanced set with 50 rows of label '0' and 50 rows of label '1'. It does not matter which rows with label '1' are thrown away.
Is there any concise way of writing this in Python?
Use groupby and head:
df = df.groupby('Label').head(50)
This keeps the first 50 rows of each Label group (0 and 1) in their original order. For Label 1, the first 50 rows are kept and the rest are discarded.
To pick the last 50, replace head(50) with tail(50).
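As a quick check, here is a minimal sketch of both variants on made-up data. The frame below (150 rows named row0 to row149, with 50 zeros followed by 100 ones) is an assumption for illustration only, not the asker's data.

```python
import pandas as pd

# Hypothetical data: 50 rows labelled 0 followed by 100 rows labelled 1.
df = pd.DataFrame({
    "Text": [f"row{i}" for i in range(150)],
    "Label": [0] * 50 + [1] * 100,
})

first_50 = df.groupby("Label").head(50)  # first 50 rows of each label
last_50 = df.groupby("Label").tail(50)   # last 50 rows of each label

# Count how many rows of each label survived.
counts = first_50["Label"].value_counts().to_dict()
print(counts)
```

Both results keep the original index, so you can see which rows survived; chain .reset_index(drop=True) if you want a fresh 0..99 index.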
To pick 50 rows at random, use apply + sample:
df = (df.groupby('Label', as_index=False)
        .apply(lambda x: x.sample(n=50))
        .reset_index(drop=True))
Note: if any group has fewer than N (=50) rows, sample raises a ValueError, because it cannot draw more rows than the group contains without replacement.
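One way around the group-size problem is to derive N from the data instead of hard-coding it. The sketch below assumes pandas 1.1 or newer, where groupby objects have their own sample method; the example frame and the random_state value are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 50 rows labelled 0 followed by 100 rows labelled 1.
df = pd.DataFrame({
    "Text": [f"row{i}" for i in range(150)],
    "Label": [0] * 50 + [1] * 100,
})

# Size of the smallest group, so every group can supply n rows.
n = int(df["Label"].value_counts().min())

balanced = (
    df.groupby("Label")
      .sample(n=n, random_state=0)  # random_state only makes the draw repeatable
      .reset_index(drop=True)
)
print(balanced["Label"].value_counts().to_dict())
```

Because n is taken from the smallest group, the draw never asks for more rows than a group has, and the same code works if the class sizes change.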