
Randomly selecting rows from a dataframe based on a column value

I have a pandas DataFrame as follows:

col1, col2, label
a    b      0
b    b ,    0
.
.
..........  0
..........  1

and the value_counts for the label column:

df['label'].value_counts():

0: 200000
1: 10000

I want to randomly select 50000 of the rows where label is 0, keeping all rows where label is 1, so that my value_counts become:

0: 50000
1: 10000
Adarsh Ravi asked Sep 01 '25 03:09



1 Answer

Filter the rows for each label value and sample N rows from each. Then take the indexes of the two samples, combine them with union, and select those rows with loc:

s0 = df.label[df.label.eq(0)].sample(50000).index
s1 = df.label[df.label.eq(1)].sample(10000).index 

df = df.loc[s0.union(s1)]

Of course, you don't need to sample 10000 rows for s1 if you're keeping all of them :) It's just there for illustration.
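To try the approach end to end, here is a minimal sketch on a small made-up frame. The data, the sizes (2,000 and 100 rows instead of 200,000 and 10,000), the `n_keep` name and the `random_state` seed are all assumptions for illustration, not part of the question. It also assumes the index is unique, since `loc` would return extra rows for duplicated index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2,000 rows with label 0 and 100 rows with label 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col1": rng.choice(list("ab"), size=2100),
    "col2": rng.choice(list("ab"), size=2100),
    "label": [0] * 2000 + [1] * 100,
})

n_keep = 500  # how many label-0 rows to keep (assumed for this toy example)

# Index labels of a random sample of the label-0 rows.
# random_state is optional; it only makes the sample repeatable.
s0 = df[df["label"].eq(0)].sample(n=n_keep, random_state=0).index

# Index labels of every label-1 row (no sampling needed to keep them all).
s1 = df.index[df["label"].eq(1)]

# Combine both sets of index labels and select those rows.
balanced = df.loc[s0.union(s1)]

print(balanced["label"].value_counts())
```

Because `union` returns the combined index in sorted order here, the selected rows keep their original relative order rather than the random order in which they were sampled.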

rafaelc answered Sep 02 '25 17:09
