Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional count of number of rows with value greater than the current one

Maybe a bit of beginner's question, but my mind is really stuck.

I have a dataframe with certain values in a column called x, split into two groups.

   x     group
1  1.7   a
2  0     b
3  2.3   b
4  2.7   b
5  8.6   a
6  5.4   b
7  4.2   a
8  5.7   b

My purpose is for each row, to count how many rows of the other group have a value greater than the current one. So to make it more clear, for the first row (group a) I am looking to find how many rows of group b are greater than 1.7 (the answer is 4). The end result should look like :

   x     group   result
1  1.7   a       4
2  0     b       3
3  2.3   b       2
4  2.7   b       2
5  8.6   a       0
6  5.4   b       1
7  4.2   a       2
8  5.7   b       1

I have several rows in the dataframe, so ideally I would also like a relatively fast solution.

like image 831
kon176 Avatar asked Dec 05 '25 06:12

kon176


1 Answers

Use np.searchsorted:

df['result'] = 0

a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']

df.loc[a.index, 'result'] = len(b) - np.searchsorted(np.sort(b), a)
df.loc[b.index, 'result'] = len(a) - np.searchsorted(np.sort(a), b)

Output:

>>> df
     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1

Performance for 130K records

>>> %%timeit
    a = df.loc[df['group'] == 'a', 'x']
    b = df.loc[df['group'] == 'b', 'x']
    len(b) - np.searchsorted(np.sort(b), a)
    len(a) - np.searchsorted(np.sort(a), b)

31.8 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Setup:

N = 130000
df = pd.DataFrame({'x': np.random.randint(1, 1000, N),
                   'group': np.random.choice(['a', 'b'], N, p=(0.7, 0.3))})
like image 158
Corralien Avatar answered Dec 06 '25 23:12

Corralien