Maybe a bit of a beginner's question, but I'm really stuck.
I have a dataframe with certain values in a column called x, split into two groups.
     x group
1  1.7     a
2  0.0     b
3  2.3     b
4  2.7     b
5  8.6     a
6  5.4     b
7  4.2     a
8  5.7     b
For each row, I want to count how many rows of the other group have a value greater than the current one. To make it clearer: for the first row (group a), I want to find how many rows of group b have an x greater than 1.7 (the answer is 4). The end result should look like this:
     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1
The dataframe has many rows, so ideally I would also like a relatively fast solution.
Use np.searchsorted:
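import numpy as np
df['result'] = 0
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
# side='right' skips values equal to the current one, so only strictly greater values are counted
df.loc[a.index, 'result'] = len(b) - np.searchsorted(np.sort(b), a, side='right')
df.loc[b.index, 'result'] = len(a) - np.searchsorted(np.sort(a), b, side='right')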
Output:
>>> df
     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1
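To check the approach end to end, here is a minimal, self-contained sketch that rebuilds the example dataframe from the question and compares the searchsorted result with a brute-force pairwise comparison. The helper names `count_greater_in_other_group` and `brute_force` are made up for illustration, and the sketch assumes there are exactly two groups and that "greater" means strictly greater (hence `side="right"`).

```python
import numpy as np
import pandas as pd

# Example data from the question (index 1..8).
df = pd.DataFrame(
    {"x": [1.7, 0, 2.3, 2.7, 8.6, 5.4, 4.2, 5.7],
     "group": list("abbbabab")},
    index=range(1, 9),
)

def count_greater_in_other_group(df):
    """Hypothetical helper: per row, count rows of the other group with strictly larger x.

    Assumes exactly two groups: 'a' and everything else.
    """
    is_a = (df["group"] == "a").to_numpy()
    x = df["x"].to_numpy()
    a_sorted = np.sort(x[is_a])
    b_sorted = np.sort(x[~is_a])
    result = np.empty(len(df), dtype=np.int64)
    # side="right" returns the number of elements <= value,
    # so subtracting from the length leaves the strictly greater ones.
    result[is_a] = len(b_sorted) - np.searchsorted(b_sorted, x[is_a], side="right")
    result[~is_a] = len(a_sorted) - np.searchsorted(a_sorted, x[~is_a], side="right")
    return pd.Series(result, index=df.index, name="result")

def brute_force(df):
    """O(n^2) reference: compare every row with every row of the other group."""
    x = df["x"].to_numpy()
    g = df["group"].to_numpy()
    greater = x[None, :] > x[:, None]   # greater[i, j] is True when x[j] > x[i]
    other = g[None, :] != g[:, None]    # other[i, j] is True when row j is in the other group
    return pd.Series((greater & other).sum(axis=1), index=df.index, name="result")

df["result"] = count_greater_in_other_group(df)
print(df)
```

The brute-force version builds an n-by-n boolean matrix, so it is only practical as a sanity check on small inputs; the sorted lookup is the part that scales.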
Performance for 130K records:
>>> %%timeit
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
len(b) - np.searchsorted(np.sort(b), a)
len(a) - np.searchsorted(np.sort(a), b)
31.8 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Setup:
import numpy as np
import pandas as pd
N = 130000
df = pd.DataFrame({'x': np.random.randint(1, 1000, N),
                   'group': np.random.choice(['a', 'b'], N, p=(0.7, 0.3))})
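One thing worth keeping in mind with this setup: the x values are integers from a small range, so many values tie across the two groups. With ties, the `side` argument of `np.searchsorted` decides whether equal values are counted. The small sketch below (toy arrays, not the benchmark data) shows the difference between counting "greater than or equal" and "strictly greater".

```python
import numpy as np

b_sorted = np.array([1, 2, 2, 3])   # already sorted values of the "other" group
a = np.array([2])                   # a value that ties with two entries of b_sorted

# side="left" (the default): insertion point before the ties -> counts values >= 2
at_least = len(b_sorted) - np.searchsorted(b_sorted, a, side="left")

# side="right": insertion point after the ties -> counts values > 2
strictly_greater = len(b_sorted) - np.searchsorted(b_sorted, a, side="right")

print(at_least)          # [3]
print(strictly_greater)  # [1]
```

For float data without duplicates, as in the question's example, both choices give the same counts.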