python - choose top k elements of column 2 for each value of column 1 in sorted numpy matrix

I have a NumPy array that is sorted lexicographically (in descending order) on its first two columns, like this:

c1  c2  c3
2  0.9  3223  
2  0.8  7899  
2  0.7  23211  
2  0.6  3232  
2  0.5  4478  
1  0.9  342    
1  0.8  3434  
1  0.7  24232   
1  0.6  332  
1  0.5  478

I want the value of c3 for the top two rows of each c1 value, so the output should be: 3223, 7899, 342, 3434

What is the easiest way to do this in Python?

Shweta asked Jan 29 '26 01:01


1 Answer

Assuming the data is in a NumPy array like this (ignore the scientific notation):

In [86]: arr
Out[86]: 
array([[  1.00000000e+00,   9.00000000e-01,   3.22300000e+03],
       [  1.00000000e+00,   8.00000000e-01,   7.89900000e+03],
       [  1.00000000e+00,   7.00000000e-01,   2.32110000e+04],
       [  1.00000000e+00,   6.00000000e-01,   3.23200000e+03],
       [  1.00000000e+00,   5.00000000e-01,   4.47800000e+03],
       [  2.00000000e+00,   9.00000000e-01,   3.42000000e+02],
       [  2.00000000e+00,   8.00000000e-01,   3.43400000e+03],
       [  2.00000000e+00,   7.00000000e-01,   2.42320000e+04],
       [  2.00000000e+00,   6.00000000e-01,   3.32000000e+02],
       [  2.00000000e+00,   5.00000000e-01,   4.78000000e+02]])

You can do:

arr[np.roll(arr[:,0], k) != arr[:,0],2]

Example:

In [87]: arr[np.roll(arr[:,0], 2) != arr[:,0],2]
Out[87]: array([ 3223.,  7899.,   342.,  3434.])
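To try this outside an IPython session, here is a minimal self-contained sketch. The array is built from the question's table; the names `shifted` and `mask` are introduced here only to expose the intermediate steps and are not part of the original answer.

```python
import numpy as np

# Data from the question: columns are c1, c2, c3.
arr = np.array([
    [2, 0.9, 3223],
    [2, 0.8, 7899],
    [2, 0.7, 23211],
    [2, 0.6, 3232],
    [2, 0.5, 4478],
    [1, 0.9, 342],
    [1, 0.8, 3434],
    [1, 0.7, 24232],
    [1, 0.6, 332],
    [1, 0.5, 478],
])

k = 2
# c1 moved down by k rows; np.roll wraps the last k values to the front.
shifted = np.roll(arr[:, 0], k)
# True wherever a row's c1 differs from the c1 found k rows earlier.
mask = shifted != arr[:, 0]

print(mask)
print(arr[mask, 2])
```

Printing `mask` makes it easy to see which rows are kept before the c3 column is selected.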

Explanation:

We shift (roll) c1 by k positions to get c1'. The rows where c1 != c1' are the first k rows for each distinct value of c1 (or fewer than k if that value of c1 has fewer than k rows). We use this boolean mask to index the original array and get the c3 values we want.
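One caveat: `np.roll` wraps around, so the first k rows are compared against the last k rows. That is harmless when those rows hold a different c1 value, but it drops rows when they do not, for example when the array contains a single c1 value. The following is a sketch of a variant without wrap-around, assuming rows with equal c1 are contiguous; the function name `top_k_per_group` and its parameters are hypothetical, and this is a different technique (position within each run) from the roll-based one-liner.

```python
import numpy as np

def top_k_per_group(arr, k, key_col=0, value_col=2):
    """Return value_col for the first k rows of each run of equal key_col values."""
    keys = arr[:, key_col]
    n = len(keys)
    if n == 0:
        return arr[:0, value_col]
    # True where a new run of equal keys begins (row 0 always starts a run).
    starts = np.empty(n, dtype=bool)
    starts[0] = True
    starts[1:] = keys[1:] != keys[:-1]
    # For every row, the index of the row that starts its run.
    idx = np.arange(n)
    run_start = np.maximum.accumulate(np.where(starts, idx, 0))
    # Position within the run is idx - run_start; keep positions 0..k-1.
    return arr[idx - run_start < k, value_col]

# A single group: the roll-based mask selects nothing, the variant keeps two rows.
single = np.array([[1, 0.9, 10], [1, 0.8, 20], [1, 0.7, 30]], dtype=float)
rolled = single[np.roll(single[:, 0], 2) != single[:, 0], 2]
print(rolled.size)
print(top_k_per_group(single, 2))
```

For inputs like the one in the question, both approaches select the same rows; the variant only differs in the wrap-around cases.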

The approach is fully vectorized and therefore quite efficient. Finding the first 5 values for each c1 in an array with 100,000 rows and 1,000 distinct c1 values (c1 from 1 to 1000, c2 from 100 down to 1 for each c1, c3 random) takes only about 2.4 ms on my computer:

In [132]: c1 = np.repeat(np.linspace(1,1000, 1000), 100)

In [133]: c2 = np.tile(np.linspace(100, 1, 100), 1000)

In [134]: c3 = np.random.randint(1, 10001, size=100000)  # replaces the deprecated np.random.random_integers(1, 10000, ...)

In [135]: arr = np.column_stack((c1, c2, c3))

In [136]: arr
Out[136]: 
array([[  1.00000000e+00,   1.00000000e+02,   2.21700000e+03],
       [  1.00000000e+00,   9.90000000e+01,   9.23000000e+03],
       [  1.00000000e+00,   9.80000000e+01,   1.47900000e+03],
       ..., 
       [  1.00000000e+03,   3.00000000e+00,   7.41600000e+03],
       [  1.00000000e+03,   2.00000000e+00,   2.08000000e+03],
       [  1.00000000e+03,   1.00000000e+00,   3.41300000e+03]])

In [137]: %timeit arr[ np.roll(arr[:,0], 5) != arr[:,0], 2]
100 loops, best of 3: 2.36 ms per loop
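The benchmark can also be run as a plain script. This is a sketch under a few assumptions: it uses `np.random.default_rng` with an arbitrarily chosen seed in place of the session's random call, wraps the one-liner in a helper named `first_k` (a hypothetical name), and times it with the standard `timeit` module instead of `%timeit`. Timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for repeatability

# 1000 groups of 100 rows each: c1 from 1 to 1000, c2 from 100 down to 1.
c1 = np.repeat(np.arange(1, 1001), 100)
c2 = np.tile(np.arange(100, 0, -1), 1000)
c3 = rng.integers(1, 10001, size=100_000)  # upper bound is exclusive
arr = np.column_stack((c1, c2, c3)).astype(float)

def first_k(arr, k):
    # Roll-based selection from the answer: c3 of the first k rows per c1.
    return arr[np.roll(arr[:, 0], k) != arr[:, 0], 2]

result = first_k(arr, 5)
elapsed = timeit.timeit(lambda: first_k(arr, 5), number=100) / 100
print(f"{result.size} values selected, {elapsed * 1e3:.2f} ms per call")
```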
LeartS answered Jan 30 '26 15:01