I have a dataset like this:
[[0,1],
 [0,2],
 [0,3],
 [0,4],
 [1,5],
 [1,6],
 [1,7],
 [2,8],
 [2,9]]
I need to delete the first row of each group of rows, where the groups are defined by the value in the first column. So first I take all rows that have 0 in the first column and delete the first one, [0,1]. Then I take the rows with 1 in the first column and delete the first one, [1,5]; next I delete [2,8], and so on. In the end, I would like to have a dataset like this:
[[0,2],
 [0,3],
 [0,4],
 [1,6],
 [1,7],
 [2,9]]
EDIT: Can this be done in numpy? My dataset is very large, so a for loop over all elements takes at least 4 minutes to complete.
As requested, a numpy solution:
import numpy as np
a = np.array([[0,1], [0,2], [0,3], [0,4], [1,5], [1,6], [1,7], [2,8], [2,9]])
_,i = np.unique(a[:,0], return_index=True)
b = np.delete(a, i, axis=0)
(The above was edited to incorporate @Jaime's solution; here is my original masking solution, for posterity's sake.)
m = np.ones(len(a), dtype=bool)
m[i] = False
b = a[m]
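For anyone who wants to check the two variants side by side, here is a minimal sketch that wraps both in one function and compares the result with the output requested in the question. The function name `drop_first_per_group` and the `use_mask` flag are made up for this illustration and are not part of numpy.

```python
import numpy as np

def drop_first_per_group(a, use_mask=True):
    """Drop the first row of each group defined by column 0 (illustrative sketch)."""
    # return_index=True gives the index of the first occurrence of each
    # distinct value in column 0, i.e. the rows we want to remove.
    _, first = np.unique(a[:, 0], return_index=True)
    if use_mask:
        # Masking variant: start with "keep everything", then switch off
        # the first row of every group.
        keep = np.ones(len(a), dtype=bool)
        keep[first] = False
        return a[keep]
    # np.delete variant: remove the same rows by index.
    return np.delete(a, first, axis=0)

a = np.array([[0, 1], [0, 2], [0, 3], [0, 4],
              [1, 5], [1, 6], [1, 7], [2, 8], [2, 9]])
expected = np.array([[0, 2], [0, 3], [0, 4], [1, 6], [1, 7], [2, 9]])

# Both variants should produce the dataset asked for in the question.
assert np.array_equal(drop_first_per_group(a, use_mask=True), expected)
assert np.array_equal(drop_first_per_group(a, use_mask=False), expected)
```

Neither variant modifies `a`; both return a new array.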
Interestingly, the mask seems to be faster:
In [225]: def rem_del(a):
   .....:     _,i = np.unique(a[:,0], return_index=True)
   .....:     return np.delete(a, i, axis = 0)
   .....: 
In [226]: def rem_mask(a):
   .....:     _,i = np.unique(a[:,0], return_index=True)
   .....:     m = np.ones(len(a), dtype=bool)
   .....:     m[i] = False
   .....:     return a[m]
   .....: 
In [227]: timeit rem_del(a)
10000 loops, best of 3: 181 us per loop
In [228]: timeit rem_mask(a)
10000 loops, best of 3: 59 us per loop
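To repeat this comparison outside IPython, and on something closer to a large dataset, a script along these lines might help. The synthetic array here (100,000 rows, 1,000 possible keys, fixed seed) is an assumption made up for the sketch, and the numbers it prints will depend on your machine and numpy version, so treat it as a way to measure rather than as a result.

```python
import timeit
import numpy as np

def rem_del(a):
    _, i = np.unique(a[:, 0], return_index=True)
    return np.delete(a, i, axis=0)

def rem_mask(a):
    _, i = np.unique(a[:, 0], return_index=True)
    m = np.ones(len(a), dtype=bool)
    m[i] = False
    return a[m]

# Hypothetical test data: sorted keys in column 0, a running counter in column 1.
rng = np.random.default_rng(0)
keys = np.sort(rng.integers(0, 1000, size=100_000))
big = np.column_stack([keys, np.arange(keys.size)])

# Sanity check: both functions must agree before timing them.
assert np.array_equal(rem_del(big), rem_mask(big))

for fn in (rem_del, rem_mask):
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: fn(big), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```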
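Pass in your list of rows and the index of the column that holds the key. This version builds and returns a new list instead of removing items from the input while iterating over it, since removing during iteration makes the loop skip elements (for example, when a key has only one row).
def getsubset(rows, index):
    seen = set()      # keys whose first row has already been dropped
    result = []
    for row in rows:
        key = row[index]
        if key in seen:
            result.append(row)   # not the first row for this key: keep it
        else:
            seen.add(key)        # first row for this key: drop it
    return result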