Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Return array of counts for each feature of input

I have an array of integer labels and I would like to determine how many of each label is present and store those values in an array of the same size as the input. This can be accomplished with the following loop:

def counter(labels):
    sizes = numpy.zeros(labels.shape)
    for num in numpy.unique(labels):
        mask = labels == num
        sizes[mask] = numpy.count_nonzero(mask)
return sizes

with input:

array = numpy.array([
       [0, 1, 2, 3],
       [0, 1, 1, 3],
       [3, 1, 3, 1]])

counter() returns:

array([[ 2.,  5.,  1.,  4.],
       [ 2.,  5.,  5.,  4.],
       [ 4.,  5.,  4.,  5.]])

However, for large arrays, with many unique labels, 60,000 in my case, this takes a considerable amount time. This is the first step in a complex algorithm and I can't afford to spend more than about 30 seconds on this step. Is there a function that already exists that can accomplish this? If not, how can I speed up the existing loop?

like image 956
asheets Avatar asked Jan 27 '26 15:01

asheets


1 Answers

Approach #1

Here's one using np.unique -

_, tags, count = np.unique(labels, return_counts=1, return_inverse=1)
sizes = count[tags]

Approach #2

With positive numbers in labels, simpler and more efficient way with np.bincount -

sizes = np.bincount(labels)[labels]

Runtime test

Setup with 60,000 unique positive numbers and two such sets of lengths 100,000 and 1000,000 are timed.

Set #1 :

In [192]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(100000))

In [193]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 2.32 s per loop

In [194]: %timeit np.bincount(labels)[labels]
1000 loops, best of 3: 376 µs per loop

In [195]: 2320/0.376 # Speedup figure
Out[195]: 6170.212765957447

Set #2 :

In [196]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(1000000))

In [197]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 43.6 s per loop

In [198]: %timeit np.bincount(labels)[labels]
100 loops, best of 3: 5.15 ms per loop

In [199]: 43600/5.15 # Speedup figure
Out[199]: 8466.019417475727
like image 159
Divakar Avatar answered Jan 29 '26 05:01

Divakar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!