Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Repeat values n many times, n being in an array

I have an array a of values whose values at each index idx I'd like to repeat a certain amount b[idx] of times, given at the same index idx in another array (b), like such:

a = numpy.array([1, 2, 3 ,4, 5])
b = numpy.array([2, 3, 1, 2, 4])

Desired output:

c = numpy.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5])

I realize I could do something like this:

len_a = numpy.shape(a)[0]
sum_b = sum(b)
c = numpy.zeros((1, 0))
for idx in range(len_a):
    repeated_a = numpy.repeat(a[idx], b[idx])
    repeated_a = numpy.reshape(repeated_a, (1, numpy.shape(repeated_a)[0]))
    c = numpy.hstack((c, repeated_a))

However, looping is not a good option because it is slow. How would I go about making this faster? Some form of vectorization perhaps.

like image 604
spaghettibaguetti Avatar asked Jan 22 '26 06:01

spaghettibaguetti


2 Answers

You are looking for a built-in repeat function made for this purpose. Simply feed both arrays into the function:

np.repeat(a,b)
#[1 1 2 2 2 3 4 4 5 5 5 5]
like image 62
Ehsan Avatar answered Jan 24 '26 20:01

Ehsan


Depending on what you know about your two arrays, you might be able to squeeze out more performance than what the native gives you: Do you know what kind of numbers can appear in the two arrays? Do you know how long the resulting array will be, so you can take advantage of that and initialize that memory up front?

To illustrate what kind of difference this makes, let's just look at what you can squeeze out of Numba with no further tricks:

from numba import njit
import numpy as np

@njit
def f(a, b):
    s = b.sum()
    res = np.empty(s, dtype=np.int64)
    cs = 0
    for i in range(len(a)):
        bi = b[i]
        res[cs:cs+bi] = a[i]
        cs += bi
    return res

Let's use IPython's magic %timeit to see how this compares to np.repeat(a, b):

In [2]: a = np.random.randint(1, 6, 10000)
   ...: b = np.random.randint(1, 6, 10000)

In [3]: %timeit np.repeat(a, b)
135 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit f(a, b)
87 µs ± 485 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So that's already a good improvement in itself. Now if you know what b.sum() is beforehand, you could avoid calculating it explicitly and thus save a bit of time. Or if you know how the values of b are distributed, you can figure out an upper bound of the array size and truncate res[:cs] at the end of the day.

like image 44
fuglede Avatar answered Jan 24 '26 19:01

fuglede



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!