Repeat values n many times, n being in an array

Question

I have an array a of values whose values at each index idx I'd like to repeat a certain amount b[idx] of times, given at the same index idx in another array (b), like such:

a = numpy.array([1, 2, 3 ,4, 5])
b = numpy.array([2, 3, 1, 2, 4])

Desired output:

c = numpy.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5])

I realize I could do something like this:

len_a = numpy.shape(a)[0]
sum_b = sum(b)
c = numpy.zeros((1, 0))
for idx in range(len_a):
    repeated_a = numpy.repeat(a[idx], b[idx])
    repeated_a = numpy.reshape(repeated_a, (1, numpy.shape(repeated_a)[0]))
    c = numpy.hstack((c, repeated_a))

However, looping is not a good option because it is slow. How would I go about making this faster? Some form of vectorization perhaps.

Ehsan · Accepted Answer

You are looking for a built-in repeat function made for this purpose. Simply feed both arrays into the function:

np.repeat(a,b)
#[1 1 2 2 2 3 4 4 5 5 5 5]

fuglede · Answer

Depending on what you know about your two arrays, you might be able to squeeze out more performance than what the native gives you: Do you know what kind of numbers can appear in the two arrays? Do you know how long the resulting array will be, so you can take advantage of that and initialize that memory up front?

To illustrate what kind of difference this makes, let's just look at what you can squeeze out of Numba with no further tricks:

from numba import njit
import numpy as np

@njit
def f(a, b):
    s = b.sum()
    res = np.empty(s, dtype=np.int64)
    cs = 0
    for i in range(len(a)):
        bi = b[i]
        res[cs:cs+bi] = a[i]
        cs += bi
    return res

Let's use IPython's magic %timeit to see how this compares to np.repeat(a, b):

In [2]: a = np.random.randint(1, 6, 10000)
   ...: b = np.random.randint(1, 6, 10000)

In [3]: %timeit np.repeat(a, b)
135 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit f(a, b)
87 µs ± 485 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So that's already a good improvement in itself. Now if you know what b.sum() is beforehand, you could avoid calculating it explicitly and thus save a bit of time. Or if you know how the values of b are distributed, you can figure out an upper bound of the array size and truncate res[:cs] at the end of the day.

Repeat values n many times, n being in an array

Tags:

python

vectorization

numpy

spaghettibaguetti

2 Answers

Ehsan

fuglede

Recent Activity

Donate For Us

Repeat values n many times, n being in an array

Tags:

python

vectorization

numpy

spaghettibaguetti

2 Answers

Ehsan

fuglede

Related questions

Recent Activity

Donate For Us