
In Python, is there an efficient way of separating an array with elements mapped to another array?

Let's say I have an arbitrary array np.array([1,2,3,4,5,6]) and another array that maps each element to a group, np.array(['a','b','a','c','c','b']). I want to separate them into three different arrays depending on the label/group given in the second array, so that a, b, c = np.array([1,3]), np.array([2,6]), np.array([4,5]). Is a simple for loop the way to go, or is there some efficient method I'm missing here?

asked Jan 20 '26 by Caesar


2 Answers

When you write efficient, I assume that what you want here is actually fast.

I will briefly discuss asymptotic efficiency. In this context, N refers to the input size and K to the number of unique values.

My approach would be to use a combination of np.argsort() and a custom-built groupby_np() specifically optimized for NumPy inputs:

import numpy as np


def groupby_np(arr, both=True):
    # Yield the boundaries of runs of equal consecutive values in `arr`
    # (assumed to be sorted, so equal values are contiguous).
    n = len(arr)
    # indices where the value changes, i.e. where a new run starts
    extrema = np.nonzero(arr[:-1] != arr[1:])[0] + 1
    if both:
        # yield (start, stop) pairs, one per run
        last_i = 0
        for i in extrema:
            yield last_i, i
            last_i = i
        yield last_i, n
    else:
        # yield only the run boundaries, including 0 and n
        yield 0
        yield from extrema
        yield n


def labeling_groupby_np(values, labels):
    # sort values and labels together so that equal labels are contiguous
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    # each (i, j) pair delimits one group; slicing only creates a view
    for i, j in groupby_np(sorted_labels, True):
        result[sorted_labels[i]] = sorted_values[i:j]
    return result

This has complexity O(N log N + K). The N log N comes from the sorting step and the K comes from the final loop. The interesting part is that both the N-dependent and the K-dependent steps are fast: the N-dependent part is executed at a low level, and each iteration of the K-dependent loop is O(1), since slicing a NumPy array only creates a view.
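As a quick sanity check, the snippet below reproduces the grouping from the question (the definitions are repeated here, with groupby_np reduced to its both=True path, so that it runs standalone):

```python
import numpy as np


def groupby_np(arr):
    # yield (start, stop) pairs for runs of equal consecutive values
    extrema = np.nonzero(arr[:-1] != arr[1:])[0] + 1
    last_i = 0
    for i in extrema:
        yield last_i, i
        last_i = i
    yield last_i, len(arr)


def labeling_groupby_np(values, labels):
    # sort so equal labels become contiguous, then slice out each run
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    result = {}
    for i, j in groupby_np(sorted_labels):
        result[sorted_labels[i]] = sorted_values[i:j]
    return result


values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
grouped = labeling_groupby_np(values, labels)
for k in sorted(grouped):
    print(k, np.sort(grouped[k]))
# a [1 3]
# b [2 6]
# c [4 5]
```

(np.sort is used for display only, because argsort is not guaranteed to be stable by default, so the order within a group is unspecified.)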


A solution like the following (very similar to @theEpsilon's answer):

import numpy as np


def labeling_loop(values, labels):
    labeled = {}
    for x, l in zip(values, labels):
        if l not in labeled:
            labeled[l] = [x]
        else:
            labeled[l].append(x)
    return {k: np.array(v) for k, v in labeled.items()}

uses two loops and has complexity O(N + K). I do not think you can easily avoid the second loop (without a significant speed penalty). As for the first loop, it is executed in Python, which carries a significant speed penalty of its own.
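A minor variant of the same idea (a sketch of mine, not from any of the answers) uses collections.defaultdict to drop the explicit membership test; the asymptotic picture is unchanged:

```python
from collections import defaultdict

import numpy as np


def labeling_loop_dd(values, labels):
    # first loop: O(N), executed in Python
    labeled = defaultdict(list)
    for x, l in zip(values, labels):
        labeled[l].append(x)
    # second loop: O(K) dict entries, each converted to an array
    return {k: np.array(v) for k, v in labeled.items()}


values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
print(labeling_loop_dd(values, labels)['a'])  # [1 3]
```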


Another possibility is to use np.unique(), which brings the main loop to a lower level. However, this brings other challenges: once the unique values are extracted, there is no efficient way of constructing the arrays you want without NumPy advanced indexing, which is O(N) per label. The overall complexity of these solutions is then O(K * N), but because the advanced indexing is done at a lower level, this can lead to a relatively fast solution, although with a worse asymptotic complexity than the alternatives.

Possible implementations include (similar to @AjayVerma's and @AKX's answers):

import numpy as np


def labeling_unique_bool(values, labels):
    return {l: values[l == labels] for l in np.unique(labels)}

and:

import numpy as np


def labeling_unique_nonzero(values, labels):
    return {l: values[np.nonzero(l == labels)] for l in np.unique(labels)}

Additionally, one could consider a pre-sorting step to speed up the slicing part by avoiding NumPy advanced indexing. However, the sorting step can be more costly than the advanced indexing, and in general the approach proposed above tends to be faster for the inputs I tested.

import numpy as np


def labeling_unique_argsort(values, labels):
    uniques, counts = np.unique(labels, return_counts=True)
    sorted_values = values[labels.argsort()]
    bound = 0
    result = {}
    for x, c in zip(uniques, counts):
        result[x] = sorted_values[bound:bound + c]
        bound += c
    return result

Another approach, which is neat in principle (the same idea as my proposed approach) but slow in practice, would be to use sorting and itertools.groupby():

import itertools
from operator import itemgetter

import numpy as np


def labeling_groupby(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for x, g in itertools.groupby(zip(sorted_labels, sorted_values), itemgetter(0)):
        result[x] = np.fromiter(map(itemgetter(1), g), dtype=sorted_values.dtype)
    return result

Finally, a Pandas based approach, which is quite concise and reasonably fast for larger inputs, but under-performing for smaller ones (similar to @Ehsan's answer):

import pandas as pd


def labeling_groupby_pd(values, labels):
    df = pd.DataFrame({'values': values, 'labels': labels})
    # bracket indexing selects the column; attribute access would clash
    # with the DataFrame's own `.values` attribute
    return df.groupby('labels')['values'].apply(lambda x: x.values).to_dict()
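For reference, a quick check of the Pandas route on the question's input (bracket indexing with ['values'] is used to select the column, since attribute access would clash with the DataFrame's own .values attribute):

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
df = pd.DataFrame({'values': values, 'labels': labels})
# groupby preserves the original row order within each group
grouped = df.groupby('labels')['values'].apply(lambda x: x.values).to_dict()
print(grouped['a'])  # [1 3]
```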

Now, talking is cheap, so let us attach some numbers to fast and slow and produce some plots for varying input sizes. The value of K is capped at 52 (the lower- and upper-case letters of the English alphabet). When N is much larger than K, the probability of reaching the cap is very high.

Input is generated programmatically with the following:

import string

import numpy as np


def gen_input(n, p, labels=string.ascii_letters):
    k = len(labels)
    values = np.arange(n)
    labels = np.array([labels[i] for i in np.random.randint(0, int(k * p), n)])
    return values, labels

and the benchmarks are produced for values of p in (1.0, 0.5, 0.1, 0.05), which change the maximum value of K. The plots below refer to these p values, in that order.
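A minimal timing harness in this spirit can be sketched as follows (the benchmarked function and the sizes are illustrative; the actual benchmarks cover a grid of sizes and all the approaches above):

```python
import string
import timeit

import numpy as np


def gen_input(n, p, alphabet=string.ascii_letters):
    # values are 0..n-1; labels are drawn from the first int(k * p) letters
    k = len(alphabet)
    idx = np.random.randint(0, max(1, int(k * p)), n)
    return np.arange(n), np.array([alphabet[i] for i in idx])


def labeling_unique_bool(values, labels):
    # one boolean mask per unique label: O(K * N) overall
    return {l: values[l == labels] for l in np.unique(labels)}


values, labels = gen_input(10_000, 1.0)
elapsed = timeit.timeit(lambda: labeling_unique_bool(values, labels), number=10)
print(f"labeling_unique_bool: {elapsed / 10:.6f} s per call")
```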


p=1.0 (at most K = 52)

bm_p100

...and zoomed on the fastest

bm_p100_zoom


p=0.5 (at most K = 26)

bm_p50


p=0.1 (at most K = 5)

bm_p10


p=0.05 (at most K = 2)

bm_p5

...and zoomed on the fastest

bm_p5_zoom


One can see that the proposed method, except for very small inputs, outperforms the other approaches for the tested inputs.

(Full benchmarks available here).


One may also consider moving some parts of the looping to Numba / Cython, but I'd leave this to the interested reader.

answered Jan 21 '26 by norok2


You can use numpy.unique:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
print({value: x[y == value] for value in np.unique(y)})

Output

{'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
answered Jan 21 '26 by Ajay Verma


