
Most frequent occurrence (mode) of numpy array values based on IDs in another array

I have a 2-D array containing values and I would like to calculate the most frequent entry (i.e., the mode) from this data according to IDs in a second array.

data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])


ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])


#Expected Result is:

[10 50 80]

The most frequent value in data array for ID=1 is 10, ID=2 is 50 and ID=3 is 80. I've been playing around with np.unique and combinations of np.bincount and np.argmax but I can't figure out how to get the result. Any help?

Pamela G asked Dec 18 '25 21:12


1 Answer

Here is one vectorized way to do it. It assumes integer data, and it works as long as the number of distinct values times the number of groups is small enough for a counts table of that size to fit in memory.

import numpy as np

# Input data
data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])
ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])
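# Find unique data values and group ids, with inverse indices into them.
# The inputs are flattened first so the inverse indices are always 1-D
# (np.bincount below needs 1-D input, and newer NumPy versions return
# the inverse in the shape of the input).
data_uniq, data_idx = np.unique(data.ravel(), return_inverse=True)
id_uniq, id_idx = np.unique(ID.ravel(), return_inverse=True)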
# Number of unique data values
n = len(data_uniq)
# Number of ids
m = len(id_uniq)
# Change indices so values of each group are within separate intervals
grouped = data_idx + (n * np.arange(m))[id_idx]
# Count repetitions and reshape
# counts[i, j] is the number of occurrences of the j-th value in the i-th group
counts = np.bincount(grouped, minlength=n * m).reshape(m, n)
# Get the modes from the counts
modes = data_uniq[counts.argmax(1)]
# Print result
for group, mode in zip(id_uniq, modes):
    print(f'Mode of {group}: {mode}')

Output:

Mode of 1: 10
Mode of 2: 50
Mode of 3: 80
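
To make the index-shifting step more concrete, here is a small sketch that prints the intermediate counts table for the example above. It uses the same idea as the code above, written as an explicit flat index (group position times n, plus value position); the variable name flat is only illustrative.

```python
import numpy as np

data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])
ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])

# Flatten first so the inverse indices are 1-D
data_uniq, data_idx = np.unique(data.ravel(), return_inverse=True)
id_uniq, id_idx = np.unique(ID.ravel(), return_inverse=True)
n, m = len(data_uniq), len(id_uniq)

# Each (group, value) pair gets its own slot: group position * n + value position
flat = id_idx * n + data_idx
counts = np.bincount(flat, minlength=m * n).reshape(m, n)

print(data_uniq)
# [ 0 10 50 80 90]
print(counts)
# [[1 5 1 0 0]
#  [0 0 4 2 0]
#  [0 0 0 5 2]]
print(data_uniq[counts.argmax(axis=1)])
# [10 50 80]
```

Each row of counts is a histogram for one group, so argmax along axis 1 picks the position of the most frequent value. When two values tie, argmax returns the first maximum, which here means the smallest value.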

A quick benchmark for a particular problem size:

import numpy as np
import scipy.stats

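def find_group_modes_loop(data, ID):
    # Assume ids are given sequentially starting from 1
    m = ID.max()
    modes = np.empty(m, dtype=data.dtype)
    for group in range(m):
        # scipy.stats.mode returns an array in older SciPy versions and a
        # scalar in newer ones; np.ravel(...)[0] handles both
        modes[group] = np.ravel(scipy.stats.mode(data[ID == group + 1])[0])[0]
    return modes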

def find_group_modes_vec(data, ID):
    # Assume ids are given sequentially starting from 1
    data_uniq, data_idx = np.unique(data.ravel(), return_inverse=True)
    # Number of unique data values
    n = len(data_uniq)
    # Number of groups (ids run from 1 to ID.max())
    m = ID.max()
    grouped = data_idx + (n * np.arange(m))[ID.ravel() - 1]
    counts = np.bincount(grouped, minlength=n * m).reshape(m, n)
    return data_uniq[counts.argmax(1)]

# Make data
np.random.seed(0)
data = np.random.randint(0, 1_000, size=10_000_000)
ID = np.random.randint(1, 100, size=10_000_000)
print(np.all(find_group_modes_loop(data, ID) == find_group_modes_vec(data, ID)))
# True
%timeit find_group_modes_loop(data, ID)
# 212 ms ± 647 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit find_group_modes_vec(data, ID)
# 122 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In this benchmark the vectorized solution is roughly 1.7 times faster than the loop (122 ms versus 212 ms), so at least for some problem sizes it can be noticeably faster than looping.
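
If the number of distinct values times the number of groups is too large for a dense counts table, one alternative is to count unique (ID, value) pairs instead. This is not the method benchmarked above, only a rough sketch assuming integer data and IDs; the function name find_group_modes_sparse is made up for illustration, and it has not been timed against the other two versions.

```python
import numpy as np

def find_group_modes_sparse(data, ID):
    """Per-group mode without building a dense (groups x values) table."""
    # One row per element: (id, value)
    pairs = np.stack([np.ravel(ID), np.ravel(data)], axis=1)
    # Unique (id, value) rows and how often each occurs
    uniq_pairs, counts = np.unique(pairs, axis=0, return_counts=True)
    ids, values = uniq_pairs[:, 0], uniq_pairs[:, 1]
    # Sort by id, then by descending count, then by ascending value
    # (np.lexsort treats the last key as the primary one)
    order = np.lexsort((values, -counts, ids))
    ids, values = ids[order], values[order]
    # The first row of each id is now its most frequent value
    first = np.r_[True, ids[1:] != ids[:-1]]
    return ids[first], values[first]

data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])
ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])

ids, modes = find_group_modes_sparse(data, ID)
print(ids)    # [1 2 3]
print(modes)  # [10 50 80]
```

Memory use here grows with the number of distinct (ID, value) pairs that actually occur rather than with every possible combination. Ties are broken in favor of the smallest value, matching the argmax behavior of the dense version.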

jdehesa answered Dec 21 '25 13:12