
High performance all-to-all comparison of vectors in Python

First, some background: several methods for comparing clusterings rely on so-called pair counting. We have two flat clusterings, a and b, over the same n entities. In pair counting we check, for each possible pair of entities, whether the two belong to the same cluster in both clusterings, to the same cluster in a but different ones in b, the opposite, or to different clusters in both. This way we get 4 counts, let's call them n11, n10, n01, n00. These are the input for various metrics.
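For instance, the Rand index can be computed directly from these four counts, as the fraction of pairs on which the two clusterings agree:

def rand_index(n11, n10, n01, n00):
    # fraction of concordant pairs: same-same plus different-different
    return (n11 + n00) / (n11 + n10 + n01 + n00)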

When the number of entities is around 10,000 and the number of clusterings to compare is dozens or more, performance becomes an issue: the number of pairs is on the order of 10^8 for each comparison, and for an all-to-all comparison of the clusterings this needs to be performed ~10^4 times.

With a naive Python implementation it took practically forever, so I turned to Cython and numpy. This way I could push the time for one comparison down to around 0.9-3.0 seconds, which still means half a day of runtime in my case. I am wondering if you see any possibility for a performance improvement, with some clever algorithm, a C library, or anything else.

Here are my attempts:

1) This one counts without allocating huge arrays for all the pairs; it takes 2 membership vectors a1, a2 of length n and returns the counts:

cimport cython
import numpy as np
cimport numpy as np

ctypedef np.int32_t DTYPE_t

@cython.boundscheck(False)
def pair_counts(
    np.ndarray[DTYPE_t, ndim = 1] a1,
    np.ndarray[DTYPE_t, ndim = 1] a2,
    ):

    cdef unsigned int a1s = a1.shape[0]
    cdef unsigned int a2s = a2.shape[0]

    cdef unsigned int n11, n10, n01, n00
    n11 = n10 = n01 = n00 = 0
    cdef unsigned int i, j, j0  # typing the loop indices keeps the loops in C

    for i in range(0, a1s - 1):
        j0 = i + 1
        for j in range(j0, a2s):
            if a1[i] == a1[j] and a2[i] == a2[j]:
                n11 += 1
            elif a1[i] == a1[j]:
                n10 += 1
            elif a2[i] == a2[j]:
                n01 += 1
            else:
                n00 += 1

    return n11, n10, n01, n00

2) This one first calculates a comembership vector for each of the 2 clusterings (length n * (n - 1) / 2, one element for each pair of entities), then calculates the counts from these vectors. It allocates ~20-40 MB at each comparison but, interestingly, it is faster than the previous one. Note: c is a wrapper class around a clustering; it has the usual membership vector, but also a c.members dict which contains the indices of the entities of each cluster as numpy arrays.

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
def comembership(c):
    """
    Returns comembership vector, where each value tells if one pair
    of entites belong to the same group (1) or not (0).
    """
    cdef int n = len(c.memberships)
    cdef int cnum = c.cnum
    cdef int ri, i, ii, j, li

    cdef unsigned char[:] cmem = \
        np.zeros((int(n * (n - 1) / 2), ), dtype = np.uint8)

    for ci in xrange(cnum):
        # here we use the members dict to have the indices of entities
        # in cluster (ci), as a numpy array (mi)
        mi = c.members[ci]
        for i in xrange(len(mi) - 1):
            ii = mi[i]
            # this is only to convert the indices of an n x n matrix
            # to the indices of a 1 x (n x (n-1) / 2) vector:
            ri = n * ii - 3 * ii / 2 - ii ** 2 / 2 - 1
            for j in mi[i+1:]:
                # also here, adding j only for having the correct index
                li = ri + j
                cmem[li] = 1

    return np.array(cmem)

def pair_counts(c1, c2):
    p1 = comembership(c1)
    p2 = comembership(c2)
    n = len(c1.memberships)

    a11 = p1 * p2

    n11 = a11.sum()
    n10 = (p1 - a11).sum()
    n01 = (p2 - a11).sum()
    n00 = n * (n - 1) // 2 - n10 - n01 - n11  # total number of pairs minus the rest

    return n11, n10, n01, n00
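The index conversion in comembership can be sanity-checked against numpy's upper triangle indices, which enumerate the pairs in the same order:

import numpy as np

# check the condensed index formula: pair (i, j), i < j, of an n x n
# matrix maps to position k in the length n * (n - 1) / 2 vector
n = 5
rows, cols = np.triu_indices(n, k = 1)
for k, (i, j) in enumerate(zip(rows, cols)):
    assert n * i - i * (i + 3) // 2 - 1 + j == k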

3) This is a pure numpy solution that creates an n x n boolean array of the comemberships of the entity pairs. The inputs are the membership vectors (a1, a2).

def pair_counts(a1, a2):

    n = len(a1)
    cmem1 = a1.reshape([n,1]) == a1.reshape([1,n])
    cmem2 = a2.reshape([n,1]) == a2.reshape([1,n])

    # pairs that are comembers in both clusterings; note that
    # (cmem1 == cmem2) would also count the pairs comember in neither
    n11 = int(((cmem1 & cmem2).sum() - n) / 2)
    n10 = int((cmem1.sum() - n) / 2) - n11
    n01 = int((cmem2.sum() - n) / 2) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00
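Note that with n = 14440 each of the two boolean matrices takes n^2 bytes, i.e. about 200 MB:

n = 14440
print(n ** 2 / 2 ** 20)  # ~198.9 MiB per n x n boolean matrix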

Edit: example data

import numpy as np

a1 = np.random.randint(0, 1868, 14440, dtype = np.int32)
a2 = np.random.randint(0, 484, 14440, dtype = np.int32)

# to have the members dicts used in example 2:
def get_cnum(a):
    """
    Returns number of clusters.
    """
    return len(np.unique(a))

def get_members(a):
    """
    Returns a dict with cluster labels as keys and member entities
    as sorted numpy arrays.
    """
    members = {label: [] for label in range(max(a) + 1)}
    for entity, label in enumerate(a):
        members[label].append(entity)
    return {
        label: np.array(sorted(entities), dtype = np.int64)
        for label, entities in members.items()
    }

members1 = get_members(a1)
members2 = get_members(a2)
cnum1 = get_cnum(a1)
cnum2 = get_cnum(a2)
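With this data a single comparison can be timed, e.g. for the pure numpy variant from attempt 3 (assuming that pair_counts is the one in scope):

import timeit

# average of 3 runs of one comparison with the numpy variant
print(timeit.timeit(lambda: pair_counts(a1, a2), number = 3) / 3)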

1 Answer

An approach based on sorting has merit, but it can be done a lot more simply:

def pair_counts(a, b):
    n = a.shape[0]  # also b.shape[0]

    counts_a = np.bincount(a)
    counts_b = np.bincount(b)
    sorter_a = np.argsort(a)

    n11 = 0
    same_a_offset = np.cumsum(counts_a)
    for indices in np.split(sorter_a, same_a_offset):
        b_check = b[indices]
        n11 += np.count_nonzero(b_check == b_check[:,None])

    n11 = (n11-n) // 2
    n10 = (np.sum(counts_a**2) - n) // 2 - n11
    n01 = (np.sum(counts_b**2) - n) // 2 - n11
    n00 = (n**2 - n) // 2 - n11 - n10 - n01  # total pairs n*(n-1)/2 minus the rest

    return n11, n10, n01, n00
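The n10 and n01 lines use the fact that the number of comember pairs within clusters of sizes s is sum(s * (s - 1) / 2) = (sum(s**2) - n) / 2. A quick way to check the function is against a brute-force count on small inputs (pair_counts_brute is just a hypothetical helper for this test):

import numpy as np
from itertools import combinations

def pair_counts_brute(a, b):
    # literally enumerate all pairs; only feasible for small n
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        n11 += same_a and same_b
        n10 += same_a and not same_b
        n01 += not same_a and same_b
        n00 += not (same_a or same_b)
    return n11, n10, n01, n00

a = np.random.randint(0, 5, 100)
b = np.random.randint(0, 3, 100)
assert tuple(pair_counts(a, b)) == pair_counts_brute(a, b)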

If this method is coded efficiently in Cython there's another speedup (probably ~20x) to be gained.
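Untested here, but a sketch of what such a Cython version could look like (pair_counts_cy is hypothetical; it counts the "11" pairs as run lengths of equal (a[i], b[i]) pairs after a lexicographic sort, and takes the remaining counts from bincount as above; assumes n >= 1):

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
def pair_counts_cy(np.ndarray[np.int32_t, ndim=1] a,
                   np.ndarray[np.int32_t, ndim=1] b):
    cdef Py_ssize_t n = a.shape[0]
    # sort primarily by a, secondarily by b
    cdef np.ndarray[np.intp_t, ndim=1] order = np.lexsort((b, a))
    cdef long long n11 = 0, run = 1
    cdef Py_ssize_t i
    cdef np.int32_t pa = a[order[0]], pb = b[order[0]], ca, cb
    for i in range(1, n):
        ca = a[order[i]]
        cb = b[order[i]]
        if ca == pa and cb == pb:
            run += 1
        else:
            # a run of k equal (a, b) pairs contributes k*(k-1)/2
            n11 += run * (run - 1) // 2
            run = 1
        pa = ca
        pb = cb
    n11 += run * (run - 1) // 2

    counts_a = np.bincount(a).astype(np.int64)
    counts_b = np.bincount(b).astype(np.int64)
    n10 = int((np.sum(counts_a ** 2) - n) // 2 - n11)
    n01 = int((np.sum(counts_b ** 2) - n) // 2 - n11)
    n00 = n * (n - 1) // 2 - int(n11) - n10 - n01
    return int(n11), n10, n01, n00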


Edit:

I found a way to completely vectorize the procedure and lower the time complexity:

def sizes2count(a, n):
    return (np.inner(a, a) - n) // 2

def pair_counts_vec_nlogn(a, b):
    # Size of "11" clusters (a[i]==a[j] & b[i]==b[j])
    # the (b.max() + 1) factor keeps the (a, b) encoding unique;
    # make sure (max(a) + 1) * (max(b) + 1) fits the dtype!
    ab = a * (b.max() + 1) + b
    _, sizes = np.unique(ab, return_counts=True)

    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(sizes, n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = (n**2 - n) // 2 - n11 - n10 - n01  # total pairs minus the rest

    return n11, n10, n01, n00

def pair_counts_vec_linear(a, b):
    # Label "11" clusters (a[i]==a[j] & b[i]==b[j])
    ab = a * (b.max() + 1) + b  # again, (b.max() + 1) keeps the encoding unique

    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(np.bincount(ab), n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = (n**2 - n) // 2 - n11 - n10 - n01  # total pairs minus the rest

    return n11, n10, n01, n00

Sometimes the O(n log(n)) algorithm is faster than the linear one, because the linear one needs roughly max(a) * max(b) storage for its bincount. The naming can probably be improved; I'm not really familiar with the terminology.
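As a smoke test, both variants can be run on data shaped like the question's example; they should return identical counts (assuming both functions above are in scope):

import numpy as np

a1 = np.random.randint(0, 1868, 14440, dtype = np.int32)
a2 = np.random.randint(0, 484, 14440, dtype = np.int32)

# (max(a1) + 1) * (max(a2) + 1) is well within int32 range here
assert pair_counts_vec_nlogn(a1, a2) == pair_counts_vec_linear(a1, a2)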
