Consider sequences of integers like (1, 0, 2, 4, ..., 100). We have many(!) of them, and we need to hash them very fast in Python - so I am looking for methods based on numpy/torch, i.e. without slow Python loops.
A simple and efficient method that works is to generate a random integer vector and take dot products with it, which can be implemented as a single matrix product:
np.matmul(array_of_sequences, random_int_vector)
Question 1: any ideas to do better?
Any comments/ideas/suggestions are welcome !
C++ people suggested something like this to me: https://www.kaggle.com/competitions/santa-2023/discussion/466399#2600835 but I am not sure it is relevant for Python.
One of the troubles with the method above is that it cannot work on an old GPU like the P100, which does not support multiplication of integer matrices. And when the integers are converted to floats, we have another trouble: float multiplication may lose precision and the hashing will not work correctly. For float hashes, the reasons are explained here: https://www.kaggle.com/competitions/santa-2023/discussion/468370 . But for integer hashes there is also a problem, which is more surprising.
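For example, here is a small made-up illustration of the precision problem (toy values, not from the competition data): once a dot product exceeds the 24-bit integer range that float32 represents exactly, two distinct sequences can collide.

import numpy as np

v = np.array([1, 1], dtype=np.int64)       # toy "random" hash vector
a = np.array([2**25, 0], dtype=np.int64)   # two different sequences
b = np.array([2**25, 1], dtype=np.int64)

print(a @ v, b @ v)    # 33554432 33554433 -> distinct integer hashes
print(a.astype(np.float32) @ v.astype(np.float32),
      b.astype(np.float32) @ v.astype(np.float32))    # 33554432.0 33554432.0 -> collision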
This answer provides faster alternatives to np.matmul while using the same approach.
Here is the initial setup used to benchmark the code:
import numpy as np  # Version: 1.24.3

# Note:
# Items of random_int_vector and array_of_sequences must be 32-bit integers.
# np.random.randint and np.array may produce 64-bit integers on some machines
# (e.g. on Linux), so the dtype is forced explicitly below.
n = 50
seq_count = 10_000_000
random_int_vector = np.random.randint(0, 1000, n, dtype=np.int32)
array_of_sequences = np.array([np.random.permutation(n) for i in range(seq_count)], dtype=np.int32)
First of all, np.matmul is clearly not optimal for integer arrays. Indeed, Numpy calls highly optimized BLAS primitives for floating-point arrays, but so far it uses its own sub-optimal implementation for integer arrays. This implementation is sequential (like all Numpy internal functions, not to be confused with the BLAS ones). A faster approach consists of calling np.einsum instead, which is usually a bit faster (but still sequential on my machine). Here is an example:
# 532 ms ± 564 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.matmul(array_of_sequences, random_int_vector)

# 228 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.einsum('ij,j', array_of_sequences, random_int_vector)
While np.einsum is better, it is still sub-optimal and uses only one thread. We can speed up the computation using more efficient compiled code and multiple threads thanks to Numba. Here is the resulting code:
import numba as nb  # Version: 0.58.1

@nb.njit('(int32[:,::1], int32[::1])', parallel=True)
def faster_int_matmul(array_of_sequences, random_int_vector):
    seq_count, n = array_of_sequences.shape
    out = np.empty(seq_count, dtype=np.int32)
    # Parallel loop over the sequences; each iteration computes one dot product
    for i in nb.prange(seq_count):
        s = 0
        for j in range(n):
            s += array_of_sequences[i, j] * random_int_vector[j]
        out[i] = s
    return out

# Sequential: 127 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Parallel:   58.1 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
faster_int_matmul(array_of_sequences, random_int_vector)
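As a quick sanity check (not part of the benchmark; it assumes the int32 setup above, where the dot products easily fit in 32 bits), the result can be compared with np.matmul:

ref = np.matmul(array_of_sequences, random_int_vector)
assert np.array_equal(faster_int_matmul(array_of_sequences, random_int_vector), ref)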
This solution is nearly optimal for 32-bit integer arrays on my machine. Indeed, the time just to read array_of_sequences and store the output on a 40 GiB/s RAM like mine is at least 48.4 ms (the optimal time on my CPU). In practice, mixed reads/stores often make it hard for the CPU to saturate the RAM, and overheads like page faults also prevent the RAM from being fully saturated. In the end, this Numba implementation reaches 83% of my RAM bandwidth. Put shortly, it is memory-bound. This explains why the parallel implementation is only 2.2 times faster on my 6-core CPU (i5-9600KF).
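For reference, the 48.4 ms lower bound can be reproduced from the sizes above, assuming write-allocation (i.e. output cache lines are read before being written), which is how the figure appears to be obtained:

bytes_read    = seq_count * n * 4   # array_of_sequences, int32
bytes_written = seq_count * 4 * 2   # output, int32, read + written due to write-allocate
print((bytes_read + bytes_written) / (40 * 1024**3))   # ~0.0484 s, i.e. ~48.4 ms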
The only way to make this significantly faster is to operate on smaller integers, like 16-bit integers. That being said, the numbers then need to fit in 16-bit integers. Moreover, in practice Numba generates significantly less efficient code in this case (taking 93.3 ms), so it is not worth it. This seems to be due to missed/inefficient use of SIMD instructions. While a lower-level C implementation can certainly fix this issue, one should keep in mind that the speed-up cannot exceed 2x (since 16-bit integers are half the size and the operation was already memory-bound with 32-bit ones).
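For completeness, here is a sketch of what such a 16-bit variant could look like (hypothetical code, not the measured implementation; as noted above, the 16-bit Numba version was slower than the 32-bit one in practice, and the accumulator is widened to 32 bits to avoid overflow):

# 16-bit input variant (sketch): inputs are int16, accumulation is done in int32
@nb.njit('(int16[:,::1], int16[::1])', parallel=True)
def faster_int16_matmul(array_of_sequences, random_int_vector):
    seq_count, n = array_of_sequences.shape
    out = np.empty(seq_count, dtype=np.int32)
    for i in nb.prange(seq_count):
        s = np.int32(0)
        for j in range(n):
            s += np.int32(array_of_sequences[i, j]) * np.int32(random_int_vector[j])
        out[i] = s
    return out

# Would be called with int16 inputs, e.g.:
# faster_int16_matmul(array_of_sequences.astype(np.int16), random_int_vector.astype(np.int16))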
The main 32-bit Numba implementation above should remain efficient for significantly bigger n, as long as random_int_vector fits in the L1/L2 cache. This means it should be fast for n < 10_000 (at least on all mainstream CPUs).