Given two large arrays,
import numpy as np

A = np.random.randint(10,size=(10000,2))
B = np.random.randint(10,size=(10000,2))
I would like to determine whether any pair of vectors (one row from A, one row from B) has a cross product of zero. We could do
C = np.cross(A[:,None,:],B[None,:,:])
and then check if C contains a 0 or not.
not C.all()
However, this process requires calculating all the cross products, which can be time-consuming. Instead, I would prefer to let numpy perform the cross product, but IF a zero is reached at any point, then simply cut the whole operation short and end early. Does numpy have such an "early termination" operation that will cut numpy operations early if they reach a condition? Something like,
np.allfunc()
np.anyfunc()
The example above is such a case: A and B have an extremely high likelihood of having a zero cross product at some point (in fact it is very likely to occur near the start), so much so that performing a python for-loop (yuck!) is much faster than using numpy's highly optimized code.
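For reference, the kind of early-exiting pure-Python loop I mean is something along these lines:

def any_zero_cross_python(A, B):
    for ax, ay in A:
        for bx, by in B:
            if ax * by == ay * bx:   # 2D cross product is zero
                return True          # stop as soon as one is found
    return False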
In general, what is the fastest way to determine if A and B have a zero cross product?
Does numpy have such an "early termination" operation that will cut numpy operations early if they reach a condition?
To put it simply, no, Numpy does not. In practice, though, it is a bit more complex than that.
In fact, the allfunc and anyfunc functions you propose are similar to np.all and np.any. Such functions do perform some kind of early termination. Here is evidence of that on my machine:
v = np.random.randint(0, 2, 200*1024*1024) > 0
%timeit -n 10 v.all()
# 2.74 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

v[0:v.size//16] = True
%timeit -n 10 v.all()
# 3.16 ms ± 46.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

v[0:v.size//8] = True
%timeit -n 10 v.all()
# 3.72 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

v[0:v.size//4] = True
%timeit -n 10 v.all()
# 4.64 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

v[0:v.size//2] = True
%timeit -n 10 v.all()
# 6.67 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

v[0:v.size] = True
%timeit -n 10 v.all()
# 10.5 ms ± 69.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The first timing (2.74 ms) is so fast that Numpy cannot possibly be iterating over the whole array: the memory throughput would have to reach 71 GiB/s, while my RAM has a theoretical maximum bandwidth of 47.7 GiB/s (about 40 GiB/s in practice). Each boolean takes 1 byte, so the array occupies 200 MiB in RAM, and my CPU cache is far from big enough to hold a significant part of it. Strangely, the speed of all and any is still quite disappointing in that first case, since v[2] is already False in practice. Numpy certainly performs the computation chunk by chunk and only checks for early termination after computing each chunk (if needed).
The problem with any and all is that they take a boolean array as a parameter. Thus, the boolean array needs to be entirely computed first, which defeats the benefit of any early termination.
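Concretely, in the naive approach from the question, the whole intermediate array is materialized before all() sees a single element:

# The full (10000, 10000) array of cross products (about 10^8 values) is
# computed and stored first; .all() only runs on this already-built array,
# so its own early exit cannot save any of the np.cross work.
C = np.cross(A[:, None, :], B[None, :, :])
has_zero = not C.all()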
Alternatively, Numpy provides ufuncs, which are designed to operate on arrays in an element-by-element fashion; every ufunc also exposes methods such as np.ufunc.reduce. They could be used to compose algorithms in such a way that early termination would theoretically be possible. This approach would be significantly faster than usual non-ufunc operations when termination is possible pretty early. However, when termination only becomes possible late, ufuncs would be much slower than non-ufunc operations. In fact, ufuncs are (currently) fundamentally inefficient compared to non-ufunc operations, because they require an expensive (native) function call per item. This function call, applied to scalar items, prevents several critical optimizations like the use of SIMD instructions. Still, ufuncs can often be faster than using pure-Python code element by element thanks to "vectorization". The way ufuncs can currently be composed is, AFAIK, not sufficient in your use case to write efficient code that benefits from early termination.
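As an illustration (a hypothetical sketch, not part of the solution below), a custom ufunc built with np.frompyfunc pays a Python-level call per element, which makes the per-item overhead described above very visible, and it still offers no early exit:

import numpy as np

def _pair_has_nonzero_cross(ax, ay, bx, by):
    # 2D cross product of (ax, ay) and (bx, by)
    return ax * by - ay * bx != 0

# A ufunc that calls the Python function above once per pair of elements.
pair_has_nonzero_cross = np.frompyfunc(_pair_has_nonzero_cross, 4, 1)

A = np.random.randint(10, size=(1000, 2))
B = np.random.randint(10, size=(1000, 2))

# Broadcasting still evaluates every pair; there is no early termination.
ok = pair_has_nonzero_cross(A[:, None, 0], A[:, None, 1],
                            B[None, :, 0], B[None, :, 1])
has_zero = not ok.all()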
The standard way to get early termination in Numpy is to split the computation into many chunks (manually). This solution is generally fast because it is vectorized, often SIMD-friendly, and cache-efficient (assuming the chunks are small enough). The overhead of Numpy stays small as long as the chunks are big enough. Here is an example:
def compute_by_chunk(A, B):
    chunkSize = 256
    # Iterate over row chunks of A and B (len(...) is the number of rows)
    for i in range(0, len(A), chunkSize):
        for j in range(0, len(B), chunkSize):
            # Cross products of all pairs within the current pair of chunks
            C = np.cross(A[i:i+chunkSize, None, :], B[None, j:j+chunkSize, :])
            if not C.all():
                # print(f'Zero found in chunk ({i},{j})')
                return True
    return False
For your example input, it takes 239 µs on my machine, while the naive Numpy implementation takes 234 ms. That is about 1000 times faster in this case. With smaller chunks, like 32x32, the computation is even faster: 32.3 µs.
Smaller chunks are better when zeros are plentiful, since most of the time is then spent computing the first chunk before the early exit. 256x256 should be relatively good on most machines (rather cache-friendly, memory-efficient, and with low Numpy overhead).
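For instance, a quick way to try it on the arrays from the question (a hypothetical session; timings will vary from machine to machine):

A = np.random.randint(10, size=(10000, 2))
B = np.random.randint(10, size=(10000, 2))

print(compute_by_chunk(A, B))     # almost certainly True for this input
# %timeit compute_by_chunk(A, B)  # see the timings quoted above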
An even simpler solution (which is also generally faster) is to use modules/tools like Cython or Numba that compile your code, so loops can be fast thanks to compilation to native functions. However, with this solution you need to reimplement the cross product yourself: Numba does not support np.cross yet, and Cython will not be faster (than pure-Python code using Numpy) if you just call Numpy functions in loops.
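A minimal Numba sketch, assuming 2D input vectors as in the question (the cross product is written out by hand and the loops exit as soon as a zero is found), could look like this:

import numba as nb
import numpy as np

@nb.njit(cache=True)
def has_zero_cross(A, B):
    # Hand-written 2D cross product, since Numba does not support np.cross.
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            if A[i, 0] * B[j, 1] - A[i, 1] * B[j, 0] == 0:
                return True   # early termination as soon as a zero is found
    return False

# For inputs like those in the question, the loop typically returns after
# only a handful of iterations.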