When I calculate third-order moments of a matrix X with N rows and n columns, I usually use einsum:
M3 = sp.einsum('ij,ik,il->jkl', X, X, X) / N
This usually works fine, but now I am working with bigger values, namely n = 120 and N = 100000, and einsum raises the following error:
ValueError: iterator is too large
The alternative of doing 3 nested loops is infeasible, so I am wondering if there is any kind of alternative.
Note that calculating this requires at least ~n³ × N ≈ 173 billion operations (not counting symmetry), so it will be slow unless numpy has access to a GPU or something. On a modern computer with a ~3 GHz CPU, the whole computation is expected to take about 60 seconds, assuming no SIMD/parallel speedup.
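The arithmetic behind that estimate, as a quick sketch:

```python
# Back-of-the-envelope cost estimate for the full problem size.
n = 120       # number of columns
N = 100_000   # number of rows

ops = n**3 * N     # one multiply-accumulate per (j, k, l, i) tuple
print(ops)         # 172800000000, i.e. ~173 billion

# At ~3e9 scalar operations per second (one ~3 GHz core, no SIMD):
print(ops / 3e9)   # 57.6 seconds
```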
For testing, let's start with N = 1000. We will use this to check correctness and performance:
#!/usr/bin/env python3
import numpy
import time
numpy.random.seed(0)
n = 120
N = 1000
X = numpy.random.random((N, n))
start_time = time.time()
M3 = numpy.einsum('ij,ik,il->jkl', X, X, X)
end_time = time.time()
print('check:', M3[2,4,6], '= 125.401852515?')
print('check:', M3[4,2,6], '= 125.401852515?')
print('check:', M3[6,4,2], '= 125.401852515?')
print('check:', numpy.sum(M3), '= 218028826.631?')
print('total time =', end_time - start_time)
This takes about 8 seconds. This is the baseline.
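Before hand-rolling alternatives, one knob worth trying (a suggestion beyond the original answer) is einsum's optimize flag, available since numpy 1.12. It decomposes the contraction into pairwise, BLAS-backed steps, which can also sidestep the single giant iterator. Note it builds an (N, n, n) intermediate, so memory may still be a limit at the full N:

```python
import numpy
import time

numpy.random.seed(0)
n, N = 120, 1000
X = numpy.random.random((N, n))

start = time.time()
# optimize=True lets einsum pick a pairwise contraction order
# instead of one monolithic N*n*n*n iteration.
M3 = numpy.einsum('ij,ik,il->jkl', X, X, X, optimize=True)
print('time:', time.time() - start)

# Same check value as the baseline script above.
print(numpy.isclose(M3[2, 4, 6], 125.401852515))   # True
```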
Let's start with the three nested loops as the alternative:
M3 = numpy.zeros((n, n, n))
for j in range(n):
    for k in range(n):
        for l in range(n):
            M3[j,k,l] = numpy.sum(X[:,j] * X[:,k] * X[:,l])
# ~27 seconds
This takes roughly half a minute, no good! One reason is that this is actually four nested loops: numpy.sum is itself a loop over N.
We note that the sum can be turned into a dot product to remove this 4th loop:
M3 = numpy.zeros((n, n, n))
for j in range(n):
    for k in range(n):
        for l in range(n):
            M3[j,k,l] = X[:,j] * X[:,k] @ X[:,l]
# ~14 seconds
Much better now, but still slow. Note that the dot product can be changed into a matrix multiplication to remove one more loop:
M3 = numpy.zeros((n, n, n))
for j in range(n):
    for k in range(n):
        M3[j,k] = X[:,j] * X[:,k] @ X
# ~0.5 seconds
Huh? Now this is far more efficient than einsum! We can also check that the answer is indeed correct.
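As a minimal correctness check (the sizes below are illustrative, just small enough to run instantly), we can compare the matmul version against einsum:

```python
import numpy

numpy.random.seed(0)
N, n = 200, 15                      # small sizes, just for checking
X = numpy.random.random((N, n))

ref = numpy.einsum('ij,ik,il->jkl', X, X, X)   # reference answer

M3 = numpy.zeros((n, n, n))
for j in range(n):
    for k in range(n):
        M3[j, k] = X[:, j] * X[:, k] @ X       # matmul version

print(numpy.allclose(M3, ref))                 # True
```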
Can we go further? Yes! We could eliminate the k loop by:
M3 = numpy.zeros((n, n, n))
for j in range(n):
    Y = numpy.repeat(X[:,j], n).reshape((N, n))
    M3[j] = (Y * X).T @ X
# ~0.3 seconds
We could also use broadcasting (i.e. a * [b,c] == [a*b, a*c] for each row of X) to avoid doing the numpy.repeat (thanks @Divakar):
M3 = numpy.zeros((n, n, n))
for j in range(n):
    Y = X[:,j].reshape((N, 1))
    ## or, equivalently:
    # Y = X[:, numpy.newaxis, j]
    M3[j] = (Y * X).T @ X
# ~0.16 seconds
If we scale this to N = 100000, the program should take about 16 seconds, already below the ~60-second scalar estimate (thanks to SIMD and BLAS), so eliminating the j loop may not help much more (and would make the code much harder to understand). We could accept this as the final solution.
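For completeness, here is a sketch (going beyond the steps above) of what eliminating the j loop would look like: form all pairwise column products at once, then do a single matrix multiplication. The (N, n, n) intermediate needs about N·n²·8 bytes of memory, roughly 115 MB at N = 1000 but ~11.5 GB at N = 100000, so this is only practical at moderate N:

```python
import numpy

numpy.random.seed(0)
N, n = 1000, 120
X = numpy.random.random((N, n))

# All pairwise column products via broadcasting:
# shape (N, n, n), flattened to (N, n*n).
P = (X[:, :, None] * X[:, None, :]).reshape(N, n * n)

# One matmul produces every third-order sum at once:
# entry [(j,k), l] = sum_i X[i,j] * X[i,k] * X[i,l].
M3 = (P.T @ X).reshape(n, n, n)

print(M3.shape)   # (120, 120, 120)
```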
Note: if you are using a Python version older than 3.5 (which introduced the @ operator), replace a @ b with a.dot(b).