I have a 3-dimensional array (np.ndarray) that contains mostly zeros. I want to sum it over the first dimension, but this is rather slow. I have looked into csr_matrix, but csr does not support 3-dimensional arrays. Is there a quicker way to sum an almost-sparse ndarray? Below is an excerpt from my current code.
Related question: sparse 3d matrix/array in Python? (it creates a home-made sparse ndarray class; overkill?)
r = np.array([[[1, 0, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 1, 0]],
              [[0, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 2, 0]],
              [[0, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]],
              [[0, 0, 0, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]], dtype=int)
np.sum(r, axis=0)
Out[35]:
array([[1, 2, 0, 1],
       [1, 0, 0, 1],
       [0, 0, 3, 0]])
EDIT
After hpaulj's answer below, I did some more timing tests (see below). It seems that reshaping does not help the sum much, while converting to csr_matrix and back to numpy kills performance. I am still thinking about using the indices directly (below called rand_persons, rand_articles and rand_days), since in my original problem I build the big ndarray from these indices anyway.
from timeit import timeit
from scipy.sparse import csr_matrix
import numpy as np
def create_test_data():
    '''
    dtype = int64
    1% nonzero, 1000x1000x100: 1.3 s
    1% nonzero, 10000x1000x100: 13.3 s
    0.1% nonzero, 10000x1000x100: 2.7 s
    1ppm nonzero, 10000x1000x100: 0.007 s
    '''
    global purchases
    N_persons = 10000
    N_articles = 1000
    N_days = 100
    purchases = np.zeros(shape=(N_days, N_persons, N_articles), dtype=int)
    N_elements = N_persons * N_articles * N_days
    # size must be an integer; N_elements / 1e6 is a float and raises a TypeError
    n_nonzero = N_elements // 10**6
    rand_persons = np.random.choice(a=N_persons, size=n_nonzero, replace=True)
    rand_articles = np.random.choice(a=N_articles, size=n_nonzero, replace=True)
    rand_days = np.random.choice(a=N_days, size=n_nonzero, replace=True)
    for (i, j, k) in zip(rand_persons, rand_articles, rand_days):
        purchases[k, i, j] += 1
def sum_over_first_dim_A():
    '''
    0.1% nonzero, 10000x1000x99: 1.57 s (average over 10)
    1ppm nonzero, 10000x1000x99: 1.70 s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    np.sum(d, axis=0)
def sum_over_first_dim_B():
    '''
    0.1% nonzero, 10000x1000x99: 1.55 s (average over 10)
    1ppm nonzero, 10000x1000x99: 1.37 s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    (N_days, N_persons, N_articles) = d.shape
    d.reshape(N_days, -1).sum(0).reshape(N_persons, N_articles)
def sum_over_first_dim_C():
    '''
    0.1% nonzero, 10000x1000x99: 7.54 s (average over 10)
    1ppm nonzero, 10000x1000x99: 7.44 s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    (N_days, N_persons, N_articles) = d.shape
    r = csr_matrix(d.reshape(N_days, -1))
    t = r.sum(axis=0)
    np.reshape(t, newshape=(N_persons, N_articles))
if __name__ == '__main__':
    print(timeit(create_test_data, number=10))
    print(timeit(sum_over_first_dim_A, number=10))
    print(timeit(sum_over_first_dim_B, number=10))
    print(timeit(sum_over_first_dim_C, number=10))
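Since the index triples are available before the big ndarray is even built, the summed 2-d result can be accumulated from them directly, skipping the dense 3-d array entirely. A sketch of that idea (the helper name and signature are mine, not from the original code; it reuses the rand_persons/rand_articles/rand_days naming and restricts to the first 99 days, as the timing functions do):

```python
import numpy as np

def sum_over_first_dim_from_indices(rand_persons, rand_articles, rand_days,
                                    N_persons, N_articles, max_day=99):
    # keep only the purchases that fall inside the summed day range
    mask = rand_days < max_day
    total = np.zeros((N_persons, N_articles), dtype=int)
    # np.add.at accumulates repeated (person, article) pairs correctly,
    # unlike plain fancy-index assignment, which would only count each pair once
    np.add.at(total, (rand_persons[mask], rand_articles[mask]), 1)
    return total
```

This touches only the nonzero entries, so its cost scales with the number of purchases rather than with N_days * N_persons * N_articles.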
EDIT 2
I have now found a faster way to do the summation: I make a numpy array of sparse matrices. However, there is still some time spent in the initial creation of these matrices, which I currently do with a loop. Is there a way to speed this up?
from scipy.sparse import lil_matrix

def create_test_data():
    [ ... ]
    '''
    0.1% nonzero, 10000x1000x100: 2.1 s
    1ppm nonzero, 10000x1000x100: 0.45 s
    '''
    global sp_purchases
    sp_purchases = np.empty(N_days, dtype=lil_matrix)
    for i in range(N_days):
        sp_purchases[i] = lil_matrix((N_persons, N_articles))
    for (i, j, k) in zip(rand_persons, rand_articles, rand_days):
        sp_purchases[k][i, j] += 1
def sum_over_first_dim_D():
    '''
    0.1% nonzero, 10000x1000x99: 0.47 s (average over 10)
    1ppm nonzero, 10000x1000x99: 0.41 s (average over 10)
    '''
    global sp_purchases
    d = sp_purchases[:99]
    np.sum(d)
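One way to avoid the element-by-element lil_matrix loop might be to build each day's matrix in a single call with coo_matrix, which sums duplicate coordinates when converted. A sketch under the same variable names (the helper function itself is mine, not from the original post):

```python
import numpy as np
from scipy.sparse import coo_matrix

def create_sparse_purchases(rand_persons, rand_articles, rand_days,
                            N_persons, N_articles, N_days):
    sp = np.empty(N_days, dtype=object)
    for k in range(N_days):
        mask = rand_days == k
        # coo_matrix sums duplicate (i, j) entries on conversion, so repeated
        # purchases by the same person on the same day accumulate automatically
        sp[k] = coo_matrix(
            (np.ones(mask.sum(), dtype=int),
             (rand_persons[mask], rand_articles[mask])),
            shape=(N_persons, N_articles)).tocsr()
    return sp
```

The remaining loop runs once per day (100 iterations) instead of once per nonzero element, which should cut the creation time noticeably for the denser test cases.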
You could reshape the array so it is 2-d, do the sum, and then reshape back:
r.reshape(4, -1).sum(0).reshape(3, 4)  # == r.sum(0)
The reshaping adds little processing time. You can also convert that 2-d array to sparse and see whether that saves any time. My guess is that your array would have to be very large and very sparse to beat the straight numpy sum. If you have other reasons to use the sparse format it may be worth it, but for this sum alone, no. Test it yourself, though.
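On the small example array from the question, the round trip can be checked directly:

```python
import numpy as np

r = np.array([[[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]],
              [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 2, 0]],
              [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
              [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]], dtype=int)

# flatten the trailing two axes, sum over the first, restore the 2-d shape
flat_sum = r.reshape(4, -1).sum(0).reshape(3, 4)
assert (flat_sum == r.sum(0)).all()
```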