
(memory-)efficient operations between arbitrary columns of numpy array

I have a large 2D numpy array. I would like to be able to efficiently run row-wise operations on subsets of the columns, without copying the data.

In what follows, a = np.arange(10000000).reshape(1000, 10000) and columns = np.arange(1, 1000, 2). For reference,

In [4]: %timeit a.sum(axis=1)
7.26 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The approaches I am aware of are:

  1. fancy indexing with a list of columns
In [5]: %timeit a[:, columns].sum(axis=1)
42.5 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  2. fancy indexing with a boolean mask of columns
In [6]: cols_mask = np.zeros(10000, dtype=bool)
   ...: cols_mask[columns] = True

In [7]: %timeit a[:, cols_mask].sum(axis=1)
42.1 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  3. masked array
In [8]: cells_mask = np.ones((1000, 10000), dtype=bool)

In [9]: cells_mask[:, columns] = False

In [10]: am = np.ma.masked_array(a, mask=cells_mask)

In [11]: %timeit am.sum(axis=1)
80 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  4. Python loop
In [12]: %timeit sum([a[:, i] for i in columns])
31.2 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
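
For context on why copying matters here: indexing with an integer array (or a boolean mask) always produces a new array, whereas basic slicing with a step returns a view on the same memory. The sketch below illustrates the difference on a tiny array; the sizes and the names `fancy` and `sliced` are illustrative only, and the slice trick applies only when the wanted columns happen to be evenly spaced.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)
columns = np.array([1, 3])

fancy = a[:, columns]   # integer-array (fancy) indexing: allocates a copy
sliced = a[:, 1:4:2]    # basic slicing with a step: a view, no data copied

print(np.shares_memory(a, fancy))     # False
print(np.shares_memory(a, sliced))    # True
print(np.array_equal(fancy, sliced))  # True
```

For truly arbitrary column sets no such slice exists, which is why the approaches above either copy or loop.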

Somewhat surprisingly to me, the last approach is the fastest. It also avoids copying the full data, which for me is a prerequisite. However, it is still much slower than the simple sum (on double the data size), and most importantly, it is not trivial to generalize to other operations (e.g., cumsum).
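
One possible middle ground, sketched here under assumptions and not benchmarked, is to apply the function to blocks of rows, so that only a small slab of the selected columns is copied at any time. The helper name `apply_rowwise_chunked` and the `chunk_rows` parameter are hypothetical, and the idea only works for functions that treat each row independently when called with `axis=1` (such as `np.sum` or `np.cumsum`).

```python
import numpy as np

def apply_rowwise_chunked(a, columns, func, chunk_rows=64):
    """Apply func(block, axis=1) to a[:, columns], one block of rows at a time.

    Hypothetical helper: at most chunk_rows x len(columns) elements are
    copied per step, instead of the whole column subset at once.
    """
    pieces = []
    for start in range(0, a.shape[0], chunk_rows):
        # The row slice is a view; the column selection copies a small block.
        block = a[start:start + chunk_rows][:, columns]
        pieces.append(func(block, axis=1))
    return np.concatenate(pieces, axis=0)

# Reduced sizes for illustration.
a = np.arange(200_000).reshape(200, 1000)
columns = np.arange(1, 1000, 2)

row_sums = apply_rowwise_chunked(a, columns, np.sum)
row_cumsums = apply_rowwise_chunked(a, columns, np.cumsum)
```

The result of an operation like cumsum is still as large as the selected data, so this bounds only the temporary copy, not the output.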

Is there any approach I am missing? I would be fine with writing some cython code, but I would like the approach to work for any numpy function, not just sum.

Pietro Battiston asked Dec 06 '25 14:12


1 Answer

On this one, pythran seems a bit faster than numba, at least on my rig:

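import numpy as np

#pythran export col_sum(float[:,:], int[:])
#pythran export col_sum(int[:,:], int[:])

def col_sum(data, idx):
    # data.T[idx] picks the requested columns (as rows of the transpose);
    # summing over axis 0 then yields one value per row of data.
    return data.T[idx].sum(0)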

Compile with pythran <filename.py>
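
Because the `#pythran export` lines are ordinary comments, the same function also runs uncompiled as plain NumPy, which gives a quick way to check that it matches the fancy-indexing sum from the question. A small sketch with reduced array sizes (the sizes and the names `expected` and `result` are illustrative):

```python
import numpy as np

def col_sum(data, idx):
    # Same body as the Pythran function above, run here as plain NumPy.
    return data.T[idx].sum(0)

a = np.arange(60).reshape(6, 10)
columns = np.arange(1, 10, 2)

expected = a[:, columns].sum(axis=1)   # reference: fancy indexing, then row sum
result = col_sum(a, columns)

print(np.array_equal(result, expected))  # True
```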

Timings:

from timeit import timeit

# Each result is the total time in seconds for 1000 calls.
timeit(lambda:cs_pythran.col_sum(a, columns),number=1000)
# 1.644187423051335
timeit(lambda:cs_numba.col_sum(a, columns),number=1000)
# 2.635075871949084

Paul Panzer answered Dec 09 '25 04:12