I would like to least-squares match data (a numpy array of floats) against many known signal shapes. My code works, but it is too slow for the many runs I plan to do:
import numpy
import time

samples = 50000
width_signal = 100
data = numpy.random.normal(0, 1, samples)
signal = numpy.random.normal(0, 1, width_signal)  # Placeholder

t0 = time.perf_counter()
for i in range(samples - width_signal):
    data_chunk = data[i:i + width_signal]
    residuals = data_chunk - signal
    squared_residuals = residuals**2
    summed_residuals = numpy.sum(squared_residuals)
t1 = time.perf_counter()
print('Time elapsed (sec)', t1 - t0)
EDIT: Corrected a mistake: First square residuals, then sum them.
This takes about 0.2 sec to run on my machine. As I have many datasets and signal shapes, this is too slow. My specific problem does not allow for typical MCMC methods because the signal shapes are too different. It has to be brute force.
Typical volumes are 50,000 floats for the data and 100 for the signal. These can vary by a factor of a few.
My tests show that:

numpy.sum(residuals) eats 90% of the time. I tried Python's sum(residuals), which is faster for small arrays (roughly fewer than 50 elements) but slower for bigger ones. Should I insert an if condition?
I tried numpy.roll() instead of fetching the data chunks directly, and .roll() is slower.

Questions:

What is the fastest way to compute these summed squared residuals for all offsets?
Based on the various methods proposed in Compute mean squared, absolute deviation and custom similarity measure - Python/NumPy, we can solve our case here.
Approach #1
We can leverage scikit-image's view_as_windows, which is based on np.lib.stride_tricks.as_strided, to get sliding windows and thus have our first solution, like so -
import numpy as np
from skimage.util import view_as_windows

d = view_as_windows(data, (width_signal)) - signal  # windowed diffs
out = np.einsum('ij,ij->i', d, d)                   # per-window sum of squared diffs
More info on the use of as_strided-based view_as_windows.
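If scikit-image is not at hand, recent NumPy (1.20+) exposes the same kind of window view. A minimal sketch of the equivalent idea with numpy.lib.stride_tricks.sliding_window_view, assuming data, signal and width_signal from the question are in scope -

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Same windowed diffs as above, using NumPy's own sliding-window view (NumPy 1.20+)
d = sliding_window_view(data, width_signal) - signal
out = np.einsum('ij,ij->i', d, d)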
Approach #2
Again, based on the matrix-multiplication trick in that answer post, we can improve the performance, like so -
def MSD_strided(data, signal):
    w = view_as_windows(data, (width_signal))
    return (w**2).sum(1) + (signal**2).sum(0) - 2*w.dot(signal)
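The trick rests on the per-window expansion sum((w - s)**2) = sum(w**2) + sum(s**2) - 2*w.dot(s), so the expensive part reduces to one matrix-vector product. A quick sanity check against Approach #1 (a sketch, assuming data, signal and width_signal from the question are in scope) -

import numpy as np
from skimage.util import view_as_windows

# The expanded form should agree with the direct per-window squared-residual sums
w = view_as_windows(data, (width_signal))
direct = np.einsum('ij,ij->i', w - signal, w - signal)
expanded = (w**2).sum(1) + (signal**2).sum(0) - 2*w.dot(signal)
print(np.allclose(direct, expanded))  # expected: True (up to round-off)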
Approach #3
We can improve on Approach #2 by bringing in uniform filtering and convolution -
from scipy.ndimage import uniform_filter

def MSD_uniffilt_conv(data, signal):
    hW = width_signal//2
    l = len(data) - len(signal) + 1
    parte1 = uniform_filter(data**2, width_signal)[hW:hW+l]*width_signal  # sliding sums of data**2
    parte3 = np.convolve(data, signal[::-1], 'valid')                     # cross term
    return parte1 + (signal**2).sum(0) - 2*parte3
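Here uniform_filter supplies the sliding sums of data**2 (it returns a moving average, hence the multiplication by width_signal and the hW offset into the valid region), while the reversed-signal convolution supplies the cross term. A quick check against Approach #1, under the same assumptions as above -

# Sanity check: filter/convolution variant vs. the direct windowed version
w = view_as_windows(data, (width_signal))
ref = np.einsum('ij,ij->i', w - signal, w - signal)
print(np.allclose(MSD_uniffilt_conv(data, signal), ref))  # expected: True (up to round-off)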
Benchmarking
Timings on posted sample -
In [117]: %%timeit
    ...: for i in range(samples - width_signal + 1):
    ...:     data_chunk = data[i:i + width_signal]
    ...:     residuals = data_chunk - signal
    ...:     squared_residuals = residuals**2
    ...:     summed_residuals = numpy.sum(squared_residuals)
1 loop, best of 3: 239 ms per loop
In [118]: %%timeit
...: d = view_as_windows(data,(width_signal))-signal
...: np.einsum('ij,ij->i',d,d)
100 loops, best of 3: 11.1 ms per loop
In [209]: %timeit MSD_strided(data, signal)
10 loops, best of 3: 18.4 ms per loop
In [210]: %timeit MSD_uniffilt_conv(data, signal)
1000 loops, best of 3: 1.71 ms per loop
~140x speedup there with the third one!
Apart from the versions given by Divakar, you could simply use a compiler like Numba or Cython.
Example
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def sq_residuals(data, signal):
    summed_residuals = np.empty(data.shape[0] + 1 - signal.shape[0], dtype=data.dtype)
    for i in nb.prange(data.shape[0] - signal.shape[0] + 1):
        acc = 0.
        for j in range(signal.shape[0]):
            acc += (data[i + j] - signal[j])**2
        summed_residuals[i] = acc
    return summed_residuals
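A minimal usage sketch (the arrays just mirror the question's setup; the first call triggers JIT compilation, later calls run the compiled, parallel loop) -

import numpy as np

data = np.random.normal(0, 1, 50000)
signal = np.random.normal(0, 1, 100)

sq_residuals(data, signal)           # first call: pays the compilation overhead
summed = sq_residuals(data, signal)  # later calls: compiled machine code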
Timings
Numba 0.4dev, Python 3.6, Quadcore i7
MSD_uniffilt_conv (Divakar): 2.4 ms
sq_residuals (after the first call, which invokes some compilation overhead): 1.7 ms