There was a post on this here: https://gist.github.com/JonathanRaiman/f2ce5331750da7b2d4e9 which shows a great speed improvement by just calling the Fortran libraries (BLAS / LAPACK / Intel MKL / OpenBLAS / whatever you installed with NumPy).  After many hours of working on this (because of deprecated SciPy libraries) I finally got it to compile with no results.  It was 2x faster than NumPy.  Unfortunately as another user pointed out, the Fortran routine is always adding the output matrix to the new results calculated, so it only matches NumPy on the 1st run.  I.e. A := alpha*x*y.T + A.  So that remains to be solved with a fast solution. 
[UPDATE: FOR THOSE OF YOU LOOKING TO USE THE SCIPY INTERFACE JUST GO HERE https://github.com/scipy/scipy/blob/master/scipy/linalg/cython_blas.pyx AS THEY ALREADY HAVE OPTIMIZED THE CALLS TO BLAS/LAPACK IN THE CPDEF STATEMENTS, JUST COPY / PASTE INTO YOUR CYTHON SCRIPT # Python-accessible wrappers for testing:  Also at the link above cython_lapack.pyx is available but doesn't have the Cython test scripts]
import numpy as np;
from cyblas import outer_prod;
a=np.random.randint(0,100, 1000);
b=np.random.randint(0,100, 1000);
a=a.astype(np.float64)
b=b.astype(np.float64)
cy_outer=np.zeros((a.shape[0],b.shape[0]));
np_outer=np.zeros((a.shape[0],b.shape[0]));
%timeit outer_prod(a,b,cy_outer)
#%timeit outer_prod(a,b) #use with fixed version instead of above line, results will automatically update cy_outer
%timeit np.outer(a,b, np_outer)
100 loops, best of 3: 2.83 ms per loop
100 loops, best of 3: 6.58 ms per loop
# END TEST SCRIPT
import cython
import numpy as np
cimport numpy as np
from cpython cimport PyCapsule_GetPointer 
cimport scipy.linalg.cython_blas
cimport scipy.linalg.cython_lapack
import scipy.linalg as LA
REAL = np.float64
ctypedef np.float64_t REAL_t
ctypedef np.uint64_t  INT_t
cdef int ONE = 1
cdef REAL_t ONEF = <REAL_t>1.0
ctypedef void (*dger_ptr) (const int *M, const int *N, const double *alpha, const double *X, const int *incX, double *Y, const int *incY, double *A, const int * LDA) nogil
cdef dger_ptr dger=<dger_ptr>PyCapsule_GetPointer(LA.blas.dger._cpointer, NULL)  # A := alpha*x*y.T + A
cpdef outer_prod(_x, _y, _output):
#cpdef outer_prod(_x, _y): #comment above line & use this to use the reset output matrix to zeros
    cdef REAL_t *x = <REAL_t *>(np.PyArray_DATA(_x))
    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    #cdef np.ndarray[np.float64_t, ndim=2, order='c'] _output = np.zeros((M,N)) #slow fix to uncomment to reset output matrix to zeros
    cdef REAL_t *y = <REAL_t *>(np.PyArray_DATA(_y))
    cdef REAL_t *output = <REAL_t *>(np.PyArray_DATA(_output))
    with nogil:
        dger(&M, &N, &ONEF, y, &ONE, x, &ONE, output, &M)
Much appreciated.  Hope this saves other people some time (it ALMOST works) - actually as I commented it works 1x and matches NumPy then every subsequent call adds to the result matrix AGAIN.  If I reset the output matrix to 0 and rerun the results then match NumPy.  Strange... although if one uncomments the few lines above it will work although only at NumPy speeds.  The alternative has been found memset and will be in another post... I just haven't figured out exactly how to call it yet.
According to netlib dger(M, N, ALPHA, X INCX, Y, INCY, A, LDA) performs A := alpha*x*y**T + A. So A should be all zeros to get the outer product of X and Y.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With