Cython loop over array of indexes

Question

I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).

Here is a stupid example of implementation in cython :

%%cython -f -c=-O2 -I./ 

import numpy as np
cimport numpy as np

cimport cython


cdef class Test:
    
    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices
    
    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices
    
    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use of external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef int n = self.indices.shape[0]
        
        
        for idx in range(n):
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation """
        cdef Py_ssize_t idx, ix, iy
        cdef int nx = self.a.shape[0]
        cdef int ny = self.a.shape[1]
        
        for ix in range(nx):
            for iy in range(ny):
                self.b[ix, iy] = ix * iy * self.a[ix, iy]

with this on the python side :

import itertools
import numpy as np

N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)

and the results :

%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Why does the Test.run1() method run much slower than the Test.run2() method?

What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?

DavidW · Accepted Answer

Because run1 is significantly more complicated...

run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.
It's fairly trivial for the compiler to work out exactly what order it's accessing the array elements in run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.

Your current performance is probably as good as it gets.

Cython loop over array of indexes

Tags:

python

cython

Ipse Lium

1 Answers

DavidW

Recent Activity

Donate For Us

Cython loop over array of indexes

Tags:

python

cython

Ipse Lium

1 Answers

DavidW

Related questions

Recent Activity

Donate For Us