Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cython loop over array of indexes

Tags:

python

cython

I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).

Here is a stupid example of implementation in cython :

%%cython -f -c=-O2 -I./ 

import numpy as np
cimport numpy as np

cimport cython


cdef class Test:
    
    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices
    
    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices
    
    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use of external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef int n = self.indices.shape[0]
        
        
        for idx in range(n):
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation """
        cdef Py_ssize_t idx, ix, iy
        cdef int nx = self.a.shape[0]
        cdef int ny = self.a.shape[1]
        
        for ix in range(nx):
            for iy in range(ny):
                self.b[ix, iy] = ix * iy * self.a[ix, iy]

with this on the python side :

import itertools
import numpy as np

N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)

and the results :

%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Why does the Test.run1() method run much slower than the Test.run2() method?

What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?

like image 934
Ipse Lium Avatar asked Oct 20 '25 04:10

Ipse Lium


1 Answers

Because run1 is significantly more complicated...

  1. run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.
  2. It's fairly trivial for the compiler to work out exactly what order it's accessing the array elements in run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.

Your current performance is probably as good as it gets.

like image 150
DavidW Avatar answered Oct 21 '25 16:10

DavidW



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!