I'm trying to implement a convolutional neural network in Python. Originally, I was using scipy.signal's convolve2d function to do the convolution, but it has a lot of overhead, and it would be faster to just implement my own algorithm in C and call it from python, since I know what my input looks like.
I've implemented 2 functions:
Neither of these functions has padding since I require dimensionality reduction.
// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;
    for(i = 0; i < RESULT_DIM; i++) 
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++) 
        {   
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM;  // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }
    return result;
}
// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int t;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));
    for(i = 0; i < RESULT_DIM; i++) // convolve with w1 
    {
        t = i * RESULT_DIM;
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[k2 * IMG_DIM + t + j];
            }
            tmp[t + j] = acc;
        }
    }
    for(i = 0; i < RESULT_DIM; i++) // convolve with w2
    {
        t = i * RESULT_DIM;
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t + (j + k2)];
            }
            result[t + j] = acc;
        }
    }
    free(tmp);
    return result;
}
Compiling with gcc's -O3 flag and testing on a 2.7GHz Intel i7, using a 4000x4000 matrix and 5x5 kernel, I get respectively (avg of 5):
271.21900 ms
127.32000 ms
This is still a considerable improvement over scipy.signal's convolve2d which takes around 2 seconds for the same operation, but I need more speed since I'll be calling this function thousands of times. Changing the data type to float isn't an option at the moment, even though it'd cause a considerable speedup.
Is there a way I can optimise these algorithms further? Can I apply any cache tricks or routines to speed it up?
Any suggestions would be appreciated.
If you're running on x86 only then consider using SSE or AVX SIMD optimisation. For double data the throughput improvement will be modest, but if you can switch to float then you may be able to get to around 4x improvement with SSE or 8x with AVX. There are a number of questions and answers about this very topic on StackOverflow already from which you may be able to get some ideas on the implementation. Alternatively there are also many libraries available which include high performance 2D convolution (filtering) routines, and these typically exploit SIMD for performance, e.g. Intel's IPP (commercial), or OpenCV (free).
Another possibility is to exploit multiple cores - split your image into blocks and run each block in its own thread. E.g. if you have a 4 core CPU then split your image into 4 blocks. (See pthreads).
You can of course combine both of the above ideas, if you really want to fully optimise this operation.
Some small optimisations which you can apply to your current code, and to any future implementations (e.g. SIMD):
if your kernels are symmetric (or odd-symmetric) then you can reduce the number of operations by adding (subtracting) symmetric input values and performing one multiply rather than two
for the separable case, rather than allocating a full frame temporary buffer, consider using a "strip-mining" approach - allocate a smaller buffer, which is full width, but a relatively small number of rows, then process your image in "strips", alternately applying the horizontal kernel and the vertical kernel. The advantage of this is that you have a much more cache-friendly access pattern and a smaller memory footprint.
A few comments on coding style:
the register keyword has been redundant for many years, and modern compilers will emit a warning if you try to you use it - save yourself some noise (and some typing) by ditching it
casting the result of malloc in C is frowned upon - it's redundant and potentially dangerous.
make any input parameters const (i.e. read-only) and use restrict for any parameters which can never be aliased (e.g. a and result) - this can not only help to avoid programming errors (at least in the case of const), but in some cases it can help the compiler to generate better optimised code (particularly in the case of potentially aliased pointers). 
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With