Having some trouble optimizing a function that returns the number of neighbors of a cell in a Conway's Game of Life implementation. I'm trying to learn C and just get better at coding. I'm not very good at recognizing potential optimizations, and I've spent a lot of time online reading various methods but it's not really clicking for me yet.
Specifically I'm trying to figure out how to unroll this nested for loop in the most efficient way, but each time I try I just make the runtime longer. I'm including the function, I don't think any other context is needed. Thanks for any advice you can give!
Here is the code for the countNeighbors() function:
static int countNeighbors(board b, int x, int y)
{
   int n = 0;
   int x_left = max(0, x-1);
   int x_right = min(HEIGHT, x+2);
   int y_left = max(0, y-1);
   int y_right = min(WIDTH, y+2);
   int xx, yy;
   for (xx = x_left; xx < x_right; ++xx) {
       for (yy = y_left; yy < y_right; ++yy) {
           n += b[xx][yy];
       }
   }
   return n - b[x][y];
}
Instead of declaring board as b[WIDTH][HEIGHT] declare it as b[WIDTH + 2][HEIGHT + 2]. This gives an extra margin which will have zeros, but it prevents from index out of bounds. So, instead of:
 x x
 x x
We will have:
 0 0 0 0 
 0 x x 0
 0 x x 0
 0 0 0 0 
x denotes used cells, 0 will be unused. 
Typical trade off: a bit of memory for speed.
Thanks to that we don't have to call min and max functions (which have bad for performance if statements).
Finally, I would write your function like that:
int countNeighborsFast(board b, int x, int y)
{
    int n = 0;
    n += b[x-1][y-1];
    n += b[x][y-1];
    n += b[x+1][y-1];
    n += b[x-1][y];
    n += b[x+1][y];
    n += b[x-1][y+1];
    n += b[x][y+1];
    n += b[x+1][y+1];
    return n;
}
Full, working source code.
Thanks to Jongware comment I added linearization (reducing array's dimensions from 2 to 1) and changing int to char.
I also made the main loop linear and calculate the returned sum directly, without an intermediate n variable.
2D array was 10002 x 10002, 1D had 100040004 elements.
The CPU I have is Pentium Dual-Core T4500 at 2.30 GHz, further details  here (output of cat /prof/cpuinfo).
Results on default optimization level O0:
Original: 15.50s
Mine: 10.13s
Linear: 2.51s
LinearAndChars: 2.48s
LinearAndCharsAndLinearLoop: 2.32s
LinearAndCharsAndLinearLoopAndSum: 1.53s
That's about 10x faster compared to the original version.
Results on O2:
Original: 6.42s
Mine: 4.17s
Linear: 0.55s
LinearAndChars: 0.53s
LinearAndCharsAndLinearLoop: 0.42s
LinearAndCharsAndLinearLoopAndSum: 0.44s
About 15x faster.
On O3:
Original: 10.44s
Mine: 1.47s
Linear: 0.26s
LinearAndChars: 0.26s
LinearAndCharsAndLinearLoop: 0.25s
LinearAndCharsAndLinearLoopAndSum: 0.24s
About 44x faster.
The last version, LinearAndCharsAndLinearLoopAndSum is:
typedef char board3[(HEIGHT + 2) * (WIDTH + 2)];
int i;
for (i = WIDTH + 3; i <= (WIDTH + 2) * (HEIGHT + 1) - 2; i++)
    countNeighborsLinearAndCharsAndLinearLoopAndSum(b3, i);
int countNeighborsLinearAndCharsAndLinearLoopAndSum(board3 b, int pos)
{
    return
    b[pos - 1 - (WIDTH + 2)] +
    b[pos     - (WIDTH + 2)] +
    b[pos + 1 - (WIDTH + 2)] +
    b[pos - 1] +
    b[pos + 1] +
    b[pos - 1 + (WIDTH + 2)] +
    b[pos     + (WIDTH + 2)] +
    b[pos + 1 + (WIDTH + 2)];
}
Changing 1 + (WIDTH + 2) to WIDTH + 3 won't help, because compiler takes care of it anyway (even on O0 optimization level).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With