The following lines of code
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );
unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
take 11130 usecs to run on my i5-3230M when compiled with
g++ -o main main.cpp -std=c++0x -O3
That is, when the OpenMP pragma is ignored.
On the other hand, it only takes 1496 usecs when compiled with
g++ -o main main.cpp -std=c++0x -O3 -fopenmp
That is more than 6 times faster, which is quite surprising considering that it runs on a 2-core machine. In fact, I have also tested it with num_threads(1) and the performance improvement is still significant (more than 3 times faster).
Can anybody help me understand this behaviour?
EDIT: Following the suggestions, here is the full piece of code:
#include <stdlib.h>
#include <iostream>
#include <chrono>
#include <cassert>
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;
void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}
int main()
{
    // allocation & initialization
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for(int k=0; k<numel; k++)
        buff[k] = 0;
    //
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();
    //
    for(int k=0; k<100; k++)
        func();
    //
    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;
    return 0;
}
The answer, as it turns out, is that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the parallel for loop. That in turn means the compiler can treat nrows and ncols as loop-invariant constants. It cannot make the same assumption about the global variables: because the loop writes through an unsigned char * (which is allowed to alias any object), the compiler must assume that *pbuff += 1 could modify nrows or ncols, and therefore has to reload them from memory on every iteration.
Consequently, with -fopenmp, you end up with the huge speedup because the loop bounds are no longer read from global variables on each iteration. (Plus, with ncols effectively constant within the region, the compiler gets to do a bit of loop unrolling.)
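In other words, inside the parallel region the clause behaves roughly like the following hand-written version (a rough sketch of the single-threaded num_threads(1) case only; the local_* names are purely illustrative, and this is not the code GCC actually emits):
void func_firstprivate_sketch()
{
    // firstprivate gives the region its own initialized copies:
    unsigned char *local_pbuff = buff;
    int local_nrows = nrows;
    int local_ncols = ncols;
    // The bounds are now locals whose address is never taken, so the
    // compiler knows *local_pbuff += 1 cannot change them; they can stay
    // in registers and the inner loop can be unrolled.
    for (int i = 0; i < local_nrows; i++)
    {
        for (int j = 0; j < local_ncols; j++)
        {
            *local_pbuff += 1;
            local_pbuff++;
        }
    }
}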
By changing
int nrows = 4096;
int ncols = 4096;
to
const int nrows = 4096;
const int ncols = 4096;
or by changing
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
to
int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
    for (int j=0; j<_ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
the anomalous speedup vanishes - the non-OpenMP code is now just as fast as the OpenMP code.
The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.
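A related way to follow that advice (not part of the fix above, just a sketch) is to stop reading the globals inside the hot function altogether and pass everything it needs as parameters, which gives the compiler the same guarantee:
// Hypothetical parameter-passing variant of func(): rows and cols are
// by-value locals whose address is never taken, so the writes through p
// cannot possibly modify them.
void func_params(unsigned char *p, int rows, int cols)
{
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *p += 1;
            p++;
        }
    }
}
// called as: func_params(buff, nrows, ncols);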