The following lines of code
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );
unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
take 11130 usecs to run on my i5-3230M when compiled with
g++ -o main main.cpp -std=c++0x -O3
That is, when the OpenMP pragma is ignored.
On the other hand, it only takes 1496 usecs when compiled with
g++ -o main main.cpp -std=c++0x -O3 -fopenmp
That is more than 6 times faster, which is quite surprising considering that it runs on a 2-core machine. In fact, I have also tested it with num_threads(1) and the performance improvement is still significant (more than 3 times faster).
Can anybody help me understand this behaviour?
EDIT: Following the suggestions, here is the full piece of code:
#include <stdlib.h>
#include <iostream>
#include <chrono>
#include <cassert>
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;
void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}
int main()
{
    // allocation & initialization
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for(int k=0; k<numel; k++)
        buff[k] = 0;
    //
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();
    //
    for(int k=0; k<100; k++)
        func();
    //
    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;
    return 0;
}
The answer, as it turns out, is that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the parallel for loop. That in turn means the compiler can treat nrows and ncols as loop-invariant constants. It cannot make the same assumption about the global variables: because the loop writes through an unsigned char * (which is allowed to alias any object), the compiler must assume that *pbuff += 1 could modify nrows or ncols, and therefore has to reload them from memory on every iteration.
Consequently, with -fopenmp, you end up with the huge speedup because the loop bounds are no longer read from global variables on each iteration. (Plus, with ncols effectively constant within the region, the compiler gets to do a bit of loop unrolling.)
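In other words, inside the parallel region the clause behaves roughly like the following hand-written version (a rough sketch of the single-threaded num_threads(1) case only; the local_* names are purely illustrative, and this is not the code GCC actually emits):
void func_firstprivate_sketch()
{
    // firstprivate gives the region its own initialized copies:
    unsigned char *local_pbuff = buff;
    int local_nrows = nrows;
    int local_ncols = ncols;
    // The bounds are now locals whose address is never taken, so the
    // compiler knows *local_pbuff += 1 cannot change them; they can stay
    // in registers and the inner loop can be unrolled.
    for (int i = 0; i < local_nrows; i++)
    {
        for (int j = 0; j < local_ncols; j++)
        {
            *local_pbuff += 1;
            local_pbuff++;
        }
    }
}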
By changing
int nrows = 4096;
int ncols = 4096;
to
const int nrows = 4096;
const int ncols = 4096;
or by changing
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
to
int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
    for (int j=0; j<_ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
the anomalous speedup vanishes - the non-OpenMP code is now just as fast as the OpenMP code.
The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.
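A related way to follow that advice (not part of the fix above, just a sketch) is to stop reading the globals inside the hot function altogether and pass everything it needs as parameters, which gives the compiler the same guarantee:
// Hypothetical parameter-passing variant of func(): rows and cols are
// by-value locals whose address is never taken, so the writes through p
// cannot possibly modify them.
void func_params(unsigned char *p, int rows, int cols)
{
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *p += 1;
            p++;
        }
    }
}
// called as: func_params(buff, nrows, ncols);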