C++ parallel average matrix with OpenMP

I have this code in C++ that calculates the average of each column of a matrix. I want to parallelize the code using OpenMP.

#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>
using namespace std;

vector<double> average(const vector<vector<unsigned char>>& original){
  vector<vector<double>> result(original.size(), vector<double>(original[0].size()));
  vector<double> average(original[0].size(), 0.0);

  for (int i=0; i<original.size(); i++) {
    const vector<unsigned char>& vector = original[i];
    for (int k = 0; k < vector.size(); ++k) {
      average[k] += vector[k];
    }
  }
  for (double& val : average) {
    val /= original.size();
  }

  return average;
}

Adding #pragma omp parallel for before the outer for loop gets me bogus results. Do you have any pointers? I thought I would find tons of examples of this online but wasn't able to find much. This is my first time using OpenMP.

asked Oct 18 '25 by Pablo M

2 Answers

Frank is right in saying that your immediate problem may be that you're using a non-atomic operation:

average[k] += vector[k];

You can fix that by using:

#pragma omp atomic
average[k] += vector[k];

But a larger conceptual problem is that this probably will not speed up your code. The operations you are doing are very simple and your memory (at least within each row) is already contiguous.

Indeed, I have made a Minimum Working Example for your code (you should have done this for your question):

#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>
using namespace std;

vector<float> average(const vector<vector<unsigned char>>& original){
  vector<float> average(original[0].size(), 0.0);

  #pragma omp parallel for
  for (int i=0; i<original.size(); i++) {
    const vector<unsigned char>& vector = original[i];
    for (int k = 0; k < vector.size(); ++k) {
      #pragma omp atomic
      average[k] += vector[k];
    }
  }
  for (float& val : average) {
    val /= original.size();
  }

  return average;
}

int main(){
  vector<vector<unsigned char>> mat(1000);
  for(int y=0;y<mat.size();y++)
  for(int x=0;x<mat.size();x++)
    mat.at(y).emplace_back(rand()%255);

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
  double dont_optimize = 0;
  for(int i=0;i<100;i++){
    auto ret = average(mat);
    dont_optimize += ret[0];
  }
  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

  std::cout<<"Time = "<<(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100)<<std::endl;

  return 0;
}

Compiling this with g++ -O3 temp.cpp -fopenmp enables OpenMP. Run-times on my four-core machine are consistently about 10,247 microseconds. When I disable OpenMP, run-times are about 2,561 microseconds.

Starting and managing a thread team is expensive.
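If you want to see that overhead by itself, you can time an empty parallel region. This is just a rough sketch (compile with g++ -O3 -fopenmp; note that most OpenMP runtimes keep the worker threads alive after the first region, so this mostly measures the per-region fork/join synchronization rather than thread creation):

#include <chrono>
#include <iostream>
#include <omp.h>

int main(){
  std::cout << "Max threads: " << omp_get_max_threads() << std::endl;

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
  for(int i=0;i<100;i++){
    #pragma omp parallel
    {
      // Empty region: the only cost is waking the team and synchronizing at the end
    }
  }
  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

  std::cout << "Fork/join cost = "
            << (std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100)
            << " microseconds per region" << std::endl;
  return 0;
}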

But there is a real way to speed up your code: improve your memory layout.

Using a std::vector< std::vector<T> > design means that each inner vector<T> can be located anywhere in memory. Instead, we would like all of the data in one nice contiguous block. We can achieve this by using flat-array indexing, like so:

(Note that an earlier version of the below code used, e.g., mat.at(y*width+x). The range checking this implies results in a significant loss of speed versus using mat[y*width+x], as the code now does. Times have been updated appropriately.)

#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>
using namespace std;

class Matrix {
 public:
  vector<unsigned char> mat;
  int width;
  int height;
  Matrix(int width0, int height0){
    width  = width0;
    height = height0;
    for(int i=0;i<width*height;i++)
      mat.emplace_back(rand()%255);
  }
  unsigned char& operator()(int x, int y){
    return mat[y*width+x];
  }
  unsigned char operator()(int x, int y) const {
    return mat[y*width+x];
  }
  unsigned char& operator()(int i){
    return mat[i];
  }
  unsigned char operator()(int i) const {
    return mat[i];
  }
};

vector<float> average(const Matrix& original){
  vector<float> average(original.width, 0.0);

  #pragma omp parallel for
  for(int y=0;y<original.height;y++)
  for(int x=0;x<original.width;x++)
    #pragma omp atomic
    average[x] += original(x,y);

  for (float& val : average) 
    val /= original.height;

  return average;
}

int main(){
  Matrix mat(1000,1000);

  std::cerr<<mat.width<<" "<<mat.height<<std::endl;

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
  double dont_optimize = 0;
  for(int i=0;i<100;i++){
    auto ret = average(mat);
    dont_optimize += ret[0];
  }
  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

  std::cout<<"Time = "<<(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/100)<<std::endl;

  return 0;
}

Note that I am also using float instead of double: you can cram twice the numbers into the same amount of space this way, which is good for caching.

This gives run-times of 292 microseconds without OpenMP and 9426 microseconds with OpenMP.

In conclusion, using OpenMP/parallelism slows down your code here because the work being done in parallel takes less time than setting up the parallelism, but using a better memory layout cuts the run-time by roughly 90% (nearly a 9x speed-up). In addition, using the handy Matrix class I created improves your code's readability and maintainability.

Edit:

Running this on matrices that are 10,000x10,000 instead of 1,000x1,000 gives similar results. For the vector of vectors: 7,449 microseconds without OpenMP and 156,316 microseconds with OpenMP. For flat-array indexing: 32,668 microseconds without OpenMP and 145,470 microseconds with OpenMP.

The performance may be related to the hardware instruction set available on my machine (particularly, if my machine lacks atomic instructions then OpenMP will have to simulate them with mutexes and such). Indeed, in the flat-array example compiling with -march=native gives improved, though still not-great, performance for OpenMP: 33,079 microseconds without OpenMP and 127,841 microseconds with OpenMP. I will experiment with a more powerful machine later.

Edit

While the aforementioned testing was performed on the Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz, I have compiled this code (with -O3 -march=native) on the badass Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. The results are similar:

  • 1000x1000, Vector of vectors, without OpenMP: 145μs
  • 1000x1000, Vector of vectors, with OpenMP: 2,941μs
  • 10000x10000, Vector of vectors, without OpenMP: 20,254μs
  • 10000x10000, Vector of vectors, with OpenMP: 85,703μs
  • 1000x1000, Flat array, without OpenMP: 139μs
  • 1000x1000, Flat array, with OpenMP: 3,171μs
  • 10000x10000, Flat array, without OpenMP: 18,712μs
  • 10000x10000, Flat array, with OpenMP: 89,097μs

This confirms our earlier result: using OpenMP for this task tends to slow things down, even if your hardware is amazing. In fact, most of the speed-up between the two processors is likely due to the Xeon's large L3 cache: at 30,720K it's 10x larger than the 3,072K cache on the i5.

Edit

Incorporating Zulan's reduction strategy from their answer below allows us to efficiently leverage parallelism:

vector<float> average(const Matrix& original){
  vector<float> average(original.width, 0.0);
  auto average_data = average.data();

  #pragma omp parallel for reduction(+ : average_data[ : original.width])
  for(int y=0;y<original.height;y++){
    for(int x=0;x<original.width;x++)
      average_data[x] += original(x,y);
  }

  for (float& val : average) 
    val /= original.height;

  return average;
}

For 24 threads, this gives 2,629 microseconds on the 10,000x10,000 arrays: a 7.1x improvement over the serial version. Using Zulan's strategy on your original vector-of-vectors layout (without the flat-array indexing) gives 3,529 microseconds, so the better layout still cuts the run-time by about 25%.
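(As an aside, the team size above is just what that machine gives by default. If you want to pin it down when experimenting, you can either set OMP_NUM_THREADS in the environment before running, or request it in code; a minimal sketch, where 24 simply mirrors the figure above:)

#include <iostream>
#include <omp.h>

int main(){
  // Same effect as running the program with OMP_NUM_THREADS=24 set
  omp_set_num_threads(24);

  #pragma omp parallel
  {
    #pragma omp single
    std::cout << "Team size: " << omp_get_num_threads() << std::endl;
  }
  return 0;
}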

answered Oct 21 '25 by Richard


Frank and Richard have the basic issue right, and the hint about memory layout is also sound. However, it is possible to do much better than using atomic. Not only is atomic data access quite expensive; having all threads write to a completely shared memory region also hurts cache performance badly. So a parallel loop with nothing but an atomic increment in it is unlikely to scale well.

Reduction

The basic idea is to first compute a local sum vector, and then sum those vectors up safely later. That way, most of the work can be done independently and efficiently. Recent OpenMP versions make it quite convenient to do that.
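(The array-section reduction used below requires OpenMP 4.5 or newer. On older compilers the same idea can be written out by hand: each thread accumulates into its own private buffer, and the buffers are merged under a critical section at the end. Here is a rough sketch of that fallback, assuming the flat Matrix class from the full example below:)

std::vector<float> average_manual(const Matrix &original) {
  std::vector<float> average(original.width, 0.0f);

  #pragma omp parallel
  {
    // Per-thread partial sums: no sharing and no atomics in the hot loop
    std::vector<float> local(original.width, 0.0f);

    #pragma omp for nowait
    for (int row = 0; row < original.height; row++)
      for (int col = 0; col < original.width; col++)
        local[col] += original(row, col);

    // Merge the per-thread buffers; only this short step is serialized
    #pragma omp critical
    {
      for (int col = 0; col < original.width; col++)
        average[col] += local[col];
    }
  }

  for (float &val : average)
    val /= original.height;
  return average;
}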

Here is the example code, based on Richard's example; I do, however, swap the indices and fix the operator() efficiency.

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <vector>

class Matrix {
public:
  std::vector<unsigned char> mat;
  int width;
  int height;
  Matrix(int width0, int height0) {
    srand(0);
    width = width0;
    height = height0;
    for (int i = 0; i < width * height; i++)
      mat.emplace_back(rand() % 255);
  }
  unsigned char &operator()(int row, int col) { return mat[row * width + col]; }
  unsigned char operator()(int row, int col) const {
    // do not use at here, the extra check is too expensive for the tight loop
    return mat[row * width + col];
  }
};

std::vector<float> __attribute__((noinline)) average(const Matrix &original) {
  std::vector<float> average(original.width, 0.0);
  // We can't do array reduction directly on vectors
  auto average_data = average.data();

  #pragma omp parallel reduction(+ : average_data[ : original.width])
  {
    #pragma omp for
    for (int row = 0; row < original.height; row++) {
      for (int col = 0; col < original.width; col++) {
        average_data[col] += original(row, col);
      }
    }
  }
  for (float &val : average) {
    val /= original.height;
  }
  return average;
}

int main() {
  Matrix mat(500, 20000);

  std::cerr << mat.width << " " << mat.height << std::endl;

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
  double dont_optimize = 0;
  for (int i = 0; i < 100; i++) {
    auto ret = average(mat);
    dont_optimize += ret[0];
  }
  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

  std::cout << "Time = "
            << (std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count() / 100.)
            << "\n" << dont_optimize << std::endl;
  return 0;
}

For your given matrix size, this reduces the time from ~1.8 ms to ~0.3 ms with 12 threads on an Intel Xeon E5-2680 v3 at 2.5 GHz nominal frequency.

Switching the loop

Alternatively, you could parallelize the inner loop, since its iterations are independent of each other. However, that would be slower due to the small chunks of work for each thread. You could then swap the inner and outer loop, but that makes memory access non-contiguous, which is also bad for performance. So the best approach there is to split the inner loop like this:

constexpr size_t chunksize = 128;
#pragma omp parallel for
for (size_t col_chunk = 0; col_chunk < (size_t)original.width; col_chunk += chunksize) {
  // Each thread owns a contiguous block of columns, so no two threads touch the same element
  const size_t col_end = std::min(col_chunk + chunksize, (size_t)original.width);  // std::min needs <algorithm>
  for (size_t row = 0; row < (size_t)original.height; row++) {
    for (size_t col = col_chunk; col < col_end; col++) {
      average_data[col] += original(row, col);
    }
  }
}

This gives you reasonably contiguous memory access while avoiding all interaction between threads. However, there may still be some false sharing at the borders between threads. I haven't been able to easily get very good performance out of this, but it's still faster than serial with a sufficient number of threads.

answered Oct 21 '25 by Zulan


