
How can I sum sub-matrices in Eigen

Tags: c++, openmp, eigen

I have a matrix defined as:

Eigen::MatrixXd DPCint = Eigen::MatrixXd::Zero(p.szZ*(p.na-1), p.szX);

// Perform some computations and fill every sub-matrix of size [p.szZ, p.szX]
#pragma omp parallel for
for (int i = 0; i < p.na-1; i++)
{
    ...
    DPCint(Eigen::seq(i*p.szZ, (i+1)*p.szZ-1), Eigen::all) = ...;
}

// Now sum every block of p.szZ rows to get a matrix of size [p.szZ, p.szX]

In Matlab, this operation is fast and trivial. Here, I can't simply do a += into a shared accumulator if I want to parallelize the loop with OpenMP. Likewise, I could loop through afterwards and sum every block of p.szZ rows, but that loop can't be parallelized either, since every thread would write to the same output. Is there an efficient way to use Eigen's indexing operations to sum sub-matrices? This seems like a simple operation and I feel like I'm missing something, but I haven't been able to find a solution.

Clarification

Essentially, after the above loop, I'd like to do this in a single line:

for (int i = 0; i < p.na-1; i++)
{
    DPC += DPCint(Eigen::seq(i*p.szZ, (i+1)*p.szZ-1), Eigen::all);
}

In Matlab, I can simply reshape the matrix into a 3D array and sum along the third dimension. I'm not familiar with Eigen's Tensor module, and I hope this operation is doable without resorting to it. However, my priority is speed and efficiency, so I'm open to any suggestions.
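
For reference, here is what I imagine the reshape approach might look like with Eigen 3.4's reshaped() view (an untested sketch, assuming the default column-major storage): each original column of DPCint becomes p.na-1 consecutive columns of the view, so every block sum turns into a row-wise sum over a group of columns.

// Column-major view: [p.szZ*(p.na-1), p.szX] -> [p.szZ, (p.na-1)*p.szX].
// Reshaped column i + j*(p.na-1) holds block i of original column j.
auto R = DPCint.reshaped(p.szZ, (p.na-1)*p.szX);
Eigen::MatrixXd DPC = Eigen::MatrixXd::Zero(p.szZ, p.szX);
for (int j = 0; j < p.szX; j++)
    DPC.col(j) = R.middleCols(j*(p.na-1), p.na-1).rowwise().sum();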

asked Dec 05 '25 by drakon101

1 Answer

Performing a parallel reduction over the na axis is not efficient. That dimension is already pretty small for multiple threads to be useful, and it also (nearly) forces each thread to operate on its own temporary matrix, which is inefficient (the operation is memory-bound, so it does not scale well).
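
For illustration, such an na-axis reduction would look roughly like this (an untested sketch of the approach advised against): every thread needs a private full-size [p.szZ, p.szX] accumulator, and the extra traffic through those temporaries is pure overhead.

Eigen::MatrixXd DPC = Eigen::MatrixXd::Zero(p.szZ, p.szX);
#pragma omp parallel
{
    // Per-thread temporary: one full [p.szZ, p.szX] accumulator each.
    Eigen::MatrixXd local = Eigen::MatrixXd::Zero(p.szZ, p.szX);
    #pragma omp for nowait
    for (int i = 0; i < p.na-1; i++)
        local += DPCint.middleRows(i*p.szZ, p.szZ);
    // Merge the per-thread copies one at a time.
    #pragma omp critical
    DPC += local;
}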

An alternative solution is to parallelize along the szZ dimension. Each thread can work on a slice and perform a local reduction without temporary matrices. Moreover, this approach should also improve the use of CPU caches, since the section of DPC computed by each thread is more likely to fit in cache and thus not be reloaded from RAM. Here is an (untested) example:

// All threads execute the following loops (all iterations, but on different data blocks)
#pragma omp parallel
for (int i = 0; i < p.na-1; i++)
{
    // "nowait" avoids a synchronization at the end of the inner loop, but this
    // requires a static schedule, which is a good idea to use here anyway.
    #pragma omp for schedule(static) nowait
    for (int j = 0; j < p.szZ; j++)
        DPC(j, Eigen::all) += DPCint(i*p.szZ+j, Eigen::all);
}

As pointed out by @chtz, it would be even better to avoid the temporary DPCint matrix entirely, since memory throughput is a very limited resource (especially in parallel code).
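
If the blocks can be produced independently, that could look like the following sketch, where computeBlock is a hypothetical stand-in for whatever computation currently fills each [p.szZ, p.szX] block of DPCint:

Eigen::MatrixXd DPC = Eigen::MatrixXd::Zero(p.szZ, p.szX);
#pragma omp parallel
{
    Eigen::MatrixXd local = Eigen::MatrixXd::Zero(p.szZ, p.szX);
    #pragma omp for nowait
    for (int i = 0; i < p.na-1; i++)
        local += computeBlock(i);  // hypothetical: returns one [p.szZ, p.szX] block
    // Accumulate directly so the big DPCint buffer is never materialized.
    #pragma omp critical
    DPC += local;
}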

EDIT: I assumed the matrices were stored in row-major order, which is not the case by default. This can be changed (see the doc), and it would in fact make both the first and the second loop cache-efficient. However, mixing storage orders is generally error-prone, and using row-major ordering forces you to redefine the basic types. The solution of @Homer512 is an alternative implementation that is certainly better suited for column-major matrices.
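
Redefining the types would look something like this (a sketch; MatrixXdR is just an illustrative alias, not a name Eigen provides):

// Row-major equivalent of Eigen::MatrixXd (illustrative alias).
using MatrixXdR = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

MatrixXdR DPCint = MatrixXdR::Zero(p.szZ*(p.na-1), p.szX);
MatrixXdR DPC    = MatrixXdR::Zero(p.szZ, p.szX);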

answered Dec 08 '25 by Jérôme Richard


Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!