What is the best way to perform parallel reduction to consolidate contributions to a matrix?

I am attempting to parallelise a calculation and consolidate the results into a matrix. A large number of calculations are performed and each one contributes to a summed matrix of all the results.

This is a reduction; many frameworks (such as Kokkos and CUDA) offer support for the reduction of scalars, summing one number from each parallelised calculation. However, I want to reduce a matrix.

The resulting matrix scales with the problem size, but always remains far smaller than the number of parallelised calculations. There are always multiple contributions to each matrix entry.

My code is in C++ and I'm currently using the Kokkos framework to achieve the parallelisation.

Attempts

1

I tried giving each thread a copy of the matrix, copying all of these from the device (GPU) to the host (CPU) and summing them in serial.

  • The GPU memory required for all the matrices meant I had to perform the calculation in small batches
  • The amount of data copied from device to host was huge, and copying it all just to sum it afterwards was inefficient
  • The runtime ended up being slower than the serial method

2

As above, but I performed the serial summation on one thread on the device (GPU), then copied the summed matrix to the host.

  • Still the same GPU memory limit
  • The serial summation on one GPU thread was very slow
  • Minimal memory-copy time
  • The runtime only just matched the serial method

3

I created a device-side matrix with the Kokkos::Atomic memory trait, then had each thread += its contribution to that one matrix. This relies on atomic access to the matrix to prevent collisions. I then copy the matrix to the host.

  • Every update is an atomic operation, compromising the parallelisation
  • Minimal memory-copy time
  • Overall a 20× speed increase over serial, which is good, but far below the theoretical potential of the GPU (an A100 with 10752 CUDA cores)
  • This is the scheme I will proceed with, if I can't do any better

Can anyone recommend a better scheme than the atomic matrix? Is there a better C++ framework with such functionality standardised?

Minimal example with atomic matrix:

#include <iostream>
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int matrix_size = 200;
        const int batches = 10;

        // Device-side result matrix; the Atomic trait makes every access atomic
        Kokkos::View<double **, Kokkos::MemoryTraits<Kokkos::Atomic>> r(
            "result_matrix", matrix_size, matrix_size);

        for (int batch = 0; batch < batches; batch++) {
            Kokkos::parallel_for(
                "populate", Kokkos::RangePolicy<>(0, 10752),
                KOKKOS_LAMBDA(const int i) {
                    // calculation goes here
                    // indices and values should be calculated from i
                    r(42, 43) += 0.013;
                    r(42, 46) += 0.02;
                });
        }

        auto h_r = Kokkos::create_mirror_view(r);
        Kokkos::deep_copy(h_r, r);

        std::cout << batches * 10752 * 0.013 << " should equal " << h_r(42, 43) << std::endl;
        std::cout << batches * 10752 * 0.02 << " should equal " << h_r(42, 46) << std::endl;
    }
    Kokkos::finalize();
}
asked Jan 26 '26 by Neil Butcher

1 Answer

It looks like you have enough memory to set up a separate accumulator matrix for each CUDA core (about 700M values all together).

Each of the 11M threads can determine the index of the core it's running on and accumulate into the matrix assigned to that core without using any synchronization. Then you can do a parallel merge of the ~15000 result matrices.

answered Jan 28 '26 by Matt Timmermans