Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CUDA thread mapping

Being a beginner with CUDA my naive model of the following simple kernel

__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        d_sum[blockIdx.x] += d_a[threadIdx.x * 2 + blockIdx.x];
    }
}

was that given a 1024 x 2 matrix d_a, a call like

sumColumns <<<2, 1024>>> (d_a, d_sum, 1024);

would go through all rows, columns (asynchonously) and then generate the sum of each column (dsum is a 2 x 1 vector initialised to 0). Of course, nothing like this happens. The result varies from run to run; it never produces the sum of the whole column. And in most case the sum is well below +/-1024 (I initialise the matrix to have a column of 1s and another column of -1s).

The model I have in my mind is that, after the kernel call, the GPU generates 1024 threads in each block and then given the current thread the if statement does the correct mapping. Can someone provide a correct description of what is going on?

like image 800
designer0588 Avatar asked May 16 '26 21:05

designer0588


1 Answers

It's a race condition, you have multiple threads trying to write the same memory.

To resolve, you may

  • Simply use atomicAdd per thread, see below:
__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        atomicAdd(&d_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
}
  • Or, to achieve a better performance, use shared memory to accumulate partial sums per block and then perform a single atomic update, this way, the contention exists only between two blocks, instead of, say, 32 threads per block.

__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    // Shared memory for partial sums, same logic as your blockIdx.x < 2
    __shared__ float s_sum[2]; 
    
    if (threadIdx.x == 0) {
        s_sum[0] = 0.0f;
        s_sum[1] = 0.0f;
    }
    __syncthreads();

    if (blockIdx.x < 2 && threadIdx.x < n) {
        // Reduces global memory atomic contention. 
        // Atomicity lies only within blocks.
        atomicAdd(&s_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
    __syncthreads();

    // Single atomic update per block
    if (threadIdx.x == 0) {
        atomicAdd(&d_sum[blockIdx.x], s_sum[blockIdx.x]);
    } 
}
like image 149
PkDrew Avatar answered May 19 '26 10:05

PkDrew