CUDA thread mapping

Question

Being a beginner with CUDA my naive model of the following simple kernel

__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        d_sum[blockIdx.x] += d_a[threadIdx.x * 2 + blockIdx.x];
    }
}

was that given a 1024 x 2 matrix d_a, a call like

sumColumns <<<2, 1024>>> (d_a, d_sum, 1024);

would go through all rows, columns (asynchonously) and then generate the sum of each column (dsum is a 2 x 1 vector initialised to 0). Of course, nothing like this happens. The result varies from run to run; it never produces the sum of the whole column. And in most case the sum is well below +/-1024 (I initialise the matrix to have a column of 1s and another column of -1s).

The model I have in my mind is that, after the kernel call, the GPU generates 1024 threads in each block and then given the current thread the if statement does the correct mapping. Can someone provide a correct description of what is going on?

PkDrew · Accepted Answer

It's a race condition, you have multiple threads trying to write the same memory.

To resolve, you may

Simply use atomicAdd per thread, see below:

__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    if (blockIdx.x < 2 && threadIdx.x < n) {
        atomicAdd(&d_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
}

Or, to achieve a better performance, use shared memory to accumulate partial sums per block and then perform a single atomic update, this way, the contention exists only between two blocks, instead of, say, 32 threads per block.


__global__ void sumColumns(float* d_a, float* d_sum, int n) {
    // Shared memory for partial sums, same logic as your blockIdx.x < 2
    __shared__ float s_sum[2]; 
    
    if (threadIdx.x == 0) {
        s_sum[0] = 0.0f;
        s_sum[1] = 0.0f;
    }
    __syncthreads();

    if (blockIdx.x < 2 && threadIdx.x < n) {
        // Reduces global memory atomic contention. 
        // Atomicity lies only within blocks.
        atomicAdd(&s_sum[blockIdx.x], d_a[threadIdx.x * 2 + blockIdx.x]);
    }
    __syncthreads();

    // Single atomic update per block
    if (threadIdx.x == 0) {
        atomicAdd(&d_sum[blockIdx.x], s_sum[blockIdx.x]);
    } 
}

CUDA thread mapping

Tags:

c++

cuda

thread-synchronization

designer0588

1 Answers

PkDrew

Recent Activity

Donate For Us

CUDA thread mapping

Tags:

c++

cuda

thread-synchronization

designer0588

1 Answers

PkDrew

Related questions

Recent Activity

Donate For Us