 

Why Does this CUDA Code Loop Indefinitely?

Tags: c++, cuda, gpu

The following code runs indefinitely, as kernel_loop is stuck in an infinite loop. Shouldn't it be the case that the two small kernels get launched concurrently?

#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>


int *d_x;
static constexpr int N = 1;

__global__ void init_buff(int *buff) {
  for (int i = 0; i < N; i++) {
    buff[i] = i;
  }
}

__global__ void kernel_loop(volatile int *buff) {
  while (true) {
    __threadfence();
    if (buff[0]) {
      break;
    }
  }
}

__global__ void kernel_write(volatile int *buff) {
  buff[0] = 1;
  __threadfence();  // added per Ext3h's comment (see EDIT below)
}

int main() {
  cudaMalloc(reinterpret_cast<void **>(&d_x), sizeof(int) * N);

  init_buff<<<1, 1>>>(d_x);

  cudaDeviceSynchronize();

  cudaStream_t stream1, stream2;
  cudaStreamCreateWithFlags(&stream1, cudaStreamDefault);
  cudaStreamCreateWithFlags(&stream2, cudaStreamDefault);

  cudaDeviceSynchronize();


  kernel_loop<<<1, 1, 0, stream1>>>(d_x);
  kernel_write<<<1, 1, 0, stream2>>>(d_x);

  cudaDeviceSynchronize();

  return 0;
}

Additionally, if I change the order of the launches like so:

  kernel_write<<<1, 1, 0, stream2>>>(d_x);
  kernel_loop<<<1, 1, 0, stream1>>>(d_x);

the program runs to completion.

Furthermore,

  cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);

also causes an infinite loop.

For the record, __nanosleep-ing the looping thread also doesn't work.

EDIT:

As per Ext3h's comment, added a __threadfence() to the writer kernel.

Asked Oct 26 '25 by Elvir Crncevic


1 Answer

Shouldn't it be the case that the two small kernels get launched concurrently?

Not necessarily; there is a long list of other constraints that may prohibit it.

The most important one when running this on Windows: the driver batches kernel launches into coarser command buffers that get sent to the GPU. Grids that end up in the same buffer and don't have a dependency on each other (i.e. they belong to different streams) have a chance to run in parallel. Grids that end up in different buffers are very unlikely to overlap.
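As a side check (a minimal sketch, not part of the code in the question), you can query the device properties to see whether the GPU is capable of overlapping kernels at all, and whether it runs under the TCC driver model on Windows, which is not subject to WDDM command-buffer batching:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);

  // 1 means the device is capable of overlapping kernels;
  // the driver is still free not to overlap them.
  std::printf("concurrentKernels: %d\n", prop.concurrentKernels);

  // On Windows, 1 means the GPU runs under the TCC driver model
  // instead of WDDM, so the batching described above does not apply.
  std::printf("tccDriver:         %d\n", prop.tccDriver);

  return 0;
}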

If you run your example on an idle GPU, then right after cudaDeviceSynchronize the driver will not wait for the second kernel dispatch before sending a work package to the GPU, as it tries to get the GPU back to work as soon as possible. Batching only happens under load.
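If that batching is what holds the second launch back, one workaround sometimes used on WDDM is to nudge the driver into flushing the pending command buffer right after each launch, for example with cudaStreamQuery. A sketch of the launch section under that assumption (it does not guarantee the grids will overlap):

  kernel_loop<<<1, 1, 0, stream1>>>(d_x);
  cudaStreamQuery(stream1);  // ask the driver to submit pending work now

  kernel_write<<<1, 1, 0, stream2>>>(d_x);
  cudaStreamQuery(stream2);  // flush again so kernel_write is not held back

  cudaDeviceSynchronize();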

You can use GPUView to inspect the actual work packages that go to the GPU and how they end up being scheduled and blocked. It will not show you individual grids, but it will help you understand why something isn't running.

Nsight can also tell you something similar, but it will usually not show you why two grids did not overlap. In return, it lets you see timing at grid granularity.

Answered Oct 29 '25 by Ext3h


