Let's say I have three global arrays that have been copied to the GPU using cudaMemcpy. These global arrays in C have NOT been allocated with cudaHostAlloc (which would make the memory page-locked); they are plain global allocations.
int a[100],b [100],c[100];
cudaMemcpy(d_a,a,100*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_b,b,100*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_c,c,100*sizeof(int),cudaMemcpyHostToDevice);
Now I have 10 kernels which are launched in separate streams so that they run concurrently, and some of them use the global arrays copied to the GPU. These kernels run for, say, 1000 iterations. They don't have to copy anything back to the host during the iterations.
But the problem is that they are not executing in parallel; instead they run serially.
   cudaStream_t stream[3];
   for(int i=0;i<3;i++) cudaStreamCreate(&stream[i]);
   for(int i=0;i<100;i++){
       kernel1<<<blocks,threads,0,stream[0]>>>(d_a,d_b);
       kernel2<<<blocks,threads,0,stream[1]>>>(d_b,d_c);
       kernal3<<<blocks,threads,0,stream[2]>>>(d_c,d_a);
       cudaDeviceSynchronize();
   }
I can't understand why.
Kernels issued this way:
   for(int i=0;i<100;i++){
       kernel1<<<blocks,threads>>>(d_a,d_b);
       kernel2<<<blocks,threads>>>(d_b,d_c);
       kernal3<<<blocks,threads>>>(d_c,d_a);
       cudaDeviceSynchronize();
   }
will always run serially, since every launch goes into the same (default) stream. To get kernels to run concurrently, they must be issued to separate CUDA streams, and there are other requirements as well; read the documentation.
You'll need to create some CUDA streams, then launch your kernels like this:
   cudaStream_t stream1, stream2, stream3;
   cudaStreamCreate(&stream1); cudaStreamCreate(&stream2); cudaStreamCreate(&stream3);
   for(int i=0;i<100;i++){
       kernel1<<<blocks,threads,0,stream1>>>(d_a,d_b);
       kernel2<<<blocks,threads,0,stream2>>>(d_b,d_c);
       kernal3<<<blocks,threads,0,stream3>>>(d_c,d_a);
       cudaDeviceSynchronize();
   }
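For something that compiles on its own, here is a minimal sketch of the same launch pattern. The kernel incrementKernel, the helper runInStreams, the launch configuration <<<1, 128>>>, and the choice of one independent buffer per stream are all assumptions made for illustration; they are not the kernels from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 100
#define NUM_STREAMS 3

// Hypothetical kernel (not from the question): adds 1 to every element.
__global__ void incrementKernel(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

// Launches one small kernel per stream for `iterations` rounds.
// Returns element 0 of the first buffer afterwards, or -1 on any CUDA error.
// Error paths return early without freeing resources to keep the sketch short.
int runInStreams(int iterations)
{
    int host[N] = {0};              // ordinary pageable host memory
    int *d_buf[NUM_STREAMS];
    cudaStream_t stream[NUM_STREAMS];

    for (int s = 0; s < NUM_STREAMS; ++s) {
        if (cudaMalloc((void **)&d_buf[s], N * sizeof(int)) != cudaSuccess) return -1;
        if (cudaMemcpy(d_buf[s], host, N * sizeof(int),
                       cudaMemcpyHostToDevice) != cudaSuccess) return -1;
        if (cudaStreamCreate(&stream[s]) != cudaSuccess) return -1;
    }

    for (int i = 0; i < iterations; ++i) {
        // Each launch names a different stream and touches only its own
        // buffer, so the three kernels are allowed to overlap.
        for (int s = 0; s < NUM_STREAMS; ++s)
            incrementKernel<<<1, 128, 0, stream[s]>>>(d_buf[s], N);
        // Same per-iteration barrier as in the loop above.
        if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    }
    if (cudaGetLastError() != cudaSuccess) return -1;

    if (cudaMemcpy(host, d_buf[0], N * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return -1;

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
    return host[0];
}

int main()
{
    printf("element 0 after 100 iterations: %d\n", runInStreams(100));
    return 0;
}
```

Whether the three launches really overlap is not visible from the result; a timeline view in a profiler is the usual way to check.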
Actually witnessing concurrent kernel execution will also generally require kernels that have limited resource utilization. If a given kernel will "fill" the machine, due to a large number of blocks, or threads per block, or shared memory usage, or some other resource usage, then you won't actually witness concurrency; there's no room left in the machine.
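One rough way to check whether a launch leaves room is to compare its grid size with an estimate of how many blocks the device can hold at once. The sketch below uses cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor from the runtime API; smallKernel, gridLeavesRoom, and launchLeavesRoom are hypothetical names, and the "blocks per SM times SM count" figure is only an approximation of the capacity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only so the occupancy query has a target.
__global__ void smallKernel(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

// Pure host-side comparison: does a grid of `blocks` blocks use fewer
// block slots than the estimated device capacity?
bool gridLeavesRoom(int blocks, int blocksPerSM, int smCount)
{
    long capacity = (long)blocksPerSM * smCount;
    return blocks < capacity;
}

// Queries the device and reports whether a launch of smallKernel with the
// given configuration would leave room for other kernels (rough estimate).
bool launchLeavesRoom(int blocks, int threadsPerBlock, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    // Devices that report 0 here cannot overlap kernels at all.
    if (!prop.concurrentKernels) return false;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, smallKernel, threadsPerBlock, 0) != cudaSuccess)
        return false;

    printf("SMs: %d, resident blocks per SM: %d, grid: %d blocks\n",
           prop.multiProcessorCount, blocksPerSM, blocks);
    return gridLeavesRoom(blocks, blocksPerSM, prop.multiProcessorCount);
}

int main()
{
    bool room = launchLeavesRoom(4, 128, 0);
    printf("launch leaves room for other kernels: %s\n", room ? "yes" : "no");
    return 0;
}
```

If the grid is at or above the estimated capacity, a single kernel can already fill the machine, and launches in other streams will mostly wait for it.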
You may also want to review some of the CUDA sample codes, such as simpleStreams and concurrentKernels.