
Nested kernels in CUDA

Tags:

arrays

cuda

CUDA currently does not allow nested kernels.

To be specific, I have the following problem: I have N data-points, each M-dimensional. To process each of the N data-points, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot create a kernel that calls the three kernels. Therefore, I have to process each data-point serially.

One solution is to write one big kernel containing the functionality of all three kernels, but I think it will be sub-optimal.

Can anyone suggest how streams can be used to process the N data-points in parallel, while retaining the three smaller kernels?

Thanks.

Prasanna asked Nov 20 '25 21:11



1 Answer

Well, if you want to use streams... you will want to create N streams:

cudaStream_t *streams;
int i;
/* one stream per data point */
streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
for(i=0; i<N; i++)
{
    cudaStreamCreate(&streams[i]);
}

Then for the ith data point, you want to use cudaMemcpyAsync for transfers:

/* argument order is (dst, src, count, kind, stream) */
cudaMemcpyAsync(dst, src, count, kind, streams[i]);

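and launch your kernels with all four configuration parameters (sharedMemory can be 0, of course). Kernels launched into the same stream run in the order they were issued, so the sequence for each data point is preserved: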

kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );

and of course cleanup:

for(i=0; i<N; i++)
{
    cudaStreamDestroy(streams[i]);
}
free(streams);
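
Putting the pieces above together, here is a minimal sketch of the whole pattern. The kernel bodies, the function name process_points, and the block size of 128 are placeholders I made up for illustration; substitute your own three kernels. It assumes each data point is M floats and that all N points are stored contiguously on the host.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical stand-ins for the three kernels in the question.
// Each one works on a single M-dimensional data point.
__global__ void kernel_1(float *x, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) x[j] += 1.0f;
}

__global__ void kernel_2(float *x, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) x[j] *= 2.0f;
}

__global__ void kernel_3(float *x, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) x[j] -= 3.0f;
}

// Runs the three kernels, in order, on each of the n data points
// (m floats each), in place. Returns 0 on success, nonzero on failure.
int process_points(float *h_data, int n, int m)
{
    size_t bytes = (size_t)m * sizeof(float);
    int nThreads = 128;                          // assumed block size
    int nBlocks = (m + nThreads - 1) / nThreads; // enough blocks to cover m
    float *d_data = NULL;
    int created = 0;
    int i;

    cudaStream_t *streams = (cudaStream_t *)malloc(n * sizeof(cudaStream_t));
    if (streams == NULL) return 1;
    if (cudaMalloc((void **)&d_data, n * bytes) != cudaSuccess) {
        free(streams);
        return 1;
    }

    // One stream per data point, as described above.
    for (i = 0; i < n; i++) {
        if (cudaStreamCreate(&streams[i]) != cudaSuccess) break;
        created++;
    }

    if (created == n) {
        for (i = 0; i < n; i++) {
            float *h = h_data + (size_t)i * m;
            float *d = d_data + (size_t)i * m;
            // Everything issued to streams[i] runs in issue order;
            // work in different streams is allowed to overlap.
            cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[i]);
            kernel_1<<<nBlocks, nThreads, 0, streams[i]>>>(d, m);
            kernel_2<<<nBlocks, nThreads, 0, streams[i]>>>(d, m);
            kernel_3<<<nBlocks, nThreads, 0, streams[i]>>>(d, m);
            cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[i]);
        }
    }

    // Wait for all streams to finish, then pick up any launch error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaGetLastError();

    for (i = 0; i < created; i++) cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    free(streams);
    return (created == n && err == cudaSuccess) ? 0 : 1;
}
```

For the copies to actually overlap with kernel execution, the host buffer should be page-locked, e.g. allocated with cudaMallocHost instead of malloc; with ordinary pageable memory the code still gives correct results, but the transfers may not run asynchronously. If N is large, creating one stream per data point is probably more than you need, and a small pool of streams reused round-robin (streams[i % poolSize]) is a common alternative.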
jmilloy answered Nov 25 '25 00:11



