As title, I would like to know the right execution order in case we have a 3d block
I think to remember that I read already something regarding it, but it was some time ago, I dont remember where but it was coming by someone who didnt look so reliable..
Anyway I would like to have some confirmations about it.
Is it as the following (divided in warps)?
[0, 0, 0]...[blockDim.x, 0, 0] - [0, 1, 0]...[blockDim.x, 1, 0] - (...) - [0, blockDim.y, 0]...[blockDim.x, blockDim.y, 0] - [0, 0, 1]...[blockDim.x, 0, 1] - (...) - [0, blockDim.y, 1]...[blockDim.x, blockDim.y, 1] - (...) - [blockDim.x, blockDim.y, blockDim.z]
CUDA architecture limits the numbers of threads per block (1024 threads per block limit). The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM. Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes.
numba.cuda.blockDim. The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”). numba.cuda.blockIdx. The block indices in the grid of threads launched a kernel.
By thread indexing we are getting a unique number for each thread and each block in a grid. 1D grid of 1D blocks. Indices given in RED color are the unique numbers for each block and each thread. threadId = (blockIdx.x * blockDim.x) + threadIdx.x. Let's check the equation for Thread (2,0) in Block (1,0).
Yes, that is the correct ordering; threads are ordered with the x dimension varying first, then y, then z (equivalent to column-major order) within a block. The calculation can be expressed as
int threadID = threadIdx.x + 
               blockDim.x * threadIdx.y + 
               (blockDim.x * blockDim.y) * threadIdx.z;
int warpID = threadID / warpSize;
int laneID = threadID % warpsize;
Here threadID is the thread number within the block, warpID is the warp within the block and laneID is the thread number within the warp.
Note that threads are not necessarily executed in any sort of predicable order related to this ordering within a block. The execution model guarantees that threads in the same warp are executed "lock-step", but you can't infer any more than that from the thread numbering within a block.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With