There has been much discussion about how to choose the #blocks & blockSize, but I still missing something. Many of my concerns address this question: How CUDA Blocks/Warps/Threads map onto CUDA Cores? (To simplify the discussion, there is enough perThread & perBlock memory. Memory limits are not an issue here.)
kernelA<<<nBlocks, nThreads>>>(varA,constB, nThreadsTotal);
1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize.  True?
2) An SM can only execute one kernel at a time. That is all HWcores of that SM are executing only kernelA. (Not some HWcores running kernelA, while others run kernelB.) So if I have only one thread to run, I'm "wasting" the other HWcores. True?
3)If the warp-scheduler issues work in units of warpSize (32 threads), and each SM has 32 HWcores, then the SM would be full utilized.  What happens when the SM has 48 HWcores?  How can I keep all 48 cores full utilized when the scheduler is issuing work in chunks of 32?  (If the previous paragraph is true, wouldn't it be better if the scheduler issued work in units of HWcore size?)
4) It looks like the warp-scheduler queues up 2 tasks at a time. So that when the currently-executing kernel stalls or blocks, the 2nd kernel is swapped in. (It is not clear, but I'll guess the queue here is more than 2 kernels deep.) Is this correct?
5) If my HW has an upper limit of 512 threads-per-block (nThreadsMax), that doesn't mean the kernel with 512 threads will run fastest on one block. (Again, mem not an issue.) There is a good chance I'll get better performance if I spread the 512-thread kernel across many blocks, not just one. The block is executed on one or many SM's. True?
5a) I'm thinking the smaller the better, but does it matter how small I make nBlocks?  The question is, how to choose the value of nBlocks that is decent?  (Not necessarily optimal.)  Is there a mathematical approach to choosing nBlocks, or is it simply trial-n-err.
An SM may contain up to 8 thread blocks in total.
Threads are fundamentally executed in warps of 32 threads. Blocks are composed of 1 or more warps, and grid of 1 or more blocks.
Threads included within a block are divided into batches of 32 threads called warps. The warp is the scheduled unit, so the threads of the same block are executed in a given multiprocessor warp-by-warp in a SIMD (single instruction, multiple data) fashion.
A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2). Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
1) Yes.
2) CC 2.0 - 3.0 devices can execute up to 16 grids concurrently. Each SM is limited to 8 blocks so in order to reach full concurrency the device has to have at least 2 SMs.
3) Yes the warp schedulers select and issue warps at at time. Forget the concept of CUDA cores they are irrelevant. In order to hide latency you need to have high instruction level parallelism or a high occupancy. It is recommended to have >25% for CC 1.x and >50% for CC >= 2.0. In general CC 3.0 requires higher occupancy than 2.0 devices due to the doubling of schedulers but only a 33% increase in warps per SM. The Nsight VSE Issue Efficiency experiment is the best way to determine if you had sufficient warps to hide instruction and memory latency. Unfortunately, the Visual Profiler does not have this metric.
4) The warp scheduler algorithm is not documented; however, it does not consider which grid the thread block originated. For CC 2.x and 3.0 devices the CUDA work distributor will distribute all blocks from a grid before distributing blocks from the next grid; however, this is not guaranteed by the programming model.
5) In order to keep the SM busy you have to have sufficient blocks to fill the device. After that you want to make sure you have sufficient warps to reach a reasonable occupancy. There are both pros and cons to using large thread blocks. Large thread blocks in general use less instruction cache and have smaller footprints on cache; however, large thread blocks stall at syncthreads (SM can become less efficient as there are less warps to choose from) and tend to keep instructions executing on similar execution units. I recommend trying 128 or 256 threads per thread block to start. There are good reasons for both larger and smaller thread blocks. 5a) Use the occupancy calculator. Picking too large of a thread block size will often cause you to be limited by registers. Picking too small of a thread block size can find you limited by shared memory or the 8 blocks per SM limit.
Let me try to answer your questions one by one.
According to the NVIDIA Fermi Compute Architecture Whitepaper: "The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream".
Furthermore, the NVIDIA Keppler Architecture Whitepaper states: "Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle."
The "excess" cores are therefore used by scheduling more than one warp at a time.
The warp scheduler schedules warps of the same kernel, not different kernels.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With