CUDA thread block size 1024 doesn't work (cc=20, sm=21)

Question

My running config:
- CUDA Toolkit 5.5
- NVidia Nsight Eclipse edition
- Ubuntu 12.04 x64
- CUDA device is NVidia GeForce GTX 560: cc=20, sm=21 (as you can see I can use blocks up to 1024 threads)

I render my display on the iGPU (Intel HD Graphics), so I can use the Nsight debugger.

However, I ran into some weird behaviour when I set the number of threads per block above 960.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {

    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Here I run my kernel
    mytest<<<1, 961>>>();

    err = cudaGetLastError();

    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit (EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();

    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit (EXIT_FAILURE);
    }

    printf("Done\n");
    return 0;
}

And... it doesn't work. The problem seems to be the float division on the last line of the kernel. Whenever I divide by a float, the code compiles, but the kernel launch fails. The error at runtime is:

error=too many resources requested for launch

Here's what I get in debug, when I step it over:

warning: Cuda API error detected: cudaLaunch returned (0x7)

Build output using -Xptxas -v:

12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all 
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt  -x cu -o  "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used

../src/vectorAdd.cu(7): warning: variable "a" was set but never used

ptxas info    : 4 bytes gmem, 8 bytes cmem[14]
ptxas info    : Function properties for _ZN4dim3C1Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info    : Function properties for _Z6mytestv
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info    : Function properties for _ZN4dim3C2Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu

Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o  "block_size_test"  ./src/vectorAdd.o   
Finished building target: block_size_test


12:57:41 Build Finished (took 1s.659ms)

When I add the -keep option, the compiler generates a .cubin file, but I can't read it to find out the values of smem and reg as described in this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources-/. Apparently the file has a different format nowadays.

Therefore I'm forced to use 256 threads per block, which is probably not a bad idea, considering this .xls: CUDA_Occupancy_calculator.

Anyway. Any help will be appreciated.

Michael M. · Accepted Answer

I filled in the CUDA Occupancy Calculator with the current information:

Compute capability : 2.1
Threads per block : 961
Registers per thread : 34
Shared memory : 0

I got 0% occupancy, limited by the register count.
If you set the number of threads to 960, you get 63% occupancy, which explains why it works.

Try limiting the register count to 32 and setting the number of threads to 1024, which gives 67% occupancy.

To limit the register count, use the following option: nvcc [...] --maxrregcount=32
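If you would rather not cap registers for the whole compilation unit, a per-kernel alternative is the __launch_bounds__ qualifier. You can also read the register count at runtime with cudaFuncGetAttributes instead of digging through the .cubin file. The following is a minimal sketch, not tested on a GTX 560. The kernel name mytest_bounded, the output buffer and the check helper are hypothetical and only serve the illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel from the question.
// __launch_bounds__(1024) tells the compiler that this kernel may be
// launched with up to 1024 threads per block, so it has to keep the
// register use per thread low enough for such a launch.
__global__ void __launch_bounds__(1024) mytest_bounded(float *out) {
    float b = 1.0F;
    out[threadIdx.x] = b / 1.0F;
}

// Hypothetical helper: print the CUDA error and exit on failure.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    // Ask the runtime how many registers the compiled kernel uses and
    // how many threads per block it can be launched with.
    cudaFuncAttributes attr;
    check(cudaFuncGetAttributes(&attr, mytest_bounded), "cudaFuncGetAttributes");
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    float *d_out = NULL;
    check(cudaMalloc((void **)&d_out, 1024 * sizeof(float)), "cudaMalloc");

    mytest_bounded<<<1, 1024>>>(d_out);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    check(cudaFree(d_out), "cudaFree");
    check(cudaDeviceReset(), "cudaDeviceReset");
    printf("Done\n");
    return 0;
}
```

If attr.maxThreadsPerBlock comes back lower than the block size you want, the launch will fail in the same way as in the question. Forcing the register count down may cause spills to local memory, so check the -Xptxas -v output again after the change.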
