I am new to CUDA. I was using cuda to find the dot prod of float vectors and I came across a float point addition issue in cuda. In essence following is the simple kernel. I'm using -arch=sm_50 So the basic idea is for the thread_0 to add the values of vector a.
__global__ void temp(float *a, float *b, float *c) {
if (0 == threadIdx.x && blockIdx.x == 0 && blockIdx.y ==0 ) {
    float xx = 0.0f;
        for (int i = 0; i < LENGTH; i++){
            xx += a[i];
        }
        *c = xx;
    }
}
When I initialize 'a' with 1000 elements of 1.0 I get the desired result of 1000.00
but when I initialize 'a' with 1.1, I should get 1100.00xx but istead, I am getting 1099.989014. The cpu implementation simply yields 1100.000024
I am trying to understand what the issue here! :-(
I even tried to count the number of 1.1 elements in the a vector and that yeilds 1000, which is expected. and I even used atomicAdd and still I have the same issue.
would be very grateful if someone could help me out here!
best
EDIT: Biggest concern here is the disparity of the CPU result vs GPU result! I understand floats can be off by some decimal points. But the GPU error is very significant! :-(
It is not possible to represent 1.1 exactly using IEEE-754 floating point representation. As @RobertCrovella mentionned in his comment, the computation performed on the CPU does not use the same IEEE-754 settings than the GPU one.
Indeed, 1.1 in floating point is stored as 0x3F8CCCCD = which is 1.10000002384185. Performing the sum on 1000 elements, the last bits gets lost in rouding, one bit for the first addition, two bits after four, etc, until 10 bits after 1000. Depending on rounding mode, you may truncate the 10 bits for the last half of operations, hence ending up summing 0x3F8CCC00 which is 1.09997558.
The result from CUDA divided by 1000 is 0x3F8CCC71, which is consistent with a calculation in 32 bits.
When compiling on CPU, depending on optimization flags, you may be using fast math, which uses the internal register precision. It can be, if not specifying vector registers, using the x87 FPU which is 80 bits precision. In that occurence, the computation would read 1.1 in float which is 1.10000002384185, add it 1000 times using higher precision, hence not loosing any bit in rounding resulting in 1100.00002384185, and display 1100.000024 which is its round to nearest display.
Depending on compilation flags, the actual equivalent computation on Cpu may require enforcement of 32 bits floating-point arithmetics which can be done using addss of the SSE2 instruction set for example.
You can also play with /fp: option or -mfpmath with the compiler and explore issued instructions. In that case assembly instruction fadd is the 80-bits precision addition.
All of this has nothing to do with GPU floating-point precision. It is rather some misunderstanding of the IEEE-754 norm and the legacy x87 FPU behaviour.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With