I'm having trouble understanding the output of this program
int main()
{
    double x = 1.8939201459282359e-308;
    double y = 4.9406564584124654e-324;
    printf("%23.16e\n", 1.6*y);
    printf("%23.16e\n", 1.7*y);
    printf("%23.16e\n", 1.8*y);
    printf("%23.16e\n", 1.9*y);
    printf("%23.16e\n", 2.0*y);
    printf("%23.16e\n", x + 1.6*y);
    printf("%23.16e\n", x + 1.7*y);
    printf("%23.16e\n", x + 1.8*y);
    printf("%23.16e\n", x + 1.9*y);
    printf("%23.16e\n", x + 2.0*y);
}
The output is
9.8813129168249309e-324
9.8813129168249309e-324
9.8813129168249309e-324
9.8813129168249309e-324
9.8813129168249309e-324
1.8939201459282364e-308
1.8939201459282364e-308
1.8939201459282369e-308
1.8939201459282369e-308
1.8939201459282369e-308
I'm using IEEE arithmetic. The variable y holds the smallest possible IEEE number. The first five prints show a number which is twice y as I would expect. What is confusing me is that the next five prints show different numbers. If 1.6*y is the same as 2.0*y then how can x + 1.6*y be different from x + 2.0*y?
You say that your compiler is Visual C++ 2010 Express. I do not have access to this compiler, but I understand that it generates programs that initially configure the x87 CPU to use 53 bits of precision, in order to emulate IEEE 754 double-precision computations as closely as possible.
Unfortunately, “as closely as possible” is not always close enough. Historical 80-bit floating-point registers can have their significand limited in width for the purpose of emulating double-precision, but they always retain a full range for the exponent. The difference shows in particular when manipulating denormals (like your y).
My explanation would be that in printf("%23.16e\n", 1.6*y);, 1.6*y is computed as a 80-bit reduced-significand full-exponent number (it is thus a normal number), then converted to IEEE 754 double-precision (resulting in a denormal), then printed.
On the other hand, in printf("%23.16e\n", x + 1.6*y);, x + 1.6*y is computed with all 80-bit reduced-significand full-exponent numbers (again all intermediate results are normal numbers), then converted to IEEE 754 double-precision, then printed.
This would explain why 1.6*y prints the same as 2.0*y but has a different effect when added to x. The number that is printed is a double-precision denormal. The number that is added to x is a 80-bit reduced-significand full-exponent normal number (not the same one).
Other compilers, like GCC, do not configure the x87 FPU to manipulate 53-bit significands. This can have the same consequences (in this case x + 1.6*y would be computed with all 80-bit full significand full exponent numbers, and then converted to double-precision for printing or storing in memory). In this case, the issue is noticeable even more often (you do not need to involve denormals or infinite numbers to notice differences).
This article by David Monniaux contains all the details you may wish for and more.
To get rid of the problem (if you consider it to be one), find the flag that tells your compiler to generate SSE2 instructions for floating-point. These implement exactly IEEE 754 semantics for single- and double-precision.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With