My Floating-point problem - Trial in C++/Python

Question

In what follows, IEEE-754 Double-precision floating-point format is taken for granted to be used.

Python: "...almost all machines use IEEE 754 binary floating-point arithmetic, and almost all platforms map Python floats to IEEE 754 binary64 'double precision' values."

C++: associates double with IEEE-754 double-precision.

Machine epsilon is s.t. fl(1+epsilon)>1.

Using double precision, formally epsilon = 2^-52, but mostly implemented (because of rounding-to-nearest) as epsilon = 2^-53.

In Python (Spyder), when I do:

i = 1.0 + 2**-53

print(i)

>> 1.0

C++ version:

#include <iostream>
#include <cmath>
#include <iomanip>
    
int main() {
        
    double i = std::pow(2.0, -53);
        
    double j = 1.0 + i;
        
    std::cout << j;
    
    return 0;
        
}

>> 1.0

Furthermore, when I do in Python:

i = 1.0 + 2**-52 + 2**-53

print(i)

>> 1.0000000000000004

C++ version:

#include <iostream>
#include <cmath>
#include <iomanip>
        
int main() {
                
    double i = std::pow(2.0, -52);
                
    double j = 1.0 + i;
                
    double k = j + std::pow(2.0, -53);
                
    std::cout << std::setprecision(17) << k;
        
    return 0;

}
    
>> 1.0000000000000004

where the order of addition is from left to right so that the magic of the non-associativity does not come into play, i.e.

i = 1.0 + 2**-52 + 2**-53 <=> i = (1.0 + 2**-52) + 2**-53.

The weird (to me) thing happens here. Firstly, 1 + 2**-52 is stored (in registers) as exactly the Double Precision format equivalent of 1 + 2*epsilon, i.e.

0 + 01111111111 + 000...01

Moreover, 2**-53 is stored as:

0 + 01111001010 + 000...00

If we write these in the form (mantissa)*(2**exponent), 1 + 2**-52 is (1.0000...01)*(2**0), and 2**-53 is (1.0000...00)*(2**-53), where the mantissa (1.xxxx...xx) length is 53 (including implicit 1 bit). A bitwise addition requires the same exponent, thus I shift the mantissa of the smaller one (i.e. 2**-53) so that it has exponent 0:

(1.0000...00)*(2**-53) -> (0.|52 zeros here|10000...00)*(2**0)

So that the addition is like:

1.0000...010 (51 zero digits after the radix, then a '1' and '0' making a total of 53 digits)
0.0000...001 (52 zero digits after the radix + '1' digit making a total of 53 digits)
+
------------
1.0000...011 (53 digits after the radix)

Now, mantissa should be 53 in total, but it is 54 above, thus it should be rounded.

My question begins here: Is the reason why both of these programming languages give 1.0000000000000004 output when I do 1.0 + 2**-52 + 2**-53 because tie-to-even rule is implemented so that 1.0000...011 is rounded to 1.0000...10, which is esentially 1.0000000000000004 up to 16 digits precision? or is it (completely) something else & I made a mistake in the calculations etc.?

Sorry if it seems as a bit of overkill or overthinking on a simple subject like this, however, it bothers me for days and I could not figure out the reason why or could not verify my thoughts. Any answer and comment is appreciated.

Tim Peters · Accepted Answer

Yes, it's entirely about round-to-nearest/even. Python's float.hex() can show you the bits directly:

>>> 1.0 + 2**-52 + 2**-53
1.0000000000000004
>>> _.hex()
'0x1.0000000000002p+0'

Although Python and C have little to do with this: it's almost certainly a result of how your CPU/FPU implement float addition (they almost certainly implement 754-style nearest/even rounding in hardware by default). `` Note too that associativity doesn't matter in this specific example. 1.0 + 2**-52 and 2**-52 + 2**-53 are both exactly representable on their own.

user2357112 supports Monica · Answer

Yeah, ties round to even. That's the default IEEE 754 rounding mode. IEEE 754 defines 4 possible rounding modes:

Round to nearest, breaking ties toward even.
Round toward zero.
Round toward positive infinity.
Round toward negative infinity.

Picking anything but the default is an extremely specialized operation that most languages don't even support, though.

My Floating-point problem - Trial in C++/Python

Tags:

c++

python

floating-point

ieee-754

ConventionalProgrammer

2 Answers

Tim Peters

user2357112 supports Monica

Recent Activity

Donate For Us

My Floating-point problem - Trial in C++/Python

Tags:

c++

python

floating-point

ieee-754

ConventionalProgrammer

2 Answers

Tim Peters

user2357112 supports Monica

Related questions

Recent Activity

Donate For Us