## Floating-Point Numbers Are Precise!

*Categories: programming. Originally posted 2010-05-22 09:47:31.*

(What they lack is *accuracy*!)

There's a great deal of misunderstanding about floating-point numbers. Better men than I have tried to dispel the haze of confusion that surrounds them. But let me make a start.

One source of confusion is when people think of floating-point numbers
as approximations. That's muddy thinking. A floating-point number is
*not* approximately anything; it's an exact value.
The problem is that it may not be the value you wanted.

If I compute
1.0 / 3.0, the number I get won't be 1/3. It'll be very *close* to
1/3, as close as the computer could make it, but it won't be exactly 1/3.
Your computation has *introduced error.*

Now, there are many, many numbers that floating-point formats can represent
exactly. For instance, a 32-bit IEEE floating-point number can represent
some fractions exactly, like 0.5 or 1.5.
It can also represent
every integer from 0 to 2^24 (16,777,216).
But above that... you start to get into trouble.

I've written a short, reasonably-portable C program to demonstrate
this behavior. It performs addition in a loop using two C `float`
variables. A C `float` is usually a single-precision (32-bit)
IEEE floating point number, and that's the result we'll examine
here. First, the program:

```c
#include <stdio.h>

void sum(float f, int stride, int iterations)
{
    double d = f;
    printf("from %f, add %d to it each time, %d times:\n",
           f, stride, iterations);
    for (int i = 0; i < iterations; i++) {
        f += stride;
        d += stride;
        printf("[%2d] %f", i, f);
        if (d != f)
            printf(" * error, off by %+d!", (int)(f - d));
        printf("\n");
    }
    printf("\n");
}

int main(int argc, char *argv[])
{
    sum(16777210.0f, 1, 10);
    sum(16777210.0f, 2, 10);
    sum(16777211.0f, 2, 10);
    sum(33554430.0f, 3, 10);
}
```

Each call to `sum()` prints a stanza of output. We'll discuss each stanza in turn. Stanza 1:

```
from 16777210.000000, add 1 to it each time, 10 times:
[ 0] 16777211.000000
[ 1] 16777212.000000
[ 2] 16777213.000000
[ 3] 16777214.000000
[ 4] 16777215.000000
[ 5] 16777216.000000
[ 6] 16777216.000000 * error, off by -1!
[ 7] 16777216.000000 * error, off by -2!
[ 8] 16777216.000000 * error, off by -3!
[ 9] 16777216.000000 * error, off by -4!
```

A 32-bit IEEE floating-point number simply cannot represent the number 16777217. When you compute 16777216.0f + 1.0f, it rounds to the nearest number it *can* represent. When the true result falls exactly halfway between the two nearest representable values, as it does here, the default rounding mode picks the one whose last significand bit is zero (round-to-nearest-even). So you get... 16777216.0f, which in this case is the number you started with.

When you add two floating-point numbers together,
if one is a great deal smaller than the other,
the result can easily be just the larger number
unchanged. (How much smaller? For 32-bit floats,
it's if their ratio
is 1:2^24 or smaller.)
So this computation gets stuck in the mud,
going nowhere.

Can you guess what the next call to
`sum()` will do? It's almost the
same, but it adds 2 each time:

```
from 16777210.000000, add 2 to it each time, 10 times:
[ 0] 16777212.000000
[ 1] 16777214.000000
[ 2] 16777216.000000
[ 3] 16777218.000000
[ 4] 16777220.000000
[ 5] 16777222.000000
[ 6] 16777224.000000
[ 7] 16777226.000000
[ 8] 16777228.000000
[ 9] 16777230.000000
```

No problem, smooth sailing. A 32-bit IEEE floating-point number can precisely represent every *even* integer from 16777216 (2^24) all the way up to 33554432 (2^25). After that, it can only represent numbers evenly divisible by 4, up to 2^26. And after that, only numbers evenly divisible by 8, up to 2^27. I'm sure you can see the pattern emerging here.

What about our next stanza, starting at 16777211 and adding two each time?

```
from 16777211.000000, add 2 to it each time, 10 times:
[ 0] 16777213.000000
[ 1] 16777215.000000
[ 2] 16777216.000000 * error, off by -1!
[ 3] 16777218.000000 * error, off by -1!
[ 4] 16777220.000000 * error, off by -1!
[ 5] 16777222.000000 * error, off by -1!
[ 6] 16777224.000000 * error, off by -1!
[ 7] 16777226.000000 * error, off by -1!
[ 8] 16777228.000000 * error, off by -1!
[ 9] 16777230.000000 * error, off by -1!
```

You should have guessed this one. When we compute 16777215.0f + 2.0f, it rounds down (again), to 16777216.0f. But the format *can* represent every even number above that, so the error remains constant.

Time for our final stanza. What happens when we start with 33554430 and add 3 each time?

```
from 33554430.000000, add 3 to it each time, 10 times:
[ 0] 33554432.000000 * error, off by -1!
[ 1] 33554436.000000
[ 2] 33554440.000000 * error, off by +1!
[ 3] 33554444.000000 * error, off by +2!
[ 4] 33554448.000000 * error, off by +3!
[ 5] 33554452.000000 * error, off by +4!
[ 6] 33554456.000000 * error, off by +5!
[ 7] 33554460.000000 * error, off by +6!
[ 8] 33554464.000000 * error, off by +7!
[ 9] 33554468.000000 * error, off by +8!
```

I bet you didn't expect that! Again, above 33554432 a 32-bit IEEE floating-point number can only represent the numbers divisible by 4. When you try to compute a value that isn't divisible by 4, you're going to introduce error.

Take a closer look at the series above. Most of the time,
adding 3 actually adds 4, because the result rounds to the
nearest representable number. But the very first time, it
rounds down instead. So the *first* addition is off by
-1 (we added 2), and every addition after that is off by
+1 (we added 4), which is why the accumulated error climbs
by one at each step.

What's the solution? Any good engineer will tell you: use double-precision (64-bit) numbers everywhere. Those can represent every integer from 0 to 2^53 (over 9 quadrillion), and jillions of real numbers in between. They're so precise that in practice you probably won't have any problems.

But what if this still isn't precise enough for you?
At that point you should switch to Python and start
using the `Fraction` class from its standard-library
`fractions` module. A Fraction really does understand
that the value 1/3 is 1 divided by 3.