Floating-Point Numbers Are Precise!
(What they lack is accuracy!)
There's a great deal of misunderstanding about floating-point numbers. Better men than I have tried to dispel the haze of confusion that surrounds them. But let me make a start.
One source of confusion is when people think of floating-point numbers as approximations. That's muddy thinking. A floating-point number is not approximately anything; it's an exact value. The problem is that it may not be the value you wanted.
If I compute 1.0 / 3.0, the number I get won't be 1/3. It'll be very close to 1/3, as close as the computer could make it, but it won't be exactly 1/3. Your computation has introduced error.
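To see this concretely, here's a tiny sketch of my own (not part of the program discussed below) that prints the stored values with extra digits; the trailing digits show you got something very close to 1/3, but not 1/3 itself.

#include <stdio.h>

int main(void)
{
    float  f = 1.0f / 3.0f;   /* the nearest 32-bit float to 1/3 */
    double d = 1.0 / 3.0;     /* the nearest 64-bit double to 1/3 */
    printf("float:  %.20f\n", f);   /* typically prints 0.33333334326744079590 */
    printf("double: %.20f\n", d);   /* typically prints 0.33333333333333331483 */
    return 0;
}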
Now, there are many, many numbers that floating-point formats can represent exactly. For instance, a 32-bit IEEE floating-point number can represent some fractions exactly, like 0.5 or 1.5. It can also represent every integer from 0 up to 2^24, which is 16,777,216. But above that... you start to get into trouble.
I've written a short, reasonably-portable C program to demonstrate this behavior. It performs addition in a loop on a C float variable, with a double alongside it to track the correct value. A C float is usually a single-precision (32-bit) IEEE floating-point number, and those are the results we'll examine here. First, the program:
#include <stdio.h>

/* Start at f, add `stride` to it `iterations` times, and report whenever
   the float running total drifts away from the double kept alongside it. */
void sum(float f, int stride, int iterations)
{
    double d = f;   /* shadow accumulator with plenty of precision */
    printf("from %f, add %d to it each time, %d times:\n", f, stride, iterations);
    for (int i = 0; i < iterations; i++) {
        f += stride;
        d += stride;
        printf("[%2d] %f", i, f);
        if (d != f)
            printf(" * error, off by %+d!", (int)(f - d));
        printf("\n");
    }
    printf("\n");
}

int main(int argc, char *argv[])
{
    sum(16777210.0f, 1, 10);
    sum(16777210.0f, 2, 10);
    sum(16777211.0f, 2, 10);
    sum(33554430.0f, 3, 10);
}

Each call to sum() prints a stanza of output. We'll discuss each stanza in turn. Stanza 1:
from 16777210.000000, add 1 to it each time, 10 times:
[ 0] 16777211.000000
[ 1] 16777212.000000
[ 2] 16777213.000000
[ 3] 16777214.000000
[ 4] 16777215.000000
[ 5] 16777216.000000
[ 6] 16777216.000000 * error, off by -1!
[ 7] 16777216.000000 * error, off by -2!
[ 8] 16777216.000000 * error, off by -3!
[ 9] 16777216.000000 * error, off by -4!

A 32-bit IEEE floating-point number simply cannot represent the number 16777217. When you compute 16777216.0f + 1.0f, it rounds to the nearest number it can represent. If the result falls exactly halfway between two representable values, as it does here, the default IEEE rounding mode (round to nearest, ties to even) picks the neighbor whose significand is even; here that neighbor is the lower one. So you get... 16777216.0f, which in this case is the number you started with.
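One subtlety worth seeing for yourself: ties don't always break downward. In this little sketch of my own, 16,777,217 ties down to 16,777,216, while 16,777,219 ties up to 16,777,220, because in each case the neighbor with the even significand wins.

#include <stdio.h>

int main(void)
{
    float a = 16777216.0f;   /* 2^24: even significand */
    float b = 16777218.0f;   /* 2^24 + 2: odd significand */
    float a1 = a + 1.0f;     /* 16777217 is a tie: rounds DOWN to 16777216 */
    float b1 = b + 1.0f;     /* 16777219 is a tie: rounds UP to 16777220 */
    printf("%.1f\n%.1f\n", a1, b1);
    return 0;
}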
When you add two floating-point numbers together, if one is a great deal smaller than the other, the result can easily be just the larger number unchanged. (How much smaller? For 32-bit floats, it's when their ratio is 1:2^24 or smaller.) So this computation gets stuck in the mud, going nowhere.
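To make that absorption concrete, here's a little sketch of my own: adding 0.5 to 2^24 in single precision is swallowed entirely, while the same addition in double precision is not.

#include <stdio.h>

int main(void)
{
    float  big_f = 16777216.0f;   /* 2^24 */
    double big_d = 16777216.0;

    float  f = big_f + 0.5f;      /* 0.5 is far too small relative to 2^24: absorbed */
    double d = big_d + 0.5;       /* a double has plenty of spare precision here */

    printf("float:  %.1f\n", f);  /* prints 16777216.0 */
    printf("double: %.1f\n", d);  /* prints 16777216.5 */
    return 0;
}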
Can you guess what the next call to sum() will do? It's almost the same, but it adds 2 each time:
from 16777210.000000, add 2 to it each time, 10 times:
[ 0] 16777212.000000
[ 1] 16777214.000000
[ 2] 16777216.000000
[ 3] 16777218.000000
[ 4] 16777220.000000
[ 5] 16777222.000000
[ 6] 16777224.000000
[ 7] 16777226.000000
[ 8] 16777228.000000
[ 9] 16777230.000000

No problem, smooth sailing. A 32-bit IEEE floating-point number can precisely represent every even integer from 16777216 (2^24) all the way up to 33554432 (2^25). After that, it can only represent numbers evenly divisible by 4, up to 2^26. And after that, only numbers evenly divisible by 8, up to 2^27. I'm sure you can see the pattern emerging here.
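You can watch that pattern directly. This sketch of mine uses nextafterf() from <math.h> (standard since C99) to print the gap between each power of two and the next representable float above it; the spacing doubles at every step.

#include <stdio.h>
#include <math.h>

int main(void)
{
    float values[] = { 8388608.0f,     /* 2^23 */
                       16777216.0f,    /* 2^24 */
                       33554432.0f,    /* 2^25 */
                       67108864.0f,    /* 2^26 */
                       134217728.0f }; /* 2^27 */
    for (int i = 0; i < 5; i++) {
        float x = values[i];
        /* distance from x to the next float toward +infinity */
        float gap = nextafterf(x, INFINITY) - x;
        printf("at %11.1f the spacing is %.1f\n", x, gap);
    }
    return 0;   /* prints spacings 1.0, 2.0, 4.0, 8.0, 16.0 */
}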
What about our next stanza, starting at 16777211 and adding two each time?
from 16777211.000000, add 2 to it each time, 10 times:
[ 0] 16777213.000000
[ 1] 16777215.000000
[ 2] 16777216.000000 * error, off by -1!
[ 3] 16777218.000000 * error, off by -1!
[ 4] 16777220.000000 * error, off by -1!
[ 5] 16777222.000000 * error, off by -1!
[ 6] 16777224.000000 * error, off by -1!
[ 7] 16777226.000000 * error, off by -1!
[ 8] 16777228.000000 * error, off by -1!
[ 9] 16777230.000000 * error, off by -1!

You should have guessed this one. When we compute 16777215.0f + 2.0f, the true result is 16777217, which once again ties and rounds down to 16777216.0f. But every even number above that is representable, so the error remains constant.
Time for our final stanza. What happens when we start with 33554430 and add 3 each time?
from 33554430.000000, add 3 to it each time, 10 times:
[ 0] 33554432.000000 * error, off by -1!
[ 1] 33554436.000000
[ 2] 33554440.000000 * error, off by +1!
[ 3] 33554444.000000 * error, off by +2!
[ 4] 33554448.000000 * error, off by +3!
[ 5] 33554452.000000 * error, off by +4!
[ 6] 33554456.000000 * error, off by +5!
[ 7] 33554460.000000 * error, off by +6!
[ 8] 33554464.000000 * error, off by +7!
[ 9] 33554468.000000 * error, off by +8!

I bet you didn't expect that! Again, a 32-bit IEEE floating-point number can only represent the numbers above 33554432 that are divisible by 4. When you try to compute a value that isn't divisible by 4, you're going to introduce error.
Take a closer look at the series above. Most of the time, adding 3 actually adds 4, because the result rounds up to the nearest representable number. But the first time, it rounds down instead. So the first addition is off by -1 (we effectively added 2), and every subsequent addition is off by +1 (we effectively added 4). That per-step +1 first cancels the initial -1, which is why entry [1] shows no error, and then piles up, one per iteration.
What's the solution? Any good engineer will tell you: use double-precision (64-bit) numbers everywhere. Those can represent every integer from 0 up to 2^53, over 9 quadrillion, and jillions of real numbers in between. They're so precise that in practice you probably won't have any problems.
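Doubles do hit the same wall eventually, just much later. Here's a quick sketch of mine showing the analogous failure at 2^53 (9,007,199,254,740,992), the point where the integer spacing of a double grows to 2:

#include <stdio.h>

int main(void)
{
    double limit = 9007199254740992.0;    /* 2^53: last point with spacing 1 */
    double next  = limit + 1.0;           /* 2^53 + 1 is not representable; ties back down */
    printf("%.1f\n%.1f\n", limit, next);  /* both print 9007199254740992.0 */
    return 0;
}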
But what if this still isn't precise enough for you? At that point you should switch to Python and start using the Fraction class from its standard fractions module. The Fraction really understands that the value 1/3 is 1 divided by 3.