Floating point number tricks

It may be tongue in cheek, but it's also patronizing. You can't just say "if you think you need float, you don't" because that implies floats are completely useless, not just that you probably don't need them. It's like saying "you'll understand when you're older" to someone who's currently willing and capable of understand something.

1 Like

I'm not sure how "tongue in cheek" it really is if you then double-down.

Floating-point is appropriate whenever you care about relative, not absolute differences. This is incredibly common in real life.

If you're making a baguette and you're a tablespoon short on yeast, you have a major problem. But if you're a tablespoon short on flour you won't even notice. That's because you actually care about relative differences -- within 10% is generally fine.

If your GPS is 5 minutes off going to the corner store you'll be annoyed, but if it's 5 minutes off for a Paris-to-Berlin drive you probably wouldn't notice. Because, again, you care about relative error in the estimate.

All approximations of real numbers have their issues; it's not like fixed-point is perfect either. Just like i32 isn't an integer. Programming is hard; that doesn't mean the problem is misunderstood.


This is a really nice thread - great resources and insights here. Floating Point is one of those roller-coaster rides of "I think I get it, no, I know nothing, I understand more, ..." so common throughout computer science. I found this fascinating article by Hans-J. Boehm, about the math library (in Java) he created to reduce the number of "bug" reports in the Android Calculator.

The library is described in a bit more detail in this paper.

I like this summary of the problem:

"We really want to ask an alternate, less well-studied, question: Can we decide equality for recursive reals computed from integer constants by a combination of the following operations, which we will refer to as "calculator operations":
1. The four basic arithmetic operations, and square roots.
2. The sin, cos, and tan trigonometric functions and their inverses.
3. Exponential and (natural) logarithm functions."

Which he says is mostly solved in the paper, "The Identity Problem for Elementary Functions and Constants"

I don't know of any ports of this work to other platforms or languages, but it seems like the kind of thing the Rust community will do, eventually.

I was reading meap 14 of "Rust in Action" and the code i'm talking about is in page 37. I don't know if there is a more recent release of the book that fixes this


If only IEEE was a thing of the past and a better format like posits [https://posithub.org/khub_doc] were used in modern hardwares and softwares instead. It wouldn't solve equality testing of course but fewer errors propagate with this format.

Posits are slightly more precise for their storage size, but don't solve catastrophic cancellation (nothing finite-sized can), so really don't fundamentally change anything. Numerically-stable algorithms are still required.


I was reading meap 14 of "Rust in Action" and the code i'm talking about is in page 37.

Thanks for the pointer! I've added a link to this discussion to the "Live Book" forum on the chapter.

Yes it is.

I feel some context is called for. That statement was made nearly 40 years ago. By a senior project manager to a very junior project member. It was a different world then. Patronizing was a regular mode of speaking in that situation and nobody took much offence at it. My English teacher in school was far worse.

Besides, John Stuck, for that was the project managers name, had very good humor and could say such things in a very nice way. We learned a lot from him.

True. And nobody did say that. The statement was "If you think you need floating point to solve the problem then you don't understand the problem".

Which as it happens was true. The in house designed and built processors we were using did not have hardware floating point support. The application, 3D radar, did not require floating point. The language we were using, Coral, supported fixed point arithmetic. All in all using float would have resulted in a massive performance drop for no benefit.

Perhaps. Although to be fair, John Stuck he did not leave it at that, he would take time to explain such things. After all we had to get the job done. Sitting around waiting to get older would not have done it.


To be fair, the "doubling down" part was not from the project manager I was quoting. That is from me.

I believe both statements are true to a large extent.

Yes indeed.

Also common is people using floating point when it is not needed. Which is what my quotation is all about.


I think this is 100% correct. As a formalist, I often have to grapple with the logical/formal content of various programming idioms, and floating point is one of the worst offenders when it comes to code that doesn't say what it means. On the one hand, it's entirely well defined by IEEE 754, but if you actually take this seriously they look nothing like real numbers, they are a sequence of add/round/multiply/round operations, and people want to persist in the illusion that they are real numbers, and this makes everything worse, because then you get compiler flags like -ffast-math that say "let's pretend these are real numbers, even though they aren't and we know they aren't, and use this to do transformations" which is a sure-fire way to get all manner of correctness (and safety!) bugs.

At the programmer level, of course we again sell the "floats are basically real numbers" myth that is again harmful because it teaches people not to think about all the approximation errors. There is no good answer to "what tolerance should this equality check get" because that depends on the actual values that went through the code, and for some values the answer may well be "+infinity may not be
enough". So people ignore the problem, you get sloppy thinking, and it is extremely rare to see floating point code that I can actually call "correct".

By comparison, ints are easy. You get wrapping arithmetic mod 2^N, done. This has lots of nice properties like associativity and so on, and if you make sure you don't ever go above 2^N then you are working with "God's integers", and no problems can arise. (Of course the compiler throws a spanner in the works sometimes but it's not too hard to achieve this level of correctness.)

One representation that I wish computers had better support for is rational arithmetic. This is where you store a rational number as a pair (numerator, denominator) of integers and do math on that using the integer ALU. You can do all the nice things with reals here: associativity and commutativity just work, but the finite range of the numerator and denominator are more problematic than you might think, because it's not uncommon for adding a bunch of rational numbers to cause the denominator to grow exponentially; and if you approximate when you hit the end of the range then you are back to all the usual problems with floats.

In summary, if at all possible, use integral types. If you actually are dealing with an integral quantity like number of cents, then use n/100 representation behind a newtype. Use floats only if it doesn't matter if you get the wrong answer.


Yeah. Except one does not always get wrapping, modulo, arithmetic. In C/C++ overflow is wrapping or undefined behavior, depending on if it is a signed or unsigned integer. I forget which is which off hand. Which is part of the problem of course.

I have seen enough bugs cause by integer overflow that I would be very happy if all languages bailed with an exception on all overflows.

Yes, that was the spanner I was talking about. C/C++ try to do a similar thing pretending that signed integers are unbounded using UB to cover their asses, and it leads to the same kind of sloppy thinking on the part of programmers if you forget this. But at least Rust gives you the tools to handle wrapping correctly using things like checked_add or panic-on-overflow.

Indeed in neither case are we working with the platonic integers/reals we'd like to work with, but at least the scope of the problem is somewhat bounded and straightforward when it comes to ints. You have to worry about overflow, and C is not great at giving you the tools to check for this, but that's it. With floats literally every operation is subtly wrong and you don't know by how much.

1 Like

So in other words associativity doesn't work with fixed-size rationals, the same as it doesn't with floating point. (Aside: floating point has commutativity, so there's no advantage there either.)

Certainly if you have a reasonable quantum then you should just use an integer type. ("Reasonable" because you shouldn't measure lengths in integer multiples of the Planck length, even if it might technically work with a large enough integer.) But "doesn't matter if you get the wrong answer" is just FUD.

I think a modicum of Fear, Uncertainty and Doubt is in order when a programmer is using floats.


1 Like

It has commutativity for non-NaN, but okay. Associativity works for fixed-size rationals if you bail when you run out of precision, but I don't really think it is a great solution for all of the things floats do. But I don't think that floats are either. I think we are simply in a "state of sin" right now wrt floating point operations, where we do things we know are not licensed by the model, and we can't come back from that without sacrificing something we'd rather not (which we weren't really getting in the first place).

I would very much like to be able to say something more concrete about the things that float guarantees, but at least the way most people use floats I just can't say anything other than "this code produces a floating point answer that may be any finite number, an infinity, or some NaN; also maybe it trapped somewhere". For a single floating point operation, you can say something better than that, but once the floats get around the error bounds quickly go completely out of control and that's what you get. So I really am not exaggerating here when I say to prepare for the worst.

Now it is possible to do floats right; this is what is usually called "interval arithmetic", where you keep a pair of floats with opposite rounding modes surrounding the real answer. Then at least if you get a finite result you can say that your answer is somewhere in that range, which is more or less what people wanted from floats in the first place. But the bounds on this do tend to grow more than people would like, unless you use a good stable numerical algorithm. You can get bounds that grow more slowly if you assume rounding errors are random, but unfortunately this is simply not true.

The few people who really care about error bounds, who probably are already somewhat skilled in numerical analysis, can turn to crates like honestintervals to track error bounds throughout their computations.

I would prefer awareness and knowledge. Doing things with floats deserves no more "fear" than doing something like (x + y) / 3 with integers -- which, like floating point, cannot be distributed because of finite-precision and rounding issues.

As long as one accepts the fact that these operations are not the same as real number operations, there is no problem. There is no theorem that says floor((x + y) / 3) = floor(x / 3) + floor(y / 3), so there is no reason to expect distributivity here. Clearly the exact same argument applies to floats, but there the tendency to ignore the rounding operation is much stronger.

I certainly don't advocate fear here, but healthy skepticism is exactly what leads me to the maximally pessimistic assumption about the result of FP operations that I mentioned above.

I concur.

To my mind, fear, uncertainty and doubt are common symptoms of incomplete knowledge and awareness.

Also there is a couple of orders more to know and be aware of when using floats than using integers. For example all of this: What Every Computer Scientist Should Know About Floating-Point Arithmetic

I don't know about you but that is a lot more than I am I going to be able to bear in mind all the time. Hence the nervousness.

Now couple that with the fact that most programmers are not even aware there is all that to be aware of.

It's not as unreasonable as you might think: you only need 141 bits to store the circumference of the Earth that way and the diameter of the observable universe fits in 206 bits.

More seriously, a 63 bit integer has a dynamic range of 189 dB. That's massive for most real-world applications.