Floating point number tricks

I was reading meap 14 of "Rust in Action" and the code i'm talking about is in page 37.

Thanks for the pointer! I've added a link to this discussion to the "Live Book" forum on the chapter.

Yes it is.

I feel some context is called for. That statement was made nearly 40 years ago. By a senior project manager to a very junior project member. It was a different world then. Patronizing was a regular mode of speaking in that situation and nobody took much offence at it. My English teacher in school was far worse.

Besides, John Stuck, for that was the project managers name, had very good humor and could say such things in a very nice way. We learned a lot from him.

True. And nobody did say that. The statement was "If you think you need floating point to solve the problem then you don't understand the problem".

Which as it happens was true. The in house designed and built processors we were using did not have hardware floating point support. The application, 3D radar, did not require floating point. The language we were using, Coral, supported fixed point arithmetic. All in all using float would have resulted in a massive performance drop for no benefit.

Perhaps. Although to be fair, John Stuck he did not leave it at that, he would take time to explain such things. After all we had to get the job done. Sitting around waiting to get older would not have done it.


To be fair, the "doubling down" part was not from the project manager I was quoting. That is from me.

I believe both statements are true to a large extent.

Yes indeed.

Also common is people using floating point when it is not needed. Which is what my quotation is all about.


I think this is 100% correct. As a formalist, I often have to grapple with the logical/formal content of various programming idioms, and floating point is one of the worst offenders when it comes to code that doesn't say what it means. On the one hand, it's entirely well defined by IEEE 754, but if you actually take this seriously they look nothing like real numbers, they are a sequence of add/round/multiply/round operations, and people want to persist in the illusion that they are real numbers, and this makes everything worse, because then you get compiler flags like -ffast-math that say "let's pretend these are real numbers, even though they aren't and we know they aren't, and use this to do transformations" which is a sure-fire way to get all manner of correctness (and safety!) bugs.

At the programmer level, of course we again sell the "floats are basically real numbers" myth that is again harmful because it teaches people not to think about all the approximation errors. There is no good answer to "what tolerance should this equality check get" because that depends on the actual values that went through the code, and for some values the answer may well be "+infinity may not be
enough". So people ignore the problem, you get sloppy thinking, and it is extremely rare to see floating point code that I can actually call "correct".

By comparison, ints are easy. You get wrapping arithmetic mod 2^N, done. This has lots of nice properties like associativity and so on, and if you make sure you don't ever go above 2^N then you are working with "God's integers", and no problems can arise. (Of course the compiler throws a spanner in the works sometimes but it's not too hard to achieve this level of correctness.)

One representation that I wish computers had better support for is rational arithmetic. This is where you store a rational number as a pair (numerator, denominator) of integers and do math on that using the integer ALU. You can do all the nice things with reals here: associativity and commutativity just work, but the finite range of the numerator and denominator are more problematic than you might think, because it's not uncommon for adding a bunch of rational numbers to cause the denominator to grow exponentially; and if you approximate when you hit the end of the range then you are back to all the usual problems with floats.

In summary, if at all possible, use integral types. If you actually are dealing with an integral quantity like number of cents, then use n/100 representation behind a newtype. Use floats only if it doesn't matter if you get the wrong answer.


Yeah. Except one does not always get wrapping, modulo, arithmetic. In C/C++ overflow is wrapping or undefined behavior, depending on if it is a signed or unsigned integer. I forget which is which off hand. Which is part of the problem of course.

I have seen enough bugs cause by integer overflow that I would be very happy if all languages bailed with an exception on all overflows.

Yes, that was the spanner I was talking about. C/C++ try to do a similar thing pretending that signed integers are unbounded using UB to cover their asses, and it leads to the same kind of sloppy thinking on the part of programmers if you forget this. But at least Rust gives you the tools to handle wrapping correctly using things like checked_add or panic-on-overflow.

Indeed in neither case are we working with the platonic integers/reals we'd like to work with, but at least the scope of the problem is somewhat bounded and straightforward when it comes to ints. You have to worry about overflow, and C is not great at giving you the tools to check for this, but that's it. With floats literally every operation is subtly wrong and you don't know by how much.

1 Like

So in other words associativity doesn't work with fixed-size rationals, the same as it doesn't with floating point. (Aside: floating point has commutativity, so there's no advantage there either.)

Certainly if you have a reasonable quantum then you should just use an integer type. ("Reasonable" because you shouldn't measure lengths in integer multiples of the Planck length, even if it might technically work with a large enough integer.) But "doesn't matter if you get the wrong answer" is just FUD.

I think a modicum of Fear, Uncertainty and Doubt is in order when a programmer is using floats.


1 Like

It has commutativity for non-NaN, but okay. Associativity works for fixed-size rationals if you bail when you run out of precision, but I don't really think it is a great solution for all of the things floats do. But I don't think that floats are either. I think we are simply in a "state of sin" right now wrt floating point operations, where we do things we know are not licensed by the model, and we can't come back from that without sacrificing something we'd rather not (which we weren't really getting in the first place).

I would very much like to be able to say something more concrete about the things that float guarantees, but at least the way most people use floats I just can't say anything other than "this code produces a floating point answer that may be any finite number, an infinity, or some NaN; also maybe it trapped somewhere". For a single floating point operation, you can say something better than that, but once the floats get around the error bounds quickly go completely out of control and that's what you get. So I really am not exaggerating here when I say to prepare for the worst.

Now it is possible to do floats right; this is what is usually called "interval arithmetic", where you keep a pair of floats with opposite rounding modes surrounding the real answer. Then at least if you get a finite result you can say that your answer is somewhere in that range, which is more or less what people wanted from floats in the first place. But the bounds on this do tend to grow more than people would like, unless you use a good stable numerical algorithm. You can get bounds that grow more slowly if you assume rounding errors are random, but unfortunately this is simply not true.

The few people who really care about error bounds, who probably are already somewhat skilled in numerical analysis, can turn to crates like honestintervals to track error bounds throughout their computations.

I would prefer awareness and knowledge. Doing things with floats deserves no more "fear" than doing something like (x + y) / 3 with integers -- which, like floating point, cannot be distributed because of finite-precision and rounding issues.

As long as one accepts the fact that these operations are not the same as real number operations, there is no problem. There is no theorem that says floor((x + y) / 3) = floor(x / 3) + floor(y / 3), so there is no reason to expect distributivity here. Clearly the exact same argument applies to floats, but there the tendency to ignore the rounding operation is much stronger.

I certainly don't advocate fear here, but healthy skepticism is exactly what leads me to the maximally pessimistic assumption about the result of FP operations that I mentioned above.

I concur.

To my mind, fear, uncertainty and doubt are common symptoms of incomplete knowledge and awareness.

Also there is a couple of orders more to know and be aware of when using floats than using integers. For example all of this: What Every Computer Scientist Should Know About Floating-Point Arithmetic

I don't know about you but that is a lot more than I am I going to be able to bear in mind all the time. Hence the nervousness.

Now couple that with the fact that most programmers are not even aware there is all that to be aware of.

It's not as unreasonable as you might think: you only need 141 bits to store the circumference of the Earth that way and the diameter of the observable universe fits in 206 bits.

More seriously, a 63 bit integer has a dynamic range of 189 dB. That's massive for most real-world applications.

Except it's also unreasonable to state either of those to that level of accuracy. For example, NASA only lists the radius of the earth to 7 sigfigs, and of course an f64 offers about 16. So there's essentially no value to the extra precision from storing in Planck units. Not to mention that doing this would also imply using values like g ≈ 10-51, which would make even basic calculations really awkward -- distance fallen in 1s would use ridiculously-large temporaries.

Without even taking advantage of the exponent, an f64 has a dynamic range of 159 dB. That's also massive for most real-world applications.

Unfortunately it doesn't end at representation, if your algorithms exacerbate errors (which is not uncommon in simulations). Representing a single value in any of these formats is not usually a problem; the real problem is maintaining this input accuracy level all the way to the end of the computation.

But I think that moving to higher accuracy is generally just papering over a not well understood problem and hoping that you use enough overkill to make it not relevant. If you actually analyze the constraints of a given problem to find the optimal representation, I think the answer will never be floating point, except that it has a "hardware subsidy" making it sometimes appealing in high performance environments. Floating point representation is very much a jack-of-all-trades, master of none situation.

1 Like

It sucks that all options are crappy but that's the reality we live in, and sticking one's head in the sand isn't solving the problem. The best we can do is quantify the errors to know exactly how bad it is, and decide after the fact whether that was good enough.

Floats have hardware support, and the reasons for this are pretty good: after all, they are okayish at a lot of problems and hardware manufacturers are trying to satisfy a lot of uses. And most users are not in a position to get custom hardware for their problem, hence why folks like me are formalizing the IEEE spec despite knowing how terrible it is. Welcome to engineering indeed.

I can't suggest a better general solution because the solutions are domain dependent. The problems where floats work best are probably in things like games where "drift" cause by floating point error can be reinterpreted as intentional, and high performance is more important than getting the right answer (the code itself gets to define what "right" means). It still causes problems here when you want things to be internally coherent, though.

1 Like

You keep speaking in vagueries and there's nothing to learn from that. How much performance benefit would we see if there were hardware support for rationals? Why is that needed when we already have hardware support for integers, and rationals are just pairs of integers? What other representations have you used, and what are their downsides? What domain are you working in where the correctness bounds of float operations are unable to satisfy your standards, and why exactly do they fail?

I know that Forth uses rational arithmetic by default, and it gets flak sometimes for the performance penalty that this incurs (despite the fact that algorithmically there is no reason for it to be slower). I think it's about a 2x difference, not intolerable but quite measurable. Maybe there are some SIMD tricks you can use to improve this but I doubt it since the two components interact - you really want the FPU (RPU?) to do the mixing of values itself to get comparable performance.

Another option I haven't mentioned is bignum rational arithmetic, where you just failover to larger allocations when you run out of precision. This is great when you don't want to think about approximations at all until the end, and it's trivially easy to verify correct. The obvious downside is the allocation and memory accesses involved, and the exponential growth problem is even bigger here because you can literally eat up your memory in a few operations if you aren't careful. You can keep this under control by periodically reducing the fractions to lowest terms, but this isn't free either.

Luckily, most of the time I can live by my own advice earlier and avoid floats entirely in favor of integers. But I have done numerical analysis work before, and tracking the approximations is not easy. An interval approximation approach is a good first attempt, but the bounds can grow exponentially under some usage patterns, so you can't just deploy it naively, you have to design the algorithm to actually have bounded errors. There is no silver bullet here.

Also, there are no "correctness bounds of float operations", in the sense of a bound on the error independent of the values being operated on, that's the problem. With integer division, you know that for any inputs the answer is at most 1 away from the real division (unless it is division by 0 or INT_MAX / -1), but for floats the answer has bounded relative error unless it's a denormal in which case it could be much bigger relative error, or if it underflows in which case the error is infinite. Oops.

You might be interested in this verification of the Lorenz Attractor in Isabelle/HOL (warning: lots of math), which shows some nice techniques for dealing with verification of floating point code using "zonotope" enclosures that are something like projections of high dimensional cubes.

1 Like

They also make your precision depend on the magnitude of a number, which means that choice of units had an impact that may be unexpected.

Having looked into them at one point, I wouldn't strongly object to them, but they don't seem like a strong winner either. Approximating real numbers with a finite number of bits requires compromise and not every compromise is best for everyone.

1 Like