Floating point number tricks

What's an example that fails to be commutative? (Other than which value of NaN is used (i.e. quiet vs signaling, and what the payload is), which AFAIK cannot be relied on anyways.)

Are (+0.0) + (-0.0) and (+0.0) * (-0.0) commutative? What are the signs of the resulting floating-point zeros?


Yes, both are commutative. In the default round-to-nearest mode, (+0.0) + (-0.0) is +0.0 regardless of operand order, and since the sign of a product is the XOR of the operand signs, (+0.0) * (-0.0) is -0.0 either way.
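A quick check of the zero-sign rules, assuming IEEE 754 semantics in the default round-to-nearest mode:

```rust
fn main() {
    let p = 0.0_f64;
    let n = -0.0_f64;
    // Adding zeros of opposite sign yields +0.0 in round-to-nearest, either order.
    assert!((p + n).is_sign_positive());
    assert!((n + p).is_sign_positive());
    // The product's sign is the XOR of the operand signs, so it's -0.0 either way.
    assert!((p * n).is_sign_negative());
    assert!((n * p).is_sign_negative());
}
```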


It's a bit more nuanced, unfortunately. The facts above do not imply that a*x + b*y is commutative in the two products, since the generated code may use a fused multiply-add instruction internally. Likewise, anything that can introduce an intermediate rounding from 80-bit to 64-bit float, depending on how the code is compiled, can make the results differ. Practical floating-point computations simply cannot be trusted to be predictable based on apparent commutativity, even in the simplest of cases.

Not in Rust. It was proposed (RFC #2686), but didn't happen. If you want fma you need to call mul_add yourself.

EDIT: The portable-simd project is looking at maybe exporting different versions of the types that will allow transformations like multiplying by a reciprocal instead of dividing, contracting with fma, etc. So there may be versions of this that happen in the future.
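To illustrate the difference, `f64::mul_add` performs the multiply and add with a single rounding, which can produce a different result from the separately rounded expression:

```rust
fn main() {
    let a = 1.0_f64 + f64::EPSILON; // 1 + 2^-52
    let b = 1.0_f64 - f64::EPSILON; // 1 - 2^-52
    // The exact product is 1 - 2^-104; rounded separately it becomes exactly 1.0.
    let plain = a * b - 1.0;        // 0.0: the intermediate product was rounded
    let fused = a.mul_add(b, -1.0); // -2^-104: kept by the single rounding
    assert_eq!(plain, 0.0);
    assert!(fused < 0.0);
}
```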


There exist subsets of floats where computation is guaranteed to produce an exact value. For example, the integers. If you are working within such a subset, then exact equality makes sense.

In my ijson crate, I use exact equality in several places to determine whether a particular conversion is lossless. For example, if I want to know whether a particular f64 value is representable in an f32, then the easiest thing is to cast it back and forth and check for exact equality.
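A minimal sketch of that round-trip check (an illustration, not the actual ijson code):

```rust
/// Returns true if `x` survives a round trip through f32,
/// i.e. the f64 -> f32 conversion is lossless.
fn is_lossless_as_f32(x: f64) -> bool {
    // Note: NaN compares unequal to itself, so NaN reports false here.
    (x as f32) as f64 == x
}

fn main() {
    assert!(is_lossless_as_f32(0.5));  // an exact power of two survives
    assert!(!is_lossless_as_f32(0.1)); // 0.1 rounds differently in f32 and f64
}
```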

There's also a fun trick you can do to determine if an integer can be exactly represented by a float:

fn can_represent_as_f64(x: u64) -> bool {
    // f64 has a 53-bit significand; the other 64 - 53 = 11 bits must be zero at the ends.
    x.leading_zeros() + x.trailing_zeros() >= 11
}

Floating-point is somewhat counter-intuitive, but not pure magic. It's just that some of our assumptions from integer arithmetic don't apply to floating-point math.

I used to have the same question as you: why don't compilers implement float equality as 'difference less than epsilon' by default? What I found is that doing so would (1) make things more confusing in some situations, (2) break the rare cases where we still need the 'accurate' comparison, and (3) waste the native comparison instructions that many hardware targets provide, which compilers can translate == into directly.

Precision loss in float math can have various causes. For example, 0.1 in decimal is not exactly representable as a binary float. Think of 1/3 in decimal: the number exists, but you can never write down its exact value with finitely many digits (0.33333...). This kind of loss comes from the binary format itself.

Another common source of precision loss is how float numbers are represented underneath: sign * mantissa * 2^exponent. This lets them hold large but imprecise numbers like 111111..0000000000000000 (in binary), and it's why they're called 'floating point'. When a very large number is added to a very small one, precision is lost because the exponents have to be shifted to match before the mantissas can be added. This leads to a surprising fact: not every i64 is exactly representable in f64 (for example 1241523563262624633), even though the maximum f64 is about 1.8 × 10^308. That's also why some float operations are commutative but not associative.
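Both effects are easy to observe directly (the i64 value is the one from the paragraph above):

```rust
fn main() {
    // Not every i64 fits exactly in f64: only 53 significand bits are available.
    let n: i64 = 1241523563262624633;
    let round_trip = n as f64 as i64;
    assert_ne!(n, round_trip);

    // Addition is commutative but not associative: at 1e16 the spacing
    // between adjacent f64 values is 2, so adding 1.0 twice is absorbed,
    // while adding 2.0 once is not.
    let big = 1e16_f64;
    assert_ne!((big + 1.0) + 1.0, big + (1.0 + 1.0));
}
```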

Even beyond precision, there are other 'dark corners' in floating-point math: special values for the infinities and NaN (it's really, really not a number, so you can't use == to compare with it; in fact NaN != NaN), signed zeros, rounding modes, and exceptions. Like the law: boring, but sometimes helpful.
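A few of those dark corners, demonstrated:

```rust
fn main() {
    // NaN compares unequal to everything, itself included.
    assert!(f64::NAN != f64::NAN);
    assert!(!(f64::NAN == f64::NAN));
    // Signed zeros compare equal under ==, despite differing bit patterns.
    assert!(0.0_f64 == -0.0_f64);
    // Infinity is a real value that sorts above every finite float.
    assert!(f64::INFINITY > f64::MAX);
}
```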


Also because some of our knowledge of the properties of real numbers do not apply to floating point math. For example, real arithmetic is associative and distributive but floating-point arithmetic is not.


It's important to recognize that there's no one-size-fits-all way to compare, even using epsilon. The page linked by @ZiCog earlier covers two "good" ways (along with a handful of "bad" ways) but all of them have tradeoffs. Not to mention that doing repeated arithmetic on floating-point numbers can cause the error to exceed any threshold if you happen to be unlucky enough that all the rounding errors point in the same direction.

In general, you need some information about your application to know what kind of comparison is "close enough". Sometimes only the exact same bit pattern is correct. Sometimes it's sufficient to know whether the numbers are within 1% of each other, or whether they're in the same "bucket" by some arbitrary standard. Sometimes you need extra logic around 0 or when the numbers are of opposite signs; sometimes you know by construction that isn't possible, so it would be a waste of time or even incorrect to try to account for it.

That is, to me, the most compelling argument against doing any kind of epsilon comparison by default: it's often not appropriate. In any given situation where you think you need floating point equality, you probably need either bitwise equality -- which is the simplest option -- or you actually need to compare the absolute difference against a threshold that is determined by your use case. Any global threshold, even one based on epsilon, is probably wrong more often than it is right.


Decades ago the project manager I was working for reviewed a new team member's code and told him:
"If you think you need floating point to solve the problem then you don't understand the problem." He was kind of blunt that way.

On overhearing that I took it very practically, after all the processors we were using on that embedded system did not support floating point and the language we were using, Coral 66, had great fixed point type support.

As the decades rolled by I realized the corollary to that statement: "If you really do need floating point to solve the problem, then you have a problem you won't understand."

Threads about floating point, like this one, pop up all over the place from time to time. They always remind me of my old project manager.


Note that even if epsilon comparison were globally appropriate (which it's not, as you say), it's not a legal thing to do in a PartialOrd implementation anyway.

PartialOrd requires transitivity, that a=b ∧ b=c ⇒ a=c. But there's no epsilon for which that holds in general, since you can make a counterexample with a + ⅔ϵ = b = c - ⅔ϵ.
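Concretely, with a hypothetical epsilon-equality (the helper and EPS here are made up for illustration):

```rust
const EPS: f64 = 1e-9;

// Hypothetical epsilon-equality: "equal" when within EPS of each other.
fn approx_eq(x: f64, y: f64) -> bool {
    (x - y).abs() < EPS
}

fn main() {
    let b = 1.0_f64;
    let a = b - 2.0 * EPS / 3.0;
    let c = b + 2.0 * EPS / 3.0;
    assert!(approx_eq(a, b));  // a "equals" b
    assert!(approx_eq(b, c));  // b "equals" c
    assert!(!approx_eq(a, c)); // but a does not "equal" c: transitivity fails
}
```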

I would say this is overstated. I agree that for normal CS101 stuff you don't need floating-point. But if you have a scale-invariant problem, then floating point is completely appropriate -- and that's not uncommon for anything doing geometry. If it doesn't matter (other than constants, of course) whether you measured things in metres or feet, then floating-point is fine.

Of course, many people think they have a scale-invariant problem when they actually have a translation-invariant problem. My favourite example of someone running into that:

That's also a great example because it's also a demonstration that the floating-point was deterministic for them. It took some work, but like many RTS games before them, they had it producing consistent results even cross-platform:

At this point everything was going along fine. Contraptions were running exactly the same on Windows and Macs. Development was cruising along. Parts were being implemented. Floating point didn't seem to be causing a divergence to occur.


Personally I prefer when a language sticks to some standardised way of dealing with floats. It might not be intuitive but at least it's standardised so I only have to learn the behaviour once and not for each language out there.

Current Rust behaviour for a == b suits me. I wouldn't want Rust to apply epsilon in the comparison. If it did, should it then use epsilon for > and < too? Would it be possible for a == b and a < b and a > b to be all true?

My preference is to implement any epsilon calculation explicitly. Rust's traits make it more natural than some other languages. Also if I want total ordering of floats, it can be done using total_cmp.
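For example, sorting with total_cmp gives the IEEE 754 total order, in which -0.0 sorts before +0.0 and NaN has a defined place:

```rust
fn main() {
    let mut v = vec![3.0_f64, f64::NAN, -0.0, 1.0, 0.0];
    // total_cmp implements IEEE 754 totalOrder; a positive NaN sorts last.
    v.sort_by(|a, b| a.total_cmp(b));
    assert!(v[0] == 0.0 && v[0].is_sign_negative()); // -0.0 comes first
    assert!(v[1] == 0.0 && v[1].is_sign_positive()); // then +0.0
    assert!(v[4].is_nan());                          // NaN at the end
}
```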

The current Rust implementation also makes it impossible to use floating point types as keys in sets and dicts by default, which I think is a good thing. Especially when you consider NaN as a key.

Perhaps I should say that my "corollary" was devised partially tongue in cheek.

It's just the simple notion that for many situations floating point is not even needed. Thus if you think you need floats you don't understand the problem. But if the problem really does call for using floats then you have to understand all of this: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf

Of course most of the time most programmers don't worry about all of that. Until they find something weird going on...


May I know what's the book?


It may be tongue in cheek, but it's also patronizing. You can't just say "if you think you need float, you don't", because that implies floats are completely useless, not just that you probably don't need them. It's like saying "you'll understand when you're older" to someone who's currently willing and capable of understanding something.


I'm not sure how "tongue in cheek" it really is if you then double-down.

Floating-point is appropriate whenever you care about relative, not absolute differences. This is incredibly common in real life.

If you're making a baguette and you're a tablespoon short on yeast, you have a major problem. But if you're a tablespoon short on flour you won't even notice. That's because you actually care about relative differences -- within 10% is generally fine.

If your GPS is 5 minutes off going to the corner store you'll be annoyed, but if it's 5 minutes off for a Paris-to-Berlin drive you probably wouldn't notice. Because, again, you care about relative error in the estimate.
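A sketch of a comparison along those relative-difference lines (approx_eq_rel is a made-up helper, not from any crate):

```rust
// Hypothetical helper: equal within a relative tolerance of the larger magnitude.
fn approx_eq_rel(a: f64, b: f64, rel_tol: f64) -> bool {
    (a - b).abs() <= rel_tol * a.abs().max(b.abs())
}

fn main() {
    // One part per billion off a billion passes at a 1e-6 tolerance...
    assert!(approx_eq_rel(1.0e9, 1.0e9 + 1.0, 1e-6));
    // ...while a 2x difference between tiny numbers does not.
    assert!(!approx_eq_rel(1.0e-9, 2.0e-9, 1e-6));
}
```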

All approximations of real numbers have their issues; it's not like fixed-point is perfect either. Just like i32 isn't an integer. Programming is hard; that doesn't mean the problem is misunderstood.


This is a really nice thread - great resources and insights here. Floating Point is one of those roller-coaster rides of "I think I get it, no, I know nothing, I understand more, ..." so common throughout computer science. I found this fascinating article by Hans-J. Boehm, about the math library (in Java) he created to reduce the number of "bug" reports in the Android Calculator.

The library is described in a bit more detail in this paper.

I like this summary of the problem:

"We really want to ask an alternate, less well-studied, question: Can we decide equality for recursive reals computed from integer constants by a combination of the following operations, which we will refer to as "calculator operations":
1. The four basic arithmetic operations, and square roots.
2. The sin, cos, and tan trigonometric functions and their inverses.
3. Exponential and (natural) logarithm functions."

Which he says is mostly solved in the paper, "The Identity Problem for Elementary Functions and Constants"

I don't know of any ports of this work to other platforms or languages, but it seems like the kind of thing the Rust community will do, eventually.

I was reading MEAP 14 of "Rust in Action" and the code I'm talking about is on page 37. I don't know if there is a more recent release of the book that fixes this.


If only IEEE 754 were a thing of the past and a better format like posits [https://posithub.org/khub_doc] were used in modern hardware and software instead. It wouldn't solve equality testing, of course, but fewer errors propagate with that format.

Posits are slightly more precise for their storage size, but don't solve catastrophic cancellation (nothing finite-sized can), so really don't fundamentally change anything. Numerically-stable algorithms are still required.