Watch out for NaNs

I worked on basic FPS camera controls today in a game I'm prototyping. While implementing WASD, I accidentally tried to normalize a zero-length vector and ended up setting the velocity to NaN, kicking my camera into the void. When these bugs appear while you're working on the code, I guess it isn't that bad (in the Stockholm syndrome sense). At least it was obviously incorrect in this case. Still, I would have preferred a backtrace pointing out the bug.

And it's my first day on this project. I wish I was joking. It's way too easy to get caught off guard by these issues, and suddenly one NaN becomes an avalanche of uselessness.

That really doesn't help. Division by zero is meaningless. Or do you propose that an operation is only meaningless until someone finally gives it meaning? If someone could prove that multiplying infinity by 0 gives back n, then yeah, sure, they could define n / 0 = infinity.

Eh, this is the behavior of a bizarre specification. As a corollary, using fixed-point math inherits all of the behavior of integer math, along with all of its poor precision. On the bright side, it wouldn't have obviously dubious behavior like allowing division by zero or allowing math operations with something that isn't even a number!

Do you have a provable lower bound on the length of the vector you're normalizing?

If yes, you can write a function like:

fn checked_normalize(v: Vector, min_length: f64) -> Vector

that will panic when the vector is shorter than the expected minimum length.
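
A minimal sketch of what I mean, using a plain array as a stand-in for whatever Vector type your math library actually provides:

    type Vector = [f64; 3];

    fn length(v: Vector) -> f64 {
        (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt()
    }

    // Panics if v is shorter than min_length; otherwise returns the unit vector.
    fn checked_normalize(v: Vector, min_length: f64) -> Vector {
        let len = length(v);
        assert!(
            len >= min_length,
            "expected a vector at least {min_length} long, got {len}"
        );
        [v[0] / len, v[1] / len, v[2] / len]
    }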

If you don't have a bound on the length of the vector, you should change your normalization function to one that is robust to zero length vectors, i.e. returns a unit length vector even in that case. Because if it can possibly happen in the game, then you don't want a panic and you don't want your camera to blow up.

For games, a more general design rule for robustness that I've been using: any time you are storing a value that will be used in the next frame, it's worth checking for NaN and doing something else (perhaps resetting to zero, or using the previous frame's value). That way, the game state never becomes irrecoverably unusable even in the case of a bug; at worst the failure shows up as one object glitching out rather than total collapse.

Of course, this isn't a replacement for getting the actual NaN-creating bug out of the code; just an error recovery strategy.
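
As a rough illustration of that rule (the function name here is made up, not from any particular engine):

    // Before storing a value that the next frame will read, make sure it is usable.
    // Falls back to the previous frame's value (or you could reset to zero) when
    // the new value is NaN, so one bad frame can't poison all future state.
    fn keep_usable(new_value: f64, previous_value: f64) -> f64 {
        if new_value.is_nan() {
            previous_value
        } else {
            new_value
        }
    }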

Apparently, it's zero! But that was unintentional.

The problem is that I need to write this code at all, instead of letting the compiler insert debug_assert for division by zero and arithmetic with NaN like it does for integer overflow. (Integer division by zero panics even in release mode!)
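
To be concrete, this is the kind of check I'd like inserted for me automatically, written out by hand (purely a sketch, not an existing API):

    // The division I wish the / operator performed for me in debug builds.
    fn checked_div(a: f64, b: f64) -> f64 {
        debug_assert!(!a.is_nan() && !b.is_nan(), "NaN fed into a division");
        debug_assert!(b != 0.0, "division by zero");
        a / b
    }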

I know why it doesn't do this; it would be highly controversial. Even though in practice, for the entirety of my career, I have never once wanted to divide by zero or propagate NaN through any of my calculations. Not in game dev, not in web dev, not in kernel dev, not in distributed systems. The only exception is in emulation, where the CPU being emulated implements IEEE-754 and you're held hostage by its specification. In all that time, I've never even wanted to exploit NaN for uses like NaN boxing. It has been my experience that if you ever see NaN, or infinity resulting from division by zero, that's a bug.

It isn't my function. It's provided by the game engine. I don't know if I am going to make it a new type yet, but that's an option. One which is admittedly error-prone.

But I do! At least while coding and testing in debug mode. Crucially, the camera will blow up anyway if NaN ever enters the arithmetic that updates its transformation matrix. If the player gets kicked into a void, that's just as bad as the program crashing. Or probably worse, because it's harder to debug and it doesn't present any information about where the error occurred.

The question I meant was: if you had correct code and hadn't introduced a bug, would you have a lower bound on the length of the vector? Do you know it's supposed to be at least 0.5 long, or something like that?

For instance, if the vector you're normalizing is a cross product, and the arguments to the cross product could be very close to co-linear, you could get a very short vector or even a zero vector (because the vectors really are co-linear, or because of rounding errors).

Well you can offload the task to the game library, but then the same comments will apply to what the library does.

It doesn't really matter whether it's you writing the code or some library dev writing the code when we're discussing what the semantics of floats should be (unless the library defines its own floats).

The robust normalization function I described wouldn't introduce NaNs. Like I said, it would return a unit vector, not a vector full of NaNs.

I see, excluding the bug, the lower bound on the vector length would be 1. For instance, pressing W would use the forward unit vector.

(The library does not define its own floats.)

In my opinion, the semantics of float operations should do the least surprising thing. In contrast, I love the borrow checker, because I can rely on it. I know it has my back. What we have today with floats allowing division by zero and viral NaNs is passive-aggressive at best, and adversarial at worst. I would happily "fight the float checker", because I would know it has my back and is helping me produce better code.

I don't want to define a concrete result for normalizing a zero vector. It seems like in the general case, one would want to handle this in application-specific ways. But in almost no case should the code continue on its merry way into the weeds.

I have in no way done any exhaustive research or explored every domain where floats are used. But a quick search on the terms "GPU NaN" turns up plenty of discussion in ML and game dev communities where the common suggestion is "avoid producing NaN in the first place". As wise as this suggestion is, it appears at first glance that avoiding NaN is its most common use.

Well, in this case you don't have to normalize anything, that vector is already normalized.

Let me give a more formal reason why it's fine to do that in almost every application (and similarly, to define 1/0 to be max float or whatever).

The reason is backward error analysis.

What is usually understood by "numerical stability" of a function f(a, b, c) is that the result it gives is the correct result for slightly perturbed input data, f(a + epsilon1, b + epsilon2, c + epsilon3). As long as that happens, we are happy.

In the case of normalizing the vector (0, 0, 0), we can maintain numerical stability if we just pretend that it's really (0, 0, epsilon) for some tiny epsilon, and normalize it to (0, 0, 1). This maintains numerical stability by the above definition. It doesn't break anything, because we already have to make the assumption that the input data is not perfectly accurate -- it's never perfectly accurate.
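
In code, that convention might look roughly like this (a sketch, again using plain arrays rather than any particular library's vector type):

    // Normalize, treating a zero vector as if it were (0, 0, epsilon):
    // by the backward error analysis argument above, (0, 0, 1) is then a valid answer.
    fn robust_normalize(v: [f64; 3]) -> [f64; 3] {
        let len = (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt();
        if len > 0.0 {
            [v[0] / len, v[1] / len, v[2] / len]
        } else {
            [0.0, 0.0, 1.0]
        }
    }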

That's exactly what I'm proposing!

Now I understand that NaN helps to debug stupid bugs like "I forgot to initialize things and they are all zeroes" and then you get a NaN a few operations later. I just think that this way of debugging is overrated. If you forgot to initialize things, your first unit test that covers that code will notice it.

Wait a minute. Looks like it breaks something. You end up with a vector that can be pointing in totally the wrong direction and yet looks like a reasonable result. Sounds pretty catastrophic to me.

In general I would like my programs to give up and die when I have asked them to do something they cannot do. At least with a NaN result the program comes back and says "sorry I can't do that" rather than lying to me with a wrong result.


You're just taking one statement from what I said out of context. Re-read the definition of backward error analysis I was talking about here. It's not a "totally" wrong direction; it's a numerically valid error, because it modified the inputs by a small epsilon, which, as I took pains to describe, is allowed by the rules I was talking about.

We were talking about the lower bound. :confused: If the user presses W and D together, the resulting vector is no longer a unit vector, since it will be the sum of the forward and right vectors. The normalization is necessary. But I think we're starting to get off-topic. I would like to focus on the concerning behavior of floats, and not on how I chose to implement this particular bug.

The reason I don't want to define normalization like this is because what I wanted in this particular case is a zero vector, not a unit vector, when the input is zero. Which is incompatible with the example you've just provided.

Yes, me too. :slight_smile: A panic would avoid producing NaN in the first place. It would also give me a very useful backtrace that points at the line of code where I made the mistake. No more "I wonder if I forgot to add is_nan() checks everywhere?". This is analogous to if err != nil in Go and if (ptr == NULL) in C. I really don't want the easy thing (laziness/forgetfulness) to be the wrong thing (garbage computations/missing error handling).

The very existence of Unums and Posits is evidence that the status quo is irreparably broken, and some folks are taking on the challenge of correcting the course. Personally, I'm not too interested in these avenues at the moment. I really just want to continue using what is well-supported, but with a little extra safety net where it is deserved.

I agree that this is not an ideal way to debug. Quite the opposite. NaN is basically a non-signaling error. I honestly cannot think of an analog for it. Except maybe smashing the stack with a buffer overflow and changing the return pointer. In the one case you end up with a corrupted computation. In the other case you end up with corrupted control flow. Why one of these is OK and the other is not is beyond me.

It's not incompatible, because I was talking about normalization of a vector, which implies the output is a unit-length vector.

You're talking about a different operation that sometimes returns 0, which is not a unit vector, so shouldn't really be called "normalization", it's confusing when you call it that. It's "normalization if a key is pressed, zero if it's not pressed".

Anyway, your operation is fine and doesn't have any issues, there is no problem to solve there (other than if you have a bug of course).

You just said you want the zero vector for zero inputs, then I don't understand where the panic comes in.

Anyway, if you want to panic in the case where some key is pressed and the vector is too short, you should just use my checked_normalize(v, 1.0) function and that's exactly the behavior you will get. I say that's a better solution than what you described with programming language panics because:

  1. It will also fail when the input has length 0.1, so it catches more bugs.
  2. It will not prevent other use cases where division by 0 shouldn't panic (sometimes a 0 is just underflow of a positive number, for example).

Not really. I'm talking about forgetting to make the normalization conditional. This is different from numerical stability or defining what vector normalization does.

This is a misinterpretation. I don't actually want to call Vec3::ZERO.normalized() ever, for any reason. And I definitely don't want that to produce Vec3::ZERO. What I want is that when this happens by mistake, I get a really loud runtime error. Which would force me to do what I did anyway: make the call conditional.
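
For reference, the conditional version I ended up with is roughly the following (sketched with plain arrays and made-up helper names rather than the engine's actual API):

    type Vec3 = [f64; 3];

    fn scale(v: Vec3, s: f64) -> Vec3 { [v[0] * s, v[1] * s, v[2] * s] }
    fn add(a: Vec3, b: Vec3) -> Vec3 { [a[0] + b[0], a[1] + b[1], a[2] + b[2]] }
    fn length(v: Vec3) -> f64 { (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt() }

    // Only normalize when a movement key actually produced input this frame.
    fn wish_velocity(forward: Vec3, right: Vec3, input_x: f64, input_y: f64, speed: f64) -> Vec3 {
        let dir = add(scale(forward, input_y), scale(right, input_x));
        let len = length(dir);
        if len > 0.0 {
            scale(dir, speed / len) // normalize and scale in one step
        } else {
            [0.0, 0.0, 0.0] // no input: stand still instead of producing NaN
        }
    }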

Again: what you want is checked_normalize(v, 1.0) (or checked_normalize(v, 0.9)) which will give you exactly what you want in this case, and more (it will detect more bugs). It will give you a very loud error if your vector doesn't have the properties you expect it to have, namely: length at least 1.

No reason to prevent other places in your code from using a different lower bound -- or even zero lower bound, in some cases.

BTW, checked_normalize is perhaps not the best name for this, since checked_* functions usually return Option. Perhaps checked_normalize().unwrap() or verified_normalize is better.
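
The Option-returning flavor would be something like this (same plain-array sketch as before):

    type Vector = [f64; 3];

    // checked_* style: returns None instead of panicking when the vector is too short.
    fn checked_normalize(v: Vector, min_length: f64) -> Option<Vector> {
        let len = (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt();
        if len >= min_length {
            Some([v[0] / len, v[1] / len, v[2] / len])
        } else {
            None
        }
    }

    // The panicking version from earlier is then just:
    // checked_normalize(v, 1.0).expect("movement vector shorter than expected")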

And herein lies the problem. I have to opt in to sane calculations when floats are involved. By default, you end up with nonsense precisely because floats are allowed to do meaningless things like return NaN for 0 / 0. And somewhere, someone will not accept paying the cost of checked arithmetic for impl Div<f32> for f32. So we all get punished, forced to go out of our way to avoid corrupting state.

I still think checking the denominator with debug_assert is a reasonable compromise. Literally anything is better than silently letting everything fall apart.

I fail to see how that is meaningless. 0/0 has no defined value in maths and it's NaN in floats. It all lines up to me.

For a good reason. If you want a range check, you have to provide what range you expect -- hence the 1.0 argument you would provide to my checked_normalize function, because you know it's at least 1.0.

But what you're proposing is that all divisions should have an automatic f64::MIN_POSITIVE range check.

Two problems with that:

  • that threshold is often too small, for example you want 1.0 not f64::MIN_POSITIVE
  • that threshold is sometimes too large, when people have no idea how small their numbers can get

For these reasons, the default shouldn't have a range check and should just try to do what is numerically stable. I already explained how normalizing vectors can be made numerically stable -- NaN for zero vectors makes it numerically unstable in the neighborhood of tiny vectors. If you want a range check, you have to specify the range that depends on the application -- in your case, 1 is the threshold.

Which is another way to say that it is meaningless. And I already said NaN is basically a non-signaling error. We agree on implementation specifics, but not on the "definition of meaningless".

NaN certainly has a meaning. It means "You know that number you were asking me to divide by some other number? Well the result of that division has no defined value".

That is: "0/0" has no meaning (defined value) but saying "The result of some X/Y has no defined value" does have meaning.

In that way NaN is not an error. It's a valid answer to the problem given. It's only an error if you decide it is in your context.

This is all getting too zen for me :slight_smile:


I didn't say NaN has no meaning, I said dividing by zero is meaningless. What I said about NaN is that it is basically a non-signaling error. I know communication is hard, but I'm trying to be as unambiguous as I can.

I also think that introducing NaNs was a mistake. But I strongly disagree with the sqrt(-1) == 0 approach. A programming mistake (it doesn't matter if it comes from badly handled precision loss) is a mistake, and it should not be silenced.

I like how posits handle it (see section 2.4). They do not have a bit representation for NaN; any incorrect operation results in an interrupt, which may optionally be processed. Yes, it has a certain (somewhat manageable) cost from a hardware design point of view, but I think the saner float programming would've been enough to offset it.

But, unfortunately, we are strongly tied to IEEE-754 floats and I don't think we will be able to migrate away from them without some cataclysmic event. So we have to live with what we have.