F32 -> f28, f29?

Suppose we have:

pub enum Foo {
  A(u32),
  B(u32),
  C(u32),
  D(u32),
  E(u32),
}

and space really mattered, we could pack a Foo into a u32. Five variants need 3 tag bits, which leaves 29 bits for the payload, so we decide that all the u32's are actually u29's, i.e. it is illegal to store a value >= (1 << 29), and manually write Foo -> u32 and u32 -> Foo functions.
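For concreteness, here is a minimal sketch of that packing. The tag assignment (A = 0 through E = 4 in the top 3 bits) and the names `pack`, `unpack`, `PAYLOAD_BITS` and `PAYLOAD_MASK` are assumptions made for illustration, not something fixed by the question.

```rust
/// Low bits reserved for the payload; the top 3 bits of the u32 hold the tag.
const PAYLOAD_BITS: u32 = 29;
const PAYLOAD_MASK: u32 = (1 << PAYLOAD_BITS) - 1;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Foo {
    A(u32),
    B(u32),
    C(u32),
    D(u32),
    E(u32),
}

/// Packs a `Foo` into a `u32`.
/// Returns `None` if the payload does not fit in 29 bits.
pub fn pack(foo: Foo) -> Option<u32> {
    // Assumed tag assignment: A = 0, B = 1, C = 2, D = 3, E = 4.
    let (tag, value) = match foo {
        Foo::A(v) => (0u32, v),
        Foo::B(v) => (1, v),
        Foo::C(v) => (2, v),
        Foo::D(v) => (3, v),
        Foo::E(v) => (4, v),
    };
    if value > PAYLOAD_MASK {
        return None;
    }
    Some((tag << PAYLOAD_BITS) | value)
}

/// Inverse of `pack`. Returns `None` for the unused tags 5, 6 and 7.
pub fn unpack(bits: u32) -> Option<Foo> {
    let value = bits & PAYLOAD_MASK;
    match bits >> PAYLOAD_BITS {
        0 => Some(Foo::A(value)),
        1 => Some(Foo::B(value)),
        2 => Some(Foo::C(value)),
        3 => Some(Foo::D(value)),
        4 => Some(Foo::E(value)),
        _ => None,
    }
}
```

Three tag bits leave room for up to eight variants, so adding two more variants does not change the 29-bit payload budget.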

Now, however, suppose we have:

pub enum Foo {
  A(u32),
  B(u32),
  C(u32),
  D(u32),
  E(u32),
  F(f32),
  G(f32),
}

There is a very clear definition of what a u29 is. Is there a simple definition for f29? Do we chop off mantissa bits or exponent bits?

IEEE 754 only defines binary interchange formats of 16, 32, 64, and 128 bits, plus wider ones in multiples of 32 bits (so not 96 bits), and nothing like 29 bits.

There are only 8 exponent bits in f32, so if you're chopping off only a few bits you probably want to remove them from the significand.
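One way to read "remove them from the significand" is to drop the 3 least significant bits of the f32 bit pattern, keeping the sign and all 8 exponent bits. The sketch below does that; it is an ad hoc layout rather than any standard format, it truncates instead of rounding to nearest, and the function names and `DROPPED_BITS` are made up for illustration.

```rust
/// Number of low significand bits discarded: 32 - 29.
const DROPPED_BITS: u32 = 3;

/// Bit pattern of a quiet NaN whose payload survives the truncation.
const QUIET_NAN_BITS: u32 = 0x7FC0_0000;

/// Converts an `f32` to a 29-bit pattern: 1 sign bit, 8 exponent bits,
/// 20 stored significand bits. The value is truncated toward zero.
pub fn f32_to_f29_bits(x: f32) -> u32 {
    // A NaN whose payload lives only in the dropped bits would otherwise
    // decode as infinity, so every NaN is canonicalized first.
    let bits = if x.is_nan() { QUIET_NAN_BITS } else { x.to_bits() };
    bits >> DROPPED_BITS
}

/// Expands a 29-bit pattern back to an `f32` by zero-filling the dropped bits.
pub fn f29_bits_to_f32(bits: u32) -> f32 {
    f32::from_bits(bits << DROPPED_BITS)
}
```

The result always fits in 29 bits, so it can share a u32 with a 3-bit tag. Values whose low 3 significand bits are already zero round-trip exactly; everything else loses a relative precision of less than 2^-20.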

No.

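Before it becomes a thing, no. Think bfloat16 vs FP16. One has 8 exponent bits and 8 bits of significand precision, the other has 5 and 11 (counting the implicit leading bit), and both are used.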
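To make the range-versus-precision trade-off concrete, here is a small sketch that describes a layout by its exponent and stored significand widths and derives the largest finite value and the machine epsilon. It assumes IEEE-style conventions (biased exponent, all-ones exponent reserved for infinity and NaN, implicit leading bit). The `Layout` type and the two 29-bit layouts are hypothetical.

```rust
/// A binary floating-point layout: 1 sign bit, `exp_bits` exponent bits and
/// `frac_bits` explicitly stored significand bits (the leading 1 is implicit).
#[derive(Debug, Clone, Copy)]
pub struct Layout {
    pub exp_bits: u32,
    pub frac_bits: u32,
}

pub const FP16: Layout = Layout { exp_bits: 5, frac_bits: 10 };
pub const BF16: Layout = Layout { exp_bits: 8, frac_bits: 7 };

// Two hypothetical ways to spend 29 bits; neither is a standard format.
pub const F29_WIDE_RANGE: Layout = Layout { exp_bits: 8, frac_bits: 20 };
pub const F29_HIGH_PRECISION: Layout = Layout { exp_bits: 5, frac_bits: 23 };

impl Layout {
    /// Total width in bits, including the sign bit.
    pub fn total_bits(&self) -> u32 {
        1 + self.exp_bits + self.frac_bits
    }

    /// Gap between 1.0 and the next representable value.
    pub fn epsilon(&self) -> f64 {
        2f64.powi(-(self.frac_bits as i32))
    }

    /// Largest finite value, assuming an IEEE-style bias and the all-ones
    /// exponent reserved for infinity and NaN.
    pub fn max_finite(&self) -> f64 {
        let bias = (1i32 << (self.exp_bits - 1)) - 1;
        (2.0 - self.epsilon()) * 2f64.powi(bias)
    }
}
```

Under these assumptions `F29_WIDE_RANGE` keeps the full f32 range with coarser steps, while `F29_HIGH_PRECISION` keeps f32 precision but tops out where FP16 does, at about 65536. Which one counts as "f29" depends entirely on the use case.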

It's far more ambiguous than for integers and depends on the precision and range you need for the specific purpose. Once you don't have the constraint anymore that a floating-point type needs to be directly supported in hardware, you could also look further to something like posits, which have better precision for small bit widths. But conversion is more involved then.

In this case I would suggest considering whether a fixed point representation might fit your needs better. That is, interpreting an integer type as a number with a fixed number of bits after the binary point.

These have worked well for me in the past. Here’s an article that convinced me to give them a try.
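As a rough illustration of the fixed-point idea applied to a 29-bit payload, the sketch below treats the payload as an unsigned number with 16 bits after the binary point. The 13.16 split, the unsigned range, and the names `to_fixed` and `from_fixed` are all assumptions; a real design would pick the split from the range and resolution it actually needs.

```rust
/// Width of the payload that has to hold the fixed-point value.
const PAYLOAD_BITS: u32 = 29;
/// Bits after the binary point (assumed split: 13 integer bits, 16 fractional).
const FRAC_BITS: u32 = 16;
const SCALE: f64 = (1u32 << FRAC_BITS) as f64;
const MAX_RAW: u32 = (1 << PAYLOAD_BITS) - 1;

/// Converts to unsigned 13.16 fixed point, rounding to the nearest step.
/// Returns `None` for NaN, negative values, and values that overflow 29 bits.
pub fn to_fixed(x: f32) -> Option<u32> {
    let scaled = (x as f64 * SCALE).round();
    // Comparisons with NaN are false, so NaN falls through to `None`.
    if scaled >= 0.0 && scaled <= MAX_RAW as f64 {
        Some(scaled as u32)
    } else {
        None
    }
}

/// Converts a raw fixed-point value back to `f32`.
pub fn from_fixed(raw: u32) -> f32 {
    (raw as f64 / SCALE) as f32
}
```

Unlike a truncated float, the step size here is uniform (1/65536) over the whole range from 0 up to just under 8192, which makes the error easy to reason about.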