Strategies for converting a floating point literal to the correct precision

What are some reasonable strategies for converting a floating point literal into the correct precision within a routine?

As an example, consider the following code using f128:

// External dependencies
use ::f128::f128;

fn main() {
    // Inject a float into f128
    let x: f128 = 0.45.into();
    let y = f128::parse("0.45").unwrap();

    // Print the value
    println!("x: {:?}", x);
    println!("y: {:?}", y);
}

After running we receive:

x: 0.450000000000000011102230246251565404
y: 0.45000000000000000000000000000000001

Basically, we lose precision when using the straightforward into() conversion: the printed value of x is 0.45 rounded to f64 and then widened to f128. To obtain the correct precision, we needed to convert from a string.

More generally, I run into this a fair amount when writing numerical code that's generic over floating point types, i.e. a function of the form fn foo <T> (x : T) -> T. Often, I have some pretty simple numbers, literals, that I need to use as parameters in an algorithm. I'd like to have that literal translated to the correct precision regardless of whether we use f32, f64, f128, or whatever as our type parameter. Outside of converting that number from a string, which is pretty slow when done often, is there a good strategy for injecting such constants?

In this particular situation, what about using the f128::f128! macro?

If I understand it correctly, looks like this will parse a number at compile time and insert code with the right byte values to create the f128. This is a common strategy if you have something that can be parsed (and possibly validated) at compile time, and there are advantages to doing that over doing it at runtime.


For the general case, I'm not sure I have any good advice. Maybe the best way would be to create the literal with the highest-precision (f128), and always convert "downwards" from there?

Ah, cool! I didn't know about the f128! macro. In case anyone else is curious about working code, it requires a feature gate to work:

// Needed for f128 macro
#![feature(proc_macro_hygiene)]

// External dependencies
use ::f128::{f128,f128_inner};

fn main() {
    // Inject a float into f128
    let x : f128 = 0.45.into();
    let y = f128::parse("0.45").unwrap();
    let z = f128!(0.45);

    // Print the value
    println!("x: {:?}",x);
    println!("y: {:?}",y);
    println!("z: {:?}",z);
}

which gives

x: 0.450000000000000011102230246251565404
y: 0.45000000000000000000000000000000001
z: 0.45000000000000000000000000000000001

As for the general case, I essentially do what you suggested already. I create the literal as an f64 and then downcast it to f32 when that code runs. It works, but I was hoping for something better in case f80 or f128 gets implemented more natively. Your suggestion would work more reliably for now, though, at the cost of requiring nightly.
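
For anyone following along, the downcasting approach can be sketched roughly as below. The FromF64 trait and the scale function are hypothetical names made up for this sketch (they don't come from any crate); the point is only that the constant is written once as an f64 and narrowed with an `as` cast for f32.

```rust
use std::ops::Mul;

// Hypothetical helper trait (not from any crate): build a value of the
// target float type from a constant written as an f64.
trait FromF64 {
    fn from_f64(v: f64) -> Self;
}

impl FromF64 for f32 {
    // Narrowing cast: this is a second rounding step on top of the f64 literal.
    fn from_f64(v: f64) -> Self {
        v as f32
    }
}

impl FromF64 for f64 {
    fn from_f64(v: f64) -> Self {
        v
    }
}

// Hypothetical generic routine that needs the constant 0.45.
fn scale<T: FromF64 + Mul<Output = T>>(x: T) -> T {
    T::from_f64(0.45) * x
}

fn main() {
    println!("f32: {:?}", scale(1.0_f32));
    println!("f64: {:?}", scale(1.0_f64));
}
```

Note that this can never give more precision than f64, so it doesn't help for f128.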

If anyone else has a better suggestion, I'm open to it. Thanks for the pointer to the macro above.

Casting from a higher-precision literal to a lower-precision number doesn't always produce the correctly rounded result. The phenomenon is known as double rounding. Playground example: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ab79e8720125062d972f6193fb7cf78f
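
To spell the effect out, here is a small sketch of my own (it is not the code behind the playground link). The constant is 1 + 2^-24 + 1e-29, which sits just above the midpoint between the neighbouring f32 values 1.0 and 1.0 + 2^-23. Rounded once, it should go up to 1.0 + 2^-23. Rounded to f64 first, the tiny 1e-29 tail is lost and the value lands exactly on the midpoint, so the cast to f32 breaks the tie towards the even significand, which is 1.0.

```rust
/// Rounds the decimal constant straight to f32 (a single rounding).
fn direct_f32() -> f32 {
    1.00000005960464477539062500001_f32
}

/// Rounds the same constant to f64 first, then casts to f32 (two roundings).
fn via_f64() -> f32 {
    1.00000005960464477539062500001_f64 as f32
}

fn main() {
    println!("direct:  {:?}", direct_f32());
    println!("via f64: {:?}", via_f64());
    // The two results differ by one f32 ulp.
    println!("equal?   {}", direct_f32() == via_f64());
}
```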

Well, darn. Nice example. What's the correct way to accomplish this, then?

Other than string parsing, you can represent the value as the ratio of two integers (or of other values that can be represented exactly) and do the division in the target type, though this is not always applicable.

use std::ops::Div;

fn generic<T: Div<Output = T> + From<i16>>() -> T {
    // 0.45, computed as an exact integer ratio in the target type
    T::from(45) / T::from(100)
}
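
A quick check of this idea for the built-in types, as a sketch: the function name ratio and its two parameters are just for illustration. Since 45 and 100 are exact in both f32 and f64 and IEEE 754 division is correctly rounded, the quotient should match the literal written directly in each type. Whether a given extended type such as f128 implements From<i16> is an assumption you would need to check for that crate.

```rust
use std::ops::Div;

// Hypothetical helper: builds num/den in the target float type T.
// Both integers convert exactly, so only the division rounds.
fn ratio<T: Div<Output = T> + From<i16>>(num: i16, den: i16) -> T {
    T::from(num) / T::from(den)
}

fn main() {
    let a: f32 = ratio(45, 100);
    let b: f64 = ratio(45, 100);

    // Each result agrees with the literal rounded directly to that type.
    assert_eq!(a, 0.45_f32);
    assert_eq!(b, 0.45_f64);
    println!("f32: {:?}, f64: {:?}", a, b);
}
```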