Fast math seems ... slower?

I ran the following basic test to get a first impression of the performance difference between ordinary math operations and fast math (on nightly Rust):

#![feature(core_intrinsics)]

extern crate rand;

use std::vec::Vec;
use std::intrinsics::*;
use std::time::{Instant, Duration};

fn main() {
    let loop_count = 10_000_000;
    let duration_fast: Duration;
    let duration_slow: Duration;
    let mut tuples = Vec::<(f64, f64, f64)>::new();
    let mut x = 0.5f64;
    for _ in 0..loop_count {
        x = (0.78f64 * x + 0.22f64).fract();
        let y = (0.12f64 * x + 0.13f64).fract();
        let z = (0.54f64 * x + 0.07f64).fract();
        tuples.push((x, y, z));
    }

    {
        let mut p = 1.0f64;
        let now = Instant::now();
        for tuple in (&tuples).into_iter() {
            // chain polynomials
            unsafe {
                p = fadd_fast(
                    fmul_fast(
                        fsub_fast(fmul_fast(tuple.0, p), tuple.1),
                        p),
                    tuple.2);
            }
        }
        duration_fast = now.elapsed();
    }
    {
        let mut p = 1.0f64;
        let now = Instant::now();
        for tuple in (&tuples).into_iter() {
            // chain polynomials
            p = (tuple.0 * p - tuple.1) * p + tuple.2;
        }
        duration_slow = now.elapsed();
    }
    println!("fast:  {}.{:09} s", duration_fast.as_secs(), duration_fast.subsec_nanos());
    println!("slow:  {}.{:09} s", duration_slow.as_secs(), duration_slow.subsec_nanos());
}

and it turned out that the version I expected to be faster was actually slower:

fast:  0.407347297 s
slow:  0.249692092 s

I was originally running this on my laptop on 64-bit Windows 10 and wondered if the feature was unavailable or unoptimized on Windows, but I get similar results on the Rust Playground. Is some additional flag necessary to activate fast math? Or is there something fundamentally wrong with my code?

Are you using cargo run --release?
By default cargo makes a debug build which may be slower.

  1. Run in release mode.
  2. Swap the code ordering and see whether the times swap.

Also, you might want to try the fma intrinsics: https://doc.rust-lang.org/std/intrinsics/fn.fmaf32.html (your data is f64, so fmaf64 would be the matching variant).
Your chained polynomial is basically two fma operations.
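
To make the "two fma operations" remark concrete, here is a minimal sketch on stable Rust. It uses the stable f64::mul_add method rather than the nightly intrinsic linked above; step_plain and step_fma are names made up for this illustration. Whether this is actually faster depends on the target having a hardware FMA instruction, so treat it as something to measure, not a guarantee.

```rust
/// Plain evaluation: each multiply, subtract and add is rounded separately.
fn step_plain(p: f64, (a, b, c): (f64, f64, f64)) -> f64 {
    (a * p - b) * p + c
}

/// The same step written as two fused multiply-adds.
/// Each `mul_add` rounds only once, so the result may differ from
/// `step_plain` in the last bits.
fn step_fma(p: f64, (a, b, c): (f64, f64, f64)) -> f64 {
    // a * p + (-b), then (that) * p + c
    a.mul_add(p, -b).mul_add(p, c)
}

fn main() {
    let t = (0.61, 0.2032, 0.3994);
    let p = 0.5;
    println!("plain: {}", step_plain(p, t));
    println!("fma:   {}", step_fma(p, t));
}
```

Without hardware FMA enabled for the target, mul_add may fall back to a slower software routine, which is one more reason to benchmark both forms.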

@jonh It looks like in release mode the compiler just throws almost everything away. I get numbers like

fast:  0.000001584 s
slow:  0.000000760 s

in playground with your code in release mode and

Standard Error
   Compiling playground v0.0.1 (/playground)
    Finished release [optimized] target(s) in 0.82s
     Running `target/release/playground`
inf
inf
Standard Output
fast:  0.047023889 s
slow:  0.047111321 s

(exact numbers vary significantly, and fast is not always faster than slow) in release mode if I add eprintln!("{:?}", p); right after the line that assigns duration_…, so that p is actually used. Please write correct benchmarks; speculating on the current one will get you nowhere. Also, are you sure that p should be inf at the end?

There are tons of ways micro benchmarks like this can go wildly wrong, from cpu warmup to memory boundaries to cpu caches. I don't know the extent of what is necessary to get fully accurate readings. However, you can be sure that running in debug mode will give fairly meaningless data, and you should probably be including some warmup cycles which aren't counted in both of these.

Besides that, though, using an established benchmarking crate can help do all of these extraneous things right so that you can measure what you want to. criterion is a current go-to benchmarking crate for rust which will automatically handle black-boxing values and doing a good number of loops with accurate timing.

Printing to stdout will add even more variance, I think. The standard way to test this sort of thing is a "black box" function which the compiler cannot / will not optimize out. One is provided as test::black_box, and criterion provides an identical criterion::black_box.

I'd recommend redoing this as a set of criterion benchmarks and using black_box on the input of the benchmarks and the output values to ensure that the compiler doesn't optimize out the calculations.
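
For anyone who wants to see the black-box idea without pulling in criterion, here is a rough sketch. It assumes a toolchain recent enough to provide std::hint::black_box on stable (that function is not mentioned above; the posts refer to test::black_box and criterion::black_box). chain and time_once are hypothetical helper names, and a single timed run like this is still far less reliable than what criterion does.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// The chained polynomial from the original post, returning the final value.
fn chain(tuples: &[(f64, f64, f64)]) -> f64 {
    let mut p = 1.0f64;
    for &(a, b, c) in tuples {
        p = (a * p - b) * p + c;
    }
    p
}

/// Hypothetical helper: times one call to `chain`.
/// `black_box` on the input hides the data from the optimizer, and
/// `black_box` on the output keeps the result from being discarded.
fn time_once(tuples: &[(f64, f64, f64)]) -> (f64, Duration) {
    let start = Instant::now();
    let result = black_box(chain(black_box(tuples)));
    (result, start.elapsed())
}

fn main() {
    let tuples = vec![(0.5, 0.25, 0.125); 1_000];
    // A few untimed warm-up runs before the measured one.
    for _ in 0..10 {
        black_box(chain(black_box(tuples.as_slice())));
    }
    let (result, elapsed) = time_once(&tuples);
    println!("result = {}, elapsed = {:?}", result, elapsed);
}
```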

I think the following code should do the trick. Put these files in an empty directory:

(./benches/fastmath.rs):

#![feature(core_intrinsics)]

extern crate criterion;
extern crate rand;

use std::vec::Vec;
use std::intrinsics::*;
use criterion::{criterion_group, criterion_main, Criterion};

fn fast(tuples: &[(f64, f64, f64)], pref: &mut f64) {
    let mut p = 1.0f64;
    for tuple in (&tuples).into_iter() {
        // chain polynomials
        unsafe {
            p = fadd_fast(
                fmul_fast(
                    fsub_fast(fmul_fast(tuple.0, p), tuple.1),
                    p),
                tuple.2);
        }
    }
    *pref = p;
}

fn slow(tuples: &[(f64, f64, f64)], pref: &mut f64) {
    let mut p = 1.0f64;
    for tuple in (&tuples).into_iter() {
        // chain polynomials
        p = (tuple.0 * p - tuple.1) * p + tuple.2;
    }
    *pref = p;
}

fn criterion_benchmark(c: &mut Criterion) {
    let loop_count = 10_000_000;
    let mut tuples = Vec::<(f64, f64, f64)>::new();
    let mut x = 0.5f64;
    for _ in 0..loop_count {
        x = (0.78f64 * x + 0.22f64).fract();
        let y = (0.12f64 * x + 0.13f64).fract();
        let z = (0.54f64 * x + 0.07f64).fract();
        tuples.push((x, y, z));
    }
    let mut res = [0.0; 2];
    c.bench_function("fast", |b| b.iter(|| fast(&tuples, &mut res[0])));
    c.bench_function("slow", |b| b.iter(|| slow(&tuples, &mut res[1])));
    eprintln!("{:?}", res);
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

(./Cargo.toml)

[package]
name = "fastmath"
version = "0.0.0"
edition = "2018"

[dependencies]
criterion = "0.3"
rand = "0.7"

[[bench]]
name = "fastmath"
harness = false

then cd there and run cargo bench. I currently do not have a nightly compiler to test this, so I can't say what the results will be. But if I run it myself with some adjustments (the intrinsics import and the feature attribute commented out, and functions like unsafe fn fadd_fast(a: f64, b: f64) -> f64 { a + b } defined instead, which obviously do not use fast math), the benchmark runs for ten minutes with results like

   Compiling fastmath v0.0.0 (/home/zyx/tmp/rust/fastmath)
    Finished bench [optimized] target(s) in 9.05s
     Running target/release/deps/fastmath-6c9080db52586cac
Benchmarking fast: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 283.3s or reduce sample count to 10.
fast                    time:   [56.297 ms 56.459 ms 56.617 ms]                  
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Benchmarking slow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 292.0s or reduce sample count to 10.
slow                    time:   [56.104 ms 56.254 ms 56.416 ms]                  
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

[inf, inf]
cargo bench  601,19s user 0,89s system 101% cpu 9:52,99 total

(which obviously just shows that the compiler is able to inline my fadd_fast functions; I mainly did that to be sure that my code is correct).

Printing to stdout will add even more variance, I think.

This is exactly why I did that only after assigning duration_…. It still has large variance on playground though (40..60 ms, with “fast” being sometimes faster and sometimes slower), and playground has no criterion. You may wish to review my previous comment; AFAIR this is the first time I have ever used criterion.

After a bit of testing and benchmarking using criterion the following were my results:
The slow and fast version both behave the same and my guess would be they both get compiled to the same code.
A version using the fma instruction improves performance (on my system from 41us to 25us).

benches/my_bench.rs
Cargo.toml

You can just put the files in a new empty project.

Yes, sorry, I forgot to run in release mode. However, I would have naively expected both cases to give statistically similar results without optimization.

I didn't know about this one, but the polynomial used above is just part of a toy benchmark and does not represent the actual computation I'd like to do. That said, I may want to use fmaf64 in future code. Thanks.

It seems that I'm the addressee, not @jonh. Please make constructive comments; "write correct benchmarks" is as helpful as "please don't introduce bugs".

Actually, I had averaged durations over ten consecutive interleaved iterations of my loops, but didn't include that in the code to keep it simpler.

Thank you for the recommendation.

This is also what I noticed. Now, oddly enough, when I try to encapsulate the computations in a dedicated type, the performance degrades...

Toy benchmarks are good and all but remember that compilers do a lot of optimizations. I would recommend first profiling your code (there are some tools out there to get flamegraphs from your programs) and checking if something really is a bottleneck before starting to optimize parts of it.
That being said, I found it really difficult to benchmark compiled code with all compiler optimization turned on since many function boundaries get removed.
For a rough estimate however setting the optimization level to 1 (optimizes function bodies, but does not inline them afaik) always helped me

I'm not exactly sure what you mean, could you post what you did?

I just wanted to note that this is generally not true, not in Rust at least. There are tons of things which the compiler trivially optimizes out in release mode, but doesn't in debug mode. These things, like iterators, function calls to inlineable functions, etc., cost nothing in release mode, but can be pretty bad in debug mode. The problem is that code can introduce an arbitrary number of these things, and while they have no effect on performance in release mode, they can greatly change timings in debug mode.

If all things were equally hard to optimize, then I'd agree with you. But they aren't.

Yes but you seem to miss that before profiling anything you must learn how to profile. Of course compilers are doing a lot of arcane optimizations. That was exactly the purpose of this attempt for benchmarking: having a first experience of this variability in the particular context of fastmath and checking if it may or may not give interesting results.

See here for some more relevant benchmarks. This is related to a question I asked myself a long time ago but didn't take the time to investigate until now: fast math operations are unsafe because they may produce nonsense if the operands have invalid content. But there exist operations more elaborate than the 4 classical arithmetic ones that actually preserve subdomains, and where fast math could be used in theory. So I was wondering if it was worth safely encapsulating these operations inside a type and thus benefiting from fast math inside. Of course, this kind of compartmentalization may degrade the compiler's capacity to optimize on a broader scope. This is basically what I was curious to check.
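
To illustrate the kind of encapsulation meant here, below is a minimal sketch that compiles on stable. UnitInterval and its methods are hypothetical names invented for this example, not code from the linked benchmarks. The bodies use ordinary arithmetic; on nightly, the marked lines are where a fast-math intrinsic such as fmul_fast might be substituted, assuming the [0, 1] invariant is enough to satisfy the intrinsic's requirements (an assumption that would need checking against the intrinsics documentation).

```rust
/// Hypothetical newtype: a value known to lie in [0, 1],
/// hence finite and never NaN.
#[derive(Clone, Copy, Debug, PartialEq)]
struct UnitInterval(f64);

impl UnitInterval {
    /// Checked constructor; NaN and out-of-range values are rejected.
    fn new(x: f64) -> Option<Self> {
        if (0.0..=1.0).contains(&x) {
            Some(UnitInterval(x))
        } else {
            None
        }
    }

    fn get(self) -> f64 {
        self.0
    }

    /// The product of two values in [0, 1] stays in [0, 1].
    fn mul(self, other: Self) -> Self {
        // Assumption: on nightly, `unsafe { fmul_fast(self.0, other.0) }`
        // could be used here instead of the plain multiply.
        UnitInterval(self.0 * other.0)
    }

    /// The complement 1 - x of a value in [0, 1] stays in [0, 1].
    fn complement(self) -> Self {
        // Assumption: likewise a candidate spot for `fsub_fast`.
        UnitInterval(1.0 - self.0)
    }
}

fn main() {
    let half = UnitInterval::new(0.5).expect("0.5 is in [0, 1]");
    println!("{:?}", half.mul(half).complement());
    println!("{:?}", UnitInterval::new(1.5));
}
```

The safety argument lives entirely in the constructor and in each method preserving the invariant, which is also why such a wrapper might hide optimization opportunities that span several operations.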

Sure, but in that precise case it could have been considered reasonable to assume that in debug mode fast math operations just default to standard ones. It turns out that not only are fast math operations still different in debug mode, but they are actually slower. I don't mean that the assumption above was the only acceptable one, but at least it seemed possible.

I don't agree. Two differing definitions of floating-point arithmetic are involved that create different results for inputs that are outside the fast math domain. If the tests in the code verified fast-math operations for any inputs in the fast-math-excluded region, the debug-mode test would give different results than deployable code.

But AFAIU any input that is outside the fast math domain is UB if provided to fast math operation. In case of UB you cannot expect any consistency anyway.

Just adding my two cents. In my opinion, microbenchmarking is not an easy task and is error-prone. And in this case, because the target code is very short, it is significantly easier to read the output assembly.

You can see that

  1. without -C target-cpu=native, the outputs are exactly the same.
  2. with -C target-cpu=native, the outputs differ. The difference is the vfmsub213sd instruction, which is an FMA operation (hovering the mouse pointer over it shows the instruction description). Thus, you can expect the fast math code to be faster when compiled with this additional compiler option.
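
As a small companion to the two points above, this sketch reports whether the binary was compiled with FMA enabled (which -C target-cpu=native may turn on, depending on the machine) and whether the CPU advertises FMA at runtime. cpu_has_fma is a hypothetical helper name; on non-x86 targets it simply returns false in this sketch, which says nothing about what those CPUs actually support.

```rust
/// Runtime check on x86/x86_64: does the CPU running this report FMA?
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_fma() -> bool {
    std::is_x86_feature_detected!("fma")
}

/// Fallback for other architectures: this sketch does not attempt detection.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_fma() -> bool {
    false
}

fn main() {
    // Compile-time view: was the compiler allowed to emit FMA instructions?
    println!("compiled with FMA enabled: {}", cfg!(target_feature = "fma"));
    // Runtime view: does this particular CPU advertise FMA?
    println!("CPU reports FMA at runtime: {}", cpu_has_fma());
}
```

If the first line prints false, the compiler could not have emitted an instruction like vfmsub213sd for this build, regardless of what the CPU supports.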
