Why does the `mul_add` method produce a more accurate result with better performance?


#1

https://doc.rust-lang.org/stable/std/primitive.f64.html#method.mul_add

Fused multiply-add. Computes (self * a) + b with only one rounding error. This produces a more accurate result with better performance than a separate multiplication operation followed by an add.

How is this method implemented? And how much better performance can it produce?


#2

It’s implemented using LLVM intrinsics, specifically llvm.fma.f64, which is documented here.
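To make the "only one rounding error" part concrete, here is a minimal sketch. The helper names `unfused` and `fused` and the inputs are made up for illustration, not taken from the docs. The inputs are chosen so that the exact product a*a = 1 + 2^-29 + 2^-60 does not fit in an f64: the separate multiply rounds the 2^-60 term away before the add happens, while the fused version rounds only once, at the end.

```rust
/// Multiply, round to f64, then add and round again (two roundings).
fn unfused(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Fused multiply-add: the exact a*b + c is rounded a single time.
fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

fn main() {
    // a = 1 + 2^-30, so the exact a*a = 1 + 2^-29 + 2^-60.
    let a = 1.0_f64 + 1.0 / (1u64 << 30) as f64;
    // c = -(1 + 2^-29) cancels everything except the 2^-60 term.
    let c = -(1.0_f64 + 1.0 / (1u64 << 29) as f64);

    // The 2^-60 term is lost when a*a is rounded, so this prints 0e0.
    println!("separate: {:e}", unfused(a, a, c));
    // The fused version keeps the term, so this prints 2^-60.
    println!("mul_add:  {:e}", fused(a, a, c));
}
```

The result is the same whether or not the CPU has an FMA instruction; only the speed differs.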


#3

That statement in the docs isn’t always true: `mul_add` produces a more accurate result that may be faster than a separate multiplication and addition. It isn’t actually faster on many platforms, since high performance basically requires a dedicated FMA CPU instruction. Without one, it calls the libc `fma` function, which is quite a lot slower (it does a lot more work):

test fma      ... bench:       211 ns/iter (+/- 7)
test separate ... bench:         1 ns/iter (+/- 0)

#![feature(test)]
extern crate test;

#[bench]
fn fma(b: &mut test::Bencher) {
    let x = 1.5_f64;
    let y = 2.123466;
    let z = -987654.23456;

    b.iter(|| test::black_box(x).mul_add(test::black_box(y),
                                         test::black_box(z)))
}

#[bench]
fn separate(b: &mut test::Bencher) {
    let x = 1.5_f64;
    let y = 2.123466;
    let z = -987654.23456;

    b.iter(|| test::black_box(x) * test::black_box(y) + test::black_box(z))
}
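The benchmark above needs a nightly compiler because of `#![feature(test)]`. For anyone on stable, a rough equivalent can be sketched with `std::time::Instant` and `std::hint::black_box`. The helper `time_per_iter` and the iteration count are assumptions for illustration; this is a crude timing loop without the statistics the `test` harness provides, so treat the numbers as indicative only.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the average time per call.
/// `iters` must be non-zero, otherwise the division panics.
fn time_per_iter<F: FnMut() -> f64>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed() / iters
}

fn main() {
    let (x, y, z) = (1.5_f64, 2.123466_f64, -987654.23456_f64);
    let iters = 10_000_000;

    let fma = time_per_iter(iters, || {
        black_box(x).mul_add(black_box(y), black_box(z))
    });
    let separate = time_per_iter(iters, || {
        black_box(x) * black_box(y) + black_box(z)
    });

    println!("fma      ... {:?}/iter", fma);
    println!("separate ... {:?}/iter", separate);
}
```

Build it with `--release`, otherwise the loop overhead of a debug build dominates the measurement.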

#4

I think we should fix the docs then. On my machine, which does have FMA as far as I know, it is still slower:

test fma      ... bench:        29 ns/iter (+/- 1)
test separate ... bench:        23 ns/iter (+/- 1)
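One thing worth checking with results like these: a CPU having FMA does not mean the compiler is allowed to use it. The default x86_64 target does not enable the `fma` target feature, so unless it is turned on at build time, `mul_add` still ends up as a call to the libc function. Here is a small sketch that reports both sides; the function names `fma_compiled_in` and `fma_on_this_cpu` are made up for this example.

```rust
/// True if this binary was compiled with the `fma` target feature,
/// i.e. the compiler was allowed to emit the FMA instruction inline.
fn fma_compiled_in() -> bool {
    cfg!(target_feature = "fma")
}

/// Runtime check of the CPU itself (x86 / x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn fma_on_this_cpu() -> Option<bool> {
    Some(is_x86_feature_detected!("fma"))
}

/// Other architectures: this sketch does not attempt a runtime check.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn fma_on_this_cpu() -> Option<bool> {
    None
}

fn main() {
    println!("fma enabled at compile time: {}", fma_compiled_in());
    match fma_on_this_cpu() {
        Some(has) => println!("fma supported by this CPU:   {}", has),
        None => println!("no runtime check on this architecture"),
    }
}
```

If the CPU reports `true` but the compile-time line says `false`, rebuilding with `RUSTFLAGS="-C target-cpu=native"` (or `-C target-feature=+fma`) should let the compiler use the instruction directly, and the benchmark is worth re-running in that configuration.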