Why does the `mul_add` method produce a more accurate result with better performance?

Fused multiply-add. Computes (self * a) + b with only one rounding error. This produces a more accurate result with better performance than a separate multiplication operation followed by an add.
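A small illustration of the "only one rounding error" part (my own example, with the constant 2^27 + 1 picked so that its exact square needs 55 bits of mantissa and therefore can't survive a separate multiplication unchanged):

fn main() {
    let a = 134_217_729.0_f64;      // 2^27 + 1
    let p = a * a;                  // exactly 2^54 + 2^28 + 1, which rounds to 2^54 + 2^28

    // Separate multiply then add: two roundings, the low-order 1 is already gone.
    println!("separate: {}", a * a - p);        // prints 0

    // Fused: a*a - p is computed exactly, then rounded once.
    println!("fused:    {}", a.mul_add(a, -p)); // prints 1
}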

How is this method implemented? And how much better performance can it produce?

It's implemented using LLVM intrinsics, specifically `llvm.fma.f64`, which is documented in the LLVM language reference.

That statement in the docs isn't always true: it produces a more accurate result that may be faster than a separate multiplication and addition. It's not actually faster on many platforms, since getting high performance essentially requires a dedicated fma CPU instruction. If the target doesn't have one, the intrinsic falls back to the libc `fma` function, which is quite a lot slower (it has to do a lot more work in software):

test fma      ... bench:       211 ns/iter (+/- 7)
test separate ... bench:         1 ns/iter (+/- 0)

#![feature(test)]
extern crate test;

#[bench]
fn fma(b: &mut test::Bencher) {
    let x = 1.5_f64;
    let y = 2.123466;
    let z = -987654.23456;

    b.iter(|| test::black_box(x).mul_add(test::black_box(y),
                                         test::black_box(z)))
}

#[bench]
fn separate(b: &mut test::Bencher) {
    let x = 1.5_f64;
    let y = 2.123466;
    let z = -987654.23456;

    b.iter(|| test::black_box(x) * test::black_box(y) + test::black_box(z))
}
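Note that whether a benchmark like this ever reaches the hardware instruction also depends on how it was compiled: on the default x86_64 targets the `fma` feature is not enabled, so the intrinsic lowers to the libc call even on CPUs that have FMA. Rebuilding with `RUSTFLAGS="-C target-feature=+fma"` (or `-C target-cpu=native`) is what makes the single-instruction path possible. A quick way to check which way a binary was built (a sketch, separate from the benchmark above):

fn main() {
    // True only if the fma target feature was enabled at compile time,
    // e.g. via -C target-feature=+fma or -C target-cpu=native.
    println!("compiled with fma: {}", cfg!(target_feature = "fma"));
}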

I think we should fix the docs then. On my machine, which does have FMA as far as I know, it is still slower:

test fma      ... bench:        29 ns/iter (+/- 1)
test separate ... bench:        23 ns/iter (+/- 1)
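"Does have FMA as far as I know" can also be checked at runtime: std can ask the CPU directly on x86_64. Keep in mind that the CPU supporting the instruction and the benchmark binary being allowed to emit it (the `cfg!` check above) are separate questions. A minimal sketch:

fn main() {
    // Hardware capability only: whether this CPU exposes FMA instructions,
    // independent of how the binary itself was compiled.
    #[cfg(target_arch = "x86_64")]
    println!("cpu supports fma: {}", is_x86_feature_detected!("fma"));
}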