A strange performance issue: `a/b` is faster when `a, b` are `f64` than when they are `u32`

I tried to optimize the expression `a/b` when `b` is fixed (but unknown at compile time), since in most of my cases `a.max(b) < 1<<21`, `a >= 0`, and `b > 0`.

If we map `b` to `b_mapped = (4398046511104_u64 + b - 1) / b` (note that `4398046511104 == 1 << 42`), then `a/b == a*b_mapped >> 42` for every `(a, b)` satisfying `a.max(b) < 1<<21`, `a >= 0`, `b > 0`.
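To illustrate, here is a minimal standalone check of that identity (a sketch added for this writeup; `map_b` and the sample pairs are my own choice, not from the original post):

```rust
// b_mapped = ceil(2^42 / b); 4398046511104 == 1 << 42.
fn map_b(b: u64) -> u64 {
    (4398046511104_u64 + b - 1) / b
}

fn main() {
    // Spot-check the identity a / b == (a * b_mapped) >> 42
    // under the stated preconditions: a.max(b) < 1 << 21, b > 0.
    let samples = [(0u64, 1u64), (7, 3), (1_000_000, 997), ((1 << 21) - 1, (1 << 21) - 1)];
    for &(a, b) in &samples {
        assert_eq!((a * map_b(b)) >> 42, a / b);
    }
    println!("multiply-shift matches integer division on all sampled pairs");
}
```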

To check whether the conversion is correct, I compared both the results and the computation times.

The results are definitely correct, but as for the computation time, something strange happened:

extern crate rayon;
use rayon::prelude::*;
fn main(){
    let b=std::time::Instant::now();
    let r1=(0u32..65536).into_par_iter().chunks(4096).map(|x|{
        x.iter().map(|&i|{
            (1u32..65536).into_iter().map(|j|{
                i/j
            }).sum::<u32>() as usize
        }).sum::<usize>()
    }).sum::<usize>();
    println!("{:?}",b.elapsed());
    let b=std::time::Instant::now();
    let a:Vec<usize>=(0..1).chain((1..2097152).map(|x|((4398046511104_u64+x-1)/x as u64) as usize)).collect();
    let r2=(0usize..65536).into_par_iter().chunks(4096).map(|x|{
        x.iter().map(|&i|{
            (1usize..65536).into_iter().map(|j|{
                i*a[j]>>42
            }).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();
    println!("{:?}",b.elapsed());
    let b=std::time::Instant::now();
    let r3=(0u32..65536).into_par_iter().chunks(4096).map(|x|{
        x.iter().map(|&i|{
            (1u32..65536).into_iter().map(|j|{
                (i as f64/j as f64) as u32
            }).sum::<u32>() as usize
        }).sum::<usize>()
    }).sum::<usize>();
    println!("{:?}",b.elapsed());
    // `-` binds tighter than `>>`, so the original `>>1-r3` parsed as `>>(1-r3)`;
    // parenthesize to get the intended sanity check (prints 0 when all three agree)
    println!("{}",((r1+r2)>>1)-r3);
}

That program prints:

neutron@Neutron:/me/rust/rayonops$ cargo run --release
   Compiling rayonops v0.1.0 (/me/rust/rayonops)
    Finished release [optimized] target(s) in 0.86s
     Running `target/release/rayonops`
1.739315795s
735.013066ms
957.63335ms
0
neutron@Neutron:/me/rust/rayonops$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/rayonops`
1.77189577s
900.427822ms
1.013179198s
0
neutron@Neutron:/me/rust/rayonops$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/rayonops`
1.778011042s
859.130591ms
1.003195896s
0
neutron@Neutron:/me/rust/rayonops$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/rayonops`
1.53730607s
767.199922ms
1.048112309s
0

There is no doubt that `a*b_mapped >> 42` is faster than `a/b`.

But it is quite strange that `a/b` is also slower than `a as f64 / b as f64`.

What happened in the latter case?

Is `f64/f64` really faster than `u32/u32`?

If not, why is `u32/u32` so slow in Rust?

If so, why doesn't Rust convert `u32/u32` to `f64/f64` implicitly?

Are there any cases where `u32/nonzero` is not equal to `(f64/f64) as u32`?
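Regarding that last question, a sampled (not exhaustive) probe is easy to write. This is my own sketch; the odd strides are arbitrary, and a clean run is evidence rather than a proof:

```rust
fn main() {
    // Sample u32 pairs with arbitrary odd strides and check that dividing
    // via f64 and truncating back agrees with native integer division.
    let mut checked = 0u64;
    for a in (0u32..=u32::MAX).step_by(10_000_019) {
        for b in (1u32..=u32::MAX).step_by(9_999_977) {
            assert_eq!((a as f64 / b as f64) as u32, a / b);
            checked += 1;
        }
    }
    println!("no mismatch in {checked} sampled pairs");
}
```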

PS: tests without rayon

neutron@Neutron:/me/rust/rayonops$ ./main_speed_test 
10.20379502s
3.601069229s
6.565324578s
0
//without rayon
fn main(){
    let b=std::time::Instant::now();
    let r1=(0u32..65536).into_iter().map(|i|{
            (1u32..65536).into_iter().map(|j|{
                i/j
            }).sum::<u32>() as usize
    }).sum::<usize>();
    println!("{:?}",b.elapsed());
    let b=std::time::Instant::now();
    let a:Vec<usize>=(0..1).chain((1..2097152).map(|x|((4398046511104_u64+x-1)/x as u64) as usize)).collect();
    let r2=(0usize..65536).into_iter().map(|i|{
            (1usize..65536).into_iter().map(|j|{
                i*a[j]>>42
            }).sum::<usize>()
    }).sum::<usize>();
    println!("{:?}",b.elapsed());
    let b=std::time::Instant::now();
    let r3=(0u32..65536).into_iter().map(|i|{
            (1u32..65536).into_iter().map(|j|{
                (i as f64/j as f64) as u32
            }).sum::<u32>() as usize
    }).sum::<usize>();
    println!("{:?}",b.elapsed());
    // parenthesized as above: `-` binds tighter than `>>` (prints 0 when all agree)
    println!("{}",((r1+r2)>>1)-r3);
}

Try using bencher for more reliable results. Caches, threadpools and lots of noise from modern OSes make most DIY benchmarks misleading (e.g. all cost of lazy initialization is paid by the first check you run).

Modern CPUs have out-of-order execution, multiple independent ALUs, vector units, etc. The simple reasoning that fewer, smaller operations are faster doesn't work any more. It's quite possible that the f64 version got vectorized better, or that the computation was faster because the FP part of the CPU was idle while the integer part was busy, so you got the FP ops for "free".

Optimizers are weird too, and the performance of numeric code usually depends a lot on the optimizer's ability to autovectorize with SIMD. It's possible that the `a[j]` indexing has disabled optimizations, because indexing is an operation that can panic, and panicking is a precisely observable side effect, so the optimizer can't combine multiple calculations into one. You can use rust.godbolt.org to check if the panicking code is there.
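As a sketch of that suggestion (my own example, not from the post above): iterating over the slice instead of indexing with `a[j]` removes the panicking bounds check from the hot loop:

```rust
// Indexed version: each `table[j]` access carries a potential panic.
fn sum_indexed(i: usize, table: &[usize]) -> usize {
    (1..table.len()).map(|j| (i * table[j]) >> 42).sum()
}

// Iterator version: no bounds checks inside the loop, so the optimizer
// is free to combine and vectorize the multiply-shift operations.
fn sum_iterated(i: usize, table: &[usize]) -> usize {
    table[1..].iter().map(|&m| (i * m) >> 42).sum()
}

fn main() {
    // Tiny table in the style of the original: entry 0 is a dummy,
    // entry b holds ceil(2^42 / b).
    let table: Vec<usize> = (0..1)
        .chain((1..16u64).map(|b| ((4398046511104_u64 + b - 1) / b) as usize))
        .collect();
    assert_eq!(sum_indexed(12345, &table), sum_iterated(12345, &table));
    println!("indexed and iterator sums agree: {}", sum_iterated(12345, &table));
}
```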

4 Likes

You're right.

After disabling out-of-order execution, `f64/f64` slows down significantly.

Thank you a lot!

What did you change?

Out-of-order execution is an integral part of modern CPU design. It's not something you can disable without switching to a CPU architecture from the early 1990s :slight_smile:

4 Likes

I changed `i/j` to `(result+const)/j` to enforce the execution order.

The only thing I want to figure out is whether `u32/u32` is faster than `f64/f64`.
After changing `i/j` to `(result+const)/j`, I found `u32/u32` faster than `f64/f64`, which implies that out-of-order execution sped up `f64/f64` but not `u32/u32`.

I strongly recommend that you use bencher to test the speeds, because a single elapsed() check may give you misleading results, and you may be drawing invalid conclusions from noisy numbers caused by irrelevant side effects and unisolated benchmark code.

You can also investigate what code is generated. You may be surprised that the f64 version of `i/j` generates 62 instructions, of which only two are the division:

3 Likes

I'm quite sorry for being lazy about uploading my code...

Actually, for the later tests I did use the `cargo bench` command.

Here's what I wrote in `lib.rs`

#![feature(test)]
extern crate test;

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[test]
    fn it_works() {
        let r1=su32();
        let r3=sf64();
        assert_eq!(0, r3-r1);
    }

    #[bench]
    fn bench_u32(b: &mut Bencher) {
        b.iter(|| su32());
    }
    #[bench]
    fn bench_f64(b: &mut Bencher) {
        b.iter(|| sf64());
    }
    #[bench]
    fn bench_u32e(b: &mut Bencher) {
        b.iter(|| su32e());
    }
    #[bench]
    fn bench_f64e(b: &mut Bencher) {
        b.iter(|| sf64e());
    }
}
fn su32()->usize{
    (0u32..256).into_iter().map(|i|{
            (1u32..65536).into_iter().map(|j|{
                i/j
            }).sum::<u32>() as usize
    }).sum::<usize>()
}
fn sf64()->usize{
    (0u32..256).into_iter().map(|i|{
            (1u32..65536).into_iter().map(|j|{
                (i as f64/j as f64) as u32
            }).sum::<u32>() as usize
    }).sum::<usize>()
}
fn su32e()->usize{
            (1u32..2097152).into_iter().fold(1u32,|s,j|{
                (s/j).wrapping_mul(s+1)
            }) as usize
}
fn sf64e()->usize{
            (1u32..2097152).into_iter().fold(1u32,|s,j|{
                ((s as f64/j as f64) as u32).wrapping_mul(s+1)
            }) as usize
}

Here's the result.

neutron@Neutron:/me/rust/rayonops/src$ cargo bench
    Updating crates.io index
   Compiling autocfg v1.0.0
   Compiling maybe-uninit v2.0.0
   Compiling lazy_static v1.4.0
   Compiling cfg-if v0.1.10
   Compiling libc v0.2.72
   Compiling scopeguard v1.1.0
   Compiling rayon-core v1.7.1
   Compiling either v1.5.3
   Compiling crossbeam-utils v0.7.2
   Compiling memoffset v0.5.5
   Compiling crossbeam-epoch v0.8.2
   Compiling rayon v1.3.1
   Compiling num_cpus v1.13.0
   Compiling crossbeam-queue v0.2.3
   Compiling crossbeam-deque v0.7.3
   Compiling rayonops v0.1.0 (/me/rust/rayonops)
warning: function is never used: `su32`
  --> src/lib.rs:33:4
   |
33 | fn su32()->usize{
   |    ^^^^
   |
   = note: `#[warn(dead_code)]` on by default

warning: function is never used: `sf64`
  --> src/lib.rs:40:4
   |
40 | fn sf64()->usize{
   |    ^^^^

warning: function is never used: `su32e`
  --> src/lib.rs:47:4
   |
47 | fn su32e()->usize{
   |    ^^^^^

warning: function is never used: `sf64e`
  --> src/lib.rs:52:4
   |
52 | fn sf64e()->usize{
   |    ^^^^^

warning: 4 warnings emitted

    Finished bench [optimized] target(s) in 8.80s
     Running /me/rust/rayonops/target/release/deps/rayonops-9db5cc5e6fe781a2

running 5 tests
test tests::it_works ... ignored
test tests::bench_f64  ... bench:  15,689,291 ns/iter (+/- 568,453)
test tests::bench_f64e ... bench:  17,970,717 ns/iter (+/- 756,258)
test tests::bench_u32  ... bench:  26,728,849 ns/iter (+/- 210,731)
test tests::bench_u32e ... bench:  14,510,022 ns/iter (+/- 383,915)

test result: ok. 0 passed; 0 failed; 1 ignored; 4 measured; 0 filtered out

     Running /me/rust/rayonops/target/release/deps/rayonops-71007df303a96d34

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out



2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.