I tried to optimize the expression a/b
when b
is fixed (but unknown) since most of the cases, a.max(b)<1<<21
and a>=0,b>0
if we map b
to b_mapped=((4398046511104_u64+b-1)/b
, then a/b == a*b_mapped>>42
for each (a,b) satisfied a.max(b)<1<<21
and a>=0,b>0
to check whether the conversion is correct, I tried both the result and the computation time,
The results are definitely correct. but as for the computation time, strange things happened:
extern crate rayon;
use rayon::prelude::*;
fn main(){
let b=std::time::Instant::now();
let r1=(0u32..65536).into_par_iter().chunks(4096).map(|x|{
x.iter().map(|&i|{
(1u32..65536).into_iter().map(|j|{
i/j
}).sum::<u32>() as usize
}).sum::<usize>()
}).sum::<usize>();
println!("{:?}",b.elapsed());
let b=std::time::Instant::now();
let a:Vec<usize>=(0..1).chain((1..2097152).map(|x|((4398046511104_u64+x-1)/x as u64) as usize)).collect();
let r2=(0usize..65536).into_par_iter().chunks(4096).map(|x|{
x.iter().map(|&i|{
(1usize..65536).into_iter().map(|j|{
i*a[j]>>42
}).sum::<usize>()
}).sum::<usize>()
}).sum::<usize>();
println!("{:?}",b.elapsed());
let b=std::time::Instant::now();
let r3=(0u32..65536).into_par_iter().chunks(4096).map(|x|{
x.iter().map(|&i|{
(1u32..65536).into_iter().map(|j|{
(i as f64/j as f64) as u32
}).sum::<u32>() as usize
}).sum::<usize>()
}).sum::<usize>();
println!("{:?}",b.elapsed());
println!("{}",(r1+r2)>>1-r3);
}
that program returns
neutron@Neutron:/me/rust/rayonops$ cargo run --release
Compiling rayonops v0.1.0 (/me/rust/rayonops)
Finished release [optimized] target(s) in 0.86s
Running `target/release/rayonops`
1.739315795s
735.013066ms
957.63335ms
0
neutron@Neutron:/me/rust/rayonops$ cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/rayonops`
1.77189577s
900.427822ms
1.013179198s
0
neutron@Neutron:/me/rust/rayonops$ cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/rayonops`
1.778011042s
859.130591ms
1.003195896s
0
neutron@Neutron:/me/rust/rayonops$ cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/rayonops`
1.53730607s
767.199922ms
1.048112309s
0
There is no doubt that a*b_mapped>>42
is faster than a/b
but it is quite strange that a/b
is slower than a as f64/b as f64
what happened in the latter case?
Does f64/f64 really faster than u32/u32?
if not, why does u32/u32 so slow in Rust?
if so, why Rust does not convert u32/u32 to f64/f64 implicitly?
Are there any cases that u32/nonzero is not equals to f64/f64 as u32?
PS: tests without rayon
neutron@Neutron:/me/rust/rayonops$ ./main_speed_test
10.20379502s
3.601069229s
6.565324578s
0
//without rayon
fn main(){
let b=std::time::Instant::now();
let r1=(0u32..65536).into_iter().map(|i|{
(1u32..65536).into_iter().map(|j|{
i/j
}).sum::<u32>() as usize
}).sum::<usize>();
println!("{:?}",b.elapsed());
let b=std::time::Instant::now();
let a:Vec<usize>=(0..1).chain((1..2097152).map(|x|((4398046511104_u64+x-1)/x as u64) as usize)).collect();
let r2=(0usize..65536).into_iter().map(|i|{
(1usize..65536).into_iter().map(|j|{
i*a[j]>>42
}).sum::<usize>()
}).sum::<usize>();
println!("{:?}",b.elapsed());
let b=std::time::Instant::now();
let r3=(0u32..65536).into_iter().map(|i|{
(1u32..65536).into_iter().map(|j|{
(i as f64/j as f64) as u32
}).sum::<u32>() as usize
}).sum::<usize>();
println!("{:?}",b.elapsed());
println!("{}",(r1+r2)>>1-r3);
}