I wrote a simple compute benchmark in Rust (translated from C++, where it delivers 80% of peak FLOPS: 95 of 120 GFlop/s). I'm quite disappointed, since I cannot get the Rust version to deliver more than 3.5 GFlop/s.
The generated assembly does not appear to use AVX2 or FMA (both available on my machine).
use std::time::Instant;
use rayon::prelude::*; // provides parallel iterators

// this function can't be changed (may only add attributes)
#[inline]
fn expm1(x: f64) -> f64 {
    ((((((((((((((15.0 + x) * x + 210.0) * x + 2730.0) * x + 32760.0) * x + 360360.0) * x + 3603600.0) * x + 32432400.0) * x + 259459200.0) * x + 1816214400.0) * x + 10897286400.0) * x + 54486432000.0) * x + 217945728000.0) * x + 653837184000.0) * x + 1307674368000.0) * x * 7.6471637318198164759011319857881e-13
}

// this function can't be changed (may only add attributes)
fn twelve(x: f64) -> f64 {
    expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(x))))))))))))
}

// you might want to optimize this one as well
fn populate(data: &mut [f64]) {
    data.par_iter_mut().for_each(|d| *d = 0.1);
}

// optimize this one
fn apply(data: &mut [f64]) {
    data.par_iter_mut().for_each(|d| *d = twelve(*d));
}

// you might want to optimize this one as well
fn verify(data: &[f64], n: usize) {
    for i in 0..n {
        let expected = twelve(0.1);
        if expected != data[i] {
            println!("error at {:?} - {:?} != {:?}", i, data[i], expected);
        }
    }
}

fn run(data: &mut [f64], n: usize) -> f64 {
    populate(data);
    let start = Instant::now();
    apply(data);
    let duration = (start.elapsed().as_millis() as f64) * 1.0E-3;
    verify(data, n);
    let gflops = (n as f64) * 12.0 * 15.0E-9;
    println!("{:?}", gflops / duration);
    gflops / duration
}

fn main() {
    println!("avx2: {:?}", is_x86_feature_detected!("avx2"));
    let cpu: usize = num_cpus::get();
    println!("num cores {:?}", cpu);
    let n: usize = cpu * 256 * 1024; // take n as large as possible
    let mut input = vec![0.1; n];
    let mut best: f64 = 0.0;
    for _i in 0..10 {
        let gflop = run(&mut input, n);
        if gflop > best {
            best = gflop;
        }
    }
    println!("Metric : {:?} GFlop/s", best);
}
and Cargo.toml:
[package]
name = "expm1"
version = "0.0.1"
edition = "2018"

[dependencies]
rayon = "1.5"
num_cpus = "1.13.0"
Since you've manually unrolled the C++ version, you could try the same with Rayon. At a high level, you could add with_min_len(32) to set a lower bound on Rayon's adaptive parallel splitting, but I'm not sure that alone will be enough for LLVM to unroll more aggressively. Alternatively, use par_chunks_mut(32) or par_chunks_exact_mut(32), and then iterate each chunk serially or write the unrolled update yourself.
Also, if your C++ test was compiled with gcc, you could try clang to see how LLVM fares. If clang misses the same optimization, then Rust will be similarly limited.