Hello Rustaceans,
I am currently writing an HPC benchmarking suite in Rust with simple kernels like copy, update, triad, daxpy, sum, etc. The main goal of this suite is to saturate the memory bandwidth and see how much memory bandwidth each kernel can use. The original benchmarking kernels were written in C with OpenMP for parallelism.
For example, the Copy kernel in C with OpenMP (static scheduling) is written as:
double copy(
    double * restrict a,
    double * restrict b,
    int N
)
{
    double S, E;
    S = getTimeStamp();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = b[i];
    }
    E = getTimeStamp();
    return E - S;
}
And I am trying to port such benchmarks to Rust with Rayon. Below is the best-performing kernel I could write in Rust with Rayon.
use rayon::prelude::*;
use std::thread::available_parallelism;
use std::time::Instant;

pub fn copy(c: &mut [f64], a: &[f64], n: usize) -> f64 {
    let s = Instant::now();
    // Parallel version: pair each mutable chunk of c with the matching
    // chunk of a, so every thread copies its own contiguous range.
    // (Indexing a[i] with the within-chunk index i would make every
    // chunk read from the start of a instead.)
    c.par_chunks_mut(n)
        .zip(a.par_chunks(n))
        .for_each(|(c_slice, a_slice)| {
            c_slice
                .iter_mut()
                .zip(a_slice.iter())
                .for_each(|(val, x)| *val = *x)
        });
    s.elapsed().as_secs_f64()
}

pub fn main() {
    let threads = available_parallelism().unwrap().get();
    let vec_a: Vec<f64> = (0..120_000_000).into_par_iter().map(|_| 2.0).collect();
    let mut vec_c: Vec<f64> = (0..120_000_000).into_par_iter().map(|_| 0.5).collect();
    let t1 = copy(&mut vec_c, &vec_a, 120_000_000 / threads);
    println!("Copy took : {t1} sec");
}
The main point of getting #threads and dividing the problem size by it is to give each thread a contiguous, equal amount of work. Like, thread #0 gets the first 0..x elements, thread #1 gets the next contiguous x..y elements, and so on. This is how OpenMP static scheduling works.
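For reference, the contiguous static split described above can also be written with plain std scoped threads, which makes the per-thread ranges explicit (a sketch; `copy_static` is my name, and ceiling division is used so the last, possibly shorter, chunk still covers the tail):

```rust
use std::thread;

/// Copy src into dst in contiguous per-thread blocks, mirroring
/// OpenMP's schedule(static): thread #0 gets the first chunk,
/// thread #1 the next, and so on.
fn copy_static(dst: &mut [f64], src: &[f64], threads: usize) {
    let chunk = (dst.len() + threads - 1) / threads; // ceiling division
    thread::scope(|s| {
        for (d, sr) in dst.chunks_mut(chunk).zip(src.chunks(chunk)) {
            s.spawn(move || d.copy_from_slice(sr));
        }
    });
}
```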
I used these flags in .cargo/config.toml:
[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-C",
    "target-cpu=native",
    "-C",
    "llvm-args=-ffast-math",
    "-C",
    "opt-level=3",
]
My problem is that I am not able to extract the same performance as the C version.
I am using Intel Sapphire Rapids (Xeon Platinum 8470), dual socket with 104 cores and a max theoretical memory bandwidth of 600 GB/s.
The C version with OpenMP takes 0.0052 seconds on average:
Function Rate(MB/s) Rate(MFlop/s) Avg time Min time Max time
----------------------------------------------------------------------------
Copy: 370203.49 - 0.0052 0.0052 0.0053
----------------------------------------------------------------------------
But the Rust version with Rayon takes 0.0083 seconds on average:
Function Rate(MB/s) Rate(MFlop/s) Avg time Min time Max time
----------------------------------------------------------------------------
Copy: 247803.58 - 0.0083 0.0077 0.0086
----------------------------------------------------------------------------
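For context on the Rate column: the copy kernel reads and writes each of the N elements once, so 2 × N × 8 bytes move per run, and MB/s follows directly from the measured time. A small helper to reproduce the numbers above (a sketch; the function name is mine):

```rust
/// Bandwidth of the copy kernel: one 8-byte read and one 8-byte
/// write per element, divided by the runtime in seconds, in MB/s.
fn copy_rate_mb_per_s(n: usize, seconds: f64) -> f64 {
    2.0 * n as f64 * 8.0 / seconds / 1.0e6
}
```

With N = 120_000_000 and 0.0052 s this gives roughly 369,000 MB/s, consistent with the ~370,000 MB/s in the C table above.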
Of course, it does not seem fair to compare C+OpenMP with Rust+Rayon.
But I want to understand where I am losing performance. My belief is that if C can achieve such performance, so can Rust.
I have also tried multiple manual versions: vectorizing the code with AVX-512 instructions in unsafe Rust, manual loop unrolling, etc. But nothing gets past this barrier of 0.0077 seconds in Rust.
My questions are:
- Is this performance overhead from Rayon? From flamegraphs, I could not see much overhead in Rayon's functions.
- Am I missing something while writing the parallelized kernel that could potentially stop me from extracting the same performance as C?
Thank you in advance to the Rust community. =)