Hi, everyone!
Before you close the tab: yes, I built the Rust code with the --release option.
What I am trying to achieve here is Rust performance within at least the same order of magnitude as C++ in this simple benchmark.
Here are the details: the idea of this benchmark is to sum all elements in a vector that are greater than 0.5. Here is the reference code using NumPy:
In [1]: import numpy as np
In [2]: v = np.random.rand(100000000)
In [3]: %timeit np.sum(v[v>0.5])
907 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If we use jit from numba, the time can be significantly reduced:
In [4]: from numba import jit
In [5]: @jit
   ...: def sum_selected(v: np.ndarray):
   ...:     res_sum = np.float64(0.0)
   ...:     for i in v:
   ...:         if i > 0.5:
   ...:             res_sum += i
   ...:     return res_sum
   ...:
In [6]: %timeit sum_selected(v)
161 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, here is the Rust version using iterators:
use rand::prelude::*;
use std::time::Instant;

fn main() {
    let mut rng = rand::rng();
    let arr: Vec<f64> = Vec::from_iter((0..100000000).map(|_| rng.random::<f64>()));
    let t1 = Instant::now();
    let sum: f64 = arr
        .iter()
        .filter(|&&x| x > 0.5f64)
        .sum();
    let t2 = t1.elapsed();
    println!("{:8.4},{}", sum, t2.as_secs_f64());
}
The mean elapsed time is 0.22727 s ± 0.00361 s. As you can see, it is even slower than Python with jit.
I also tried looping over the vector directly but got almost no difference compared to the iterator version: 0.22353 s ± 0.00308 s.
use rand::prelude::*;
use std::time::Instant;

fn main() {
    let mut rng = rand::rng();
    let arr: Vec<f64> = Vec::from_iter((0..100000000).map(|_| rng.random::<f64>()));
    let mut sum: f64 = 0.0;
    let t1 = Instant::now();
    for el in arr.iter() {
        if *el > 0.5f64 {
            sum += *el;
        }
    }
    let t2 = t1.elapsed();
    println!("{:8.4},{}", sum, t2.as_secs_f64());
}
Both examples were built with the following options:
RUSTFLAGS="-C target-cpu=native -C target-feature=+avx2" cargo run --release
In the emitted assembly I see that the code was vectorized, but it is still slower than Python with jit.
The C++ code below is about three times faster than the Rust version: 0.07285 s ± 0.00139 s. (I also benchmarked Fortran, with results similar to C++, but I will omit the Fortran code.)
#include <chrono>
#include <climits>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <stdfloat>
#include <vector>

#define N 100000000

int main()
{
    std::srand(std::time(NULL));
    std::vector<std::float64_t> arr;
    for (size_t i = 0; i < N; ++i) {
        arr.push_back(static_cast<std::float64_t>(std::rand()) / RAND_MAX);
    }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    std::float64_t sum_res = 0.0f64;
    for (auto el : arr) {
        if (el > 0.5f64)
            sum_res += el;
    }
    std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << sum_res << "," << static_cast<std::float64_t>(elapsed) / 1000000 << std::endl;
}
Here is how I compiled the C++ code:
g++ -Wall -Wextra -Ofast -march=native -mtune=native -ffast-math -std=c++23 select_elements.cpp
Why is the Rust variant slower than C++? Do you have any ideas how I can speed up the code? I am not an expert in Rust, so maybe I am doing something wrong or missing important compiler options. I am aware of the simd and intrinsics modules, but they require nightly Rust and, frankly speaking, ask me to do the vectorization manually, in contrast to C++ where the work is delegated to the compiler.
Here are the versions of the software I used:
gcc version 14.2.1 20250207
rust 1.85.0
python-numba 0.61.0
python-numpy 2.2.3