So, I have code in Rust and in C that finds the maximum number in a vector (in my case, the largest number is always in the last position). The thing is, I'm using Linux `time` to measure the performance of these programs, and C gives its result about 30x faster than Rust (real time). I'm fairly sure my C code is working correctly, but since Rust is taking so long, I think I'm doing something wrong.
I would appreciate it if someone could take a look and tell me whether I'm doing something wrong in my Rust code, please!
1. I did not measure execution time in the playground; I compiled it on my Ubuntu (i5, 4 cores) machine.
2. I tried to use Rayon, but I don't think I know how to use it properly. I tried to compile it like any normal Rust program, and it keeps giving an error that it can't find the crate.
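On the "can't find crate" error: bare `rustc` does not resolve external crates, so Rayon has to be pulled in through a Cargo project. A minimal setup sketch (the project name `find_max` is just a placeholder):

```shell
# Rayon can't be used with a bare `rustc` invocation; create a Cargo
# project, declare the dependency, and build in release mode:
cargo new find_max          # hypothetical project name
cd find_max
cargo add rayon             # writes `rayon = "1"` into Cargo.toml
cargo run --release         # release mode matters for benchmarks
```

Note that `--release` is essential here: debug builds of iterator-heavy Rust code are often an order of magnitude slower, which alone could explain a large gap against C.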
You are cloning the vector in the Rust version. Borrow a slice instead. There is no reason to clone the vector. That's a lot of heap allocation that the C version is not doing.
You can’t borrow a slice of the vector here because thread::spawn needs a 'static closure, meaning it’s not borrowing anything (other than static refs, but that’s inapplicable to this case). That’s why the Arc is there.
No, you are correct. I missed that you renamed/re-wrapped vetor in an Arc. You'd be better off using split_at and borrowing the subset of the vector that each thread needs. Less overhead. Using Arc like this seems counter-productive.
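For illustration, a sketch of the borrow-a-slice idea using scoped threads (`std::thread::scope`, stable since Rust 1.63): each thread borrows its chunk of the slice directly, so neither `Arc` nor cloning is needed. The function name `scoped_max` and the chunking strategy are my own choices, not from the original code:

```rust
use std::thread;

// Scoped threads may borrow from the enclosing scope, so each thread
// gets a disjoint &[u8] chunk instead of a cloned Arc<Vec<u8>>.
fn scoped_max(data: &[u8], n_threads: usize) -> u8 {
    // Round up so the last chunk absorbs any remainder.
    let chunk_size = (data.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().copied().max().unwrap_or(u8::MIN)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .max()
            .unwrap_or(u8::MIN)
    })
}

fn main() {
    let data: Vec<u8> = (0..1_000u32).map(|i| (i % 251) as u8).collect();
    println!("{}", scoped_max(&data, 4)); // prints 250
}
```

`thread::scope` guarantees all spawned threads finish before the borrow of `data` ends, which is what makes the `'static` bound on plain `thread::spawn` unnecessary here.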
For better comparison, I modified the code slightly, and now have three methods doing the same, one single-threaded, one using rayon, one explicitly multi-threaded. The explicit version is by far the fastest!
// Shared setup for the three variants below (the original post omitted
// these; N_THREADS is an example value, adjust to your core count).
use rayon::prelude::*;
use std::sync::Arc;
use std::thread;

const N_THREADS: usize = 4;
const ELEMENTS: usize = 15_999_999_999;

fn single_threaded(data: Arc<Vec<u8>>) -> u8 {
    data.iter()
        .fold(std::u8::MIN, |m, elem| std::cmp::max(m, *elem))
}

fn rayon_based_mt(data: Arc<Vec<u8>>) -> u8 {
    data.par_iter()
        .reduce(|| &std::u8::MIN, |a, b| std::cmp::max(a, b))
        .clone()
}

fn improved_original(data: Arc<Vec<u8>>) -> u8 {
    //------------------------------------------------
    let mut threads = vec![];
    for i in 0..N_THREADS {
        let v = data.clone();
        threads.push(thread::spawn(move || {
            let low = i * (ELEMENTS / N_THREADS);
            let high = if (i == N_THREADS - 1) && ((ELEMENTS % N_THREADS) != 0) {
                ELEMENTS
            } else {
                low + ELEMENTS / N_THREADS
            };
            v[low..high]
                .iter()
                .fold(std::u8::MIN, |m, elem| std::cmp::max(m, *elem))
        }));
    }
    threads
        .into_iter()
        .map(|t| t.join().unwrap())
        .max()
        .unwrap()
}
On quite a strong laptop, with release build, I got these results (ELEMENTS = 15_999_999_999):
single_threaded version took 1840 ms
rayon_based_mt version took 1497 ms
improved_original version took 721 ms
Unless I chose a bad implementation for Rayon, one can say that optimizing the array evaluation by using iterators and adapters has the biggest effect. Distributing work explicitly is ugly, but it can pay off.
For cases where the actual work (cmp::max) is very small, I would suggest telling Rayon not to divide it too far, something like data.par_iter().with_min_len(1_000_000).
My guess is that it's currently getting lost in the overhead of trying to adaptively split the array way down. Rayon doesn't know how "heavy" the work actually is, or whether each item has proportional amounts of work, etc. Maybe we could be a little smarter about this, but it's hard to generalize.
Side note, Iterator and ParallelIterator both have max() methods already.
Tuning with with_min_len() only gave a slight improvement.
The built-in max() functions are surprisingly much slower:
single_threaded version took 1900 ms
single_threaded_max version took 6336 ms
rayon_based version took 1423 ms
rayon_based_max version took 2282 ms
rayon_based_tuned version took 1312 ms
improved_original version took 652 ms
Original version took 2919 ms