Coding Best Practice - Adding two vecs

I have two vecs of the same size. Which is the right way to do this?

for (a, b) in a.iter_mut().zip(b) {
    *a += b;
}

Is this the quickest and most optimised way? Note: I tried par_iter, but it didn't work.


You could also replace the for loop:
a.iter_mut().zip(b).for_each(|(a, b)| *a += b);

This should compile to basically the same code as yours.
Since the code is relatively simple, it should also be quite optimal for a single thread.

Depending on the number of elements, multiple threads might be beneficial, and I think par_iter is the way to go. Unfortunately I'm on my phone right now, so I can't really test code.

I'd do something like this, yeah:

for (a, b) in a.iter_mut().zip(b.iter()) {
    *a += *b;
}

Assuming that matches your post's intention, I think it should be pretty good. The compiler should be able to use SIMD operations here.

If you want to parallelize, I think it might be better to use rayon::join rather than par_iter or par_iter_mut. I haven't benchmarked it, so I don't have concrete evidence. But par_iter goes through each element individually, and even if it bunches them together I would expect there to be overhead executing the function on each element when the computation is so small. join allows you to control where you stop splitting the task into different tasks, so you can do a reasonable number of additions in sequence and keep a good cache-friendly access pattern.

This is what I'd write to do this in parallel. Feel free to mess around with CHUNK_SIZE and see what results in the best speed on your hardware, and to compare this with the naive par_iter and non-parallel versions. I'd be interested to see the results!

use std::ops::AddAssign;

/// Play around with different CHUNK_SIZEs as needed to find the best one.
const CHUNK_SIZE: usize = 4 * 1024;

pub fn par_add_all<T: Send + Sync + AddAssign + Copy>(target: &mut [T], src: &[T]) {
    assert_eq!(target.len(), src.len());

    if target.len() < CHUNK_SIZE {
        for (a, b) in target.iter_mut().zip(src.iter()) {
            *a += *b;
        }
    } else {
        let midpoint = target.len() / 2;
        let (left_target, right_target) = target.split_at_mut(midpoint);
        let (left_src, right_src) = src.split_at(midpoint);
        rayon::join(
            || par_add_all(left_target, left_src),
            || par_add_all(right_target, right_src),
        );
    }
}

If you're interested in this parallelism strategy, I think it's called divide and conquer. It allows us to spend only O(log(n)) time splitting the task up into manageable chunks, and then to execute each of those chunks on potentially different threads.


I think you got something wrong there.
The "tree" of splits is only O(log(n)) deep, but there are still O(n) splits (more precisely, in this case, around n/CHUNK_SIZE).

Also, it should be possible to do this without recursion, but otherwise your answer seems spot on to me :slight_smile:
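
For example, a non-recursive version could hand rayon fixed-size chunks directly. This is an untested sketch from my head, so treat it as an idea rather than working code (par_add_all_chunked is just a name I made up; same CHUNK_SIZE idea as above):

use rayon::prelude::*;
use std::ops::AddAssign;

const CHUNK_SIZE: usize = 4 * 1024;

pub fn par_add_all_chunked<T: Send + Sync + AddAssign + Copy>(target: &mut [T], src: &[T]) {
    assert_eq!(target.len(), src.len());
    // Let rayon schedule the CHUNK_SIZE-sized pieces; no recursion needed.
    target
        .par_chunks_mut(CHUNK_SIZE)
        .zip(src.par_chunks(CHUNK_SIZE))
        .for_each(|(t, s)| {
            // Within a chunk, add sequentially to keep the access pattern cache-friendly.
            for (a, b) in t.iter_mut().zip(s) {
                *a += *b;
            }
        });
}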

Right! It's only O(log(n)) assuming infinite parallelism, which we don't have. Thanks for catching that.

The code which I am planning to port has a lot of array manipulation. I am worried that adding this code will make it look more complex :neutral_face:

Is there maybe a crate available on crates.io which can do these computations?

You could use ndarray.
That crate tries to be something like numpy for Rust.
I haven't used it in quite some time, so I can't really help on how to use it.
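
If I remember correctly, elementwise addition there is just +=, something like this (a minimal sketch from memory, so double-check it against the ndarray docs):

use ndarray::Array1;

fn main() {
    let mut a = Array1::from(vec![1.0, 2.0, 3.0]);
    let b = Array1::from(vec![4.0, 5.0, 6.0]);
    // Elementwise addition without writing the loop yourself.
    a += &b;
    assert_eq!(a, Array1::from(vec![5.0, 7.0, 9.0]));
}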

Regarding your original question, for me this compiles in the playground:

use rayon::prelude::*;

fn test(a: &mut [i32], b: &[i32]) {
    a.par_iter_mut().zip(b).for_each(|(a, b)| *a += *b);
}

Regarding what daboross wrote about par_iter maybe not being ideal here: the docs say that it "automatically adapts for optimal performance", whatever that means. If you are up for some benchmarking, you could compare against this:

use rayon::prelude::*;

fn test(a: &mut [i32], b: &[i32]) {
    a.par_chunks_mut(32)
        .zip(b.par_chunks(32))
        .map(|(a, b)| a.par_iter_mut().zip(b))
        .flatten()
        .for_each(|(a, b)| *a += *b);
}

It'd definitely make it more complex - but maybe that complexity could be abstracted out?
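
For example, the parallel version could hide behind a small helper trait so call sites stay one-liners. A hypothetical sketch (AddAllExt and add_all_from are names I just made up), reusing the par_add_all from above:

use std::ops::AddAssign;

pub trait AddAllExt<T> {
    fn add_all_from(&mut self, src: &[T]);
}

impl<T: Send + Sync + AddAssign + Copy> AddAllExt<T> for [T] {
    fn add_all_from(&mut self, src: &[T]) {
        // Delegate to the divide-and-conquer implementation from earlier.
        par_add_all(self, src);
    }
}

Call sites then just read a.add_all_from(&b), and you could swap the implementation without touching them.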

I think it depends on your goals, whether parallelism is actually necessary, and whether something like what I wrote is actually that much better than the solution using par_iter_mut().

Do you have an existing codebase which is too slow that you're trying to make it faster - or are you writing code from scratch, and just want to use best practices?

Thanks. Maybe I can think about abstracting this. I wanted to have minimal code so that I can prove that moving the other codebase to Rust is useful for performance.

If the complexity can be abstracted out really well, I would go with this statement. But you should still try to minimize it. For example, I would guess my second code example should be as fast as, if not faster than, yours (I didn't benchmark it, I'm ashamed to say, but I'm on mobile), and I would say it is less complicated.

Agreed here! The only reason I went with my version was because I believe it will give better performance in general, due to reasonably sized chunks rather than having to manage each individual addition as a separate bit of work.

On the topic of benchmarking, I've written some, but I'm afraid my dual-core laptop is not beefy enough to benefit from the parallelism at all. My results adding some simple 320 KiB arrays are:

par_add_all_join        time:   [11.485 us 11.670 us 11.911 us]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

seq_add_all             time:   [2.0696 us 2.0930 us 2.1172 us]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

par_add_all_par_iter    time:   [21.167 us 21.571 us 22.014 us]
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe

The benchmark repo is here, if either of you has a bigger computer to test it on: GitHub - daboross/rayon-array-add-benchmarks: Benchmarks 3 versions of an add_all function.

I will give it a shot :slight_smile:

In my experience, those 1024*10 elements are nowhere near enough to benefit from parallelization (if all you do is addition and have SSE or even AVX available).

I edited your benchmarking code a bit, and am now running the benchmarks. If every benchmark requires around 10 seconds, the results should be here in... around 800 seconds :slight_smile:

Hm, criterion wasn't creating fancy plots; I fixed it and restarted the benchmarks.

Thanks!

Worth noting I pushed an update with some bigger sizes to the repository. I ran those on my computer, but with anything from a ~300 MiB to GiB-sized array, I get minimal speedup from either parallel version, and about equal performance between the two.

If you want, I can create a pull request with my changes.

If you want, feel free! I'd be fine just seeing what you get for results too though - I don't think running anything different on my computer is going to result in anything interesting.

Results:

Which kind of makes sense, assuming that adding two vecs is a memory-bound task and not a CPU-bound task :sweat_smile:

My conclusion from this is: use the sequential, single-threaded one. It's the simplest and in most situations the fastest one. It also has the additional benefit that, if you have other stuff to do, you still have unused CPU cores left :slight_smile:

Edit:
Also, the sequential version benefits most from additional compiler options, for example target-cpu=native.
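
For anyone following along: one common way to pass that flag, assuming a standard cargo setup, is via RUSTFLAGS, e.g.

RUSTFLAGS="-C target-cpu=native" cargo bench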


It might be difficult to see, but as long as it isn't memory bound (in my case up to around 8k elements, depending on the cache size), the sequential version is roughly two times faster than before.

Edit:
I wondered why the parallel chunk version didn't perform well, and it seems a parallel iterator over slices doesn't like to get flattened (which obviously doesn't change that the sequential version is still faster).


Thanks @raidwas and @daboross for these detailed plots :slight_smile:

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.