Optimize array sums

Hello !

I need to sum each element of the array a with each element of b. My first attempt is this:

let sums = a
        .iter()
        .flat_map(|ik| b.iter().map(move |kj| ik + kj))
        .collect::<Vec<_>>();

Do you have any idea how to improve performance on this?

Thank you !

It would help to pre-allocate the full capacity, because the iterator can't predict the number of items from flat_map.

a and b are slices with N elements. They were allocated before. Why the iterator could't predict ?

Because the iterator doesn't know what that closure is doing, only the type that it returns.

1 Like

With this happens the same, I think that this could be improved

    let mut sums = Vec::with_capacity(block_length);
    sums = a
        .iter()
        .flat_map(|ik| b.iter().map(move |kj| ik + kj))
        .collect();

Perhaps with an external crate ?

The collect creates a new collection, throwing away your reserved one. Instead, try sums.extend(a.iter().flat_map(...)).

2 Likes

You might be interested in https://docs.rs/itertools/0.9.0/itertools/macro.iproduct.html

With extend it's the same as before.

I tried with iproduct! but is is extremely slow, I don't know why

Are you building in --release mode? The default is debug mode, which can be orders of magnitude slower.

Yes ! With features for avx512

and with this?

extern crate num;
use num::Num;

fn sum_slices<T: Num + Copy>(slice1: &[T], slice2: &[T]) -> Vec<T> {
    slice1.iter().zip(slice2).map(|(&a, &b)| a + b).collect()
}

fn main() {
    let a = [1, 2, 3, 4, 5, 10, 10];
    let b = [2, 3, 45, 6, 6, 10, 10];

    let sum = sum_slices(&a, &b);
    println!("sum: {:?}", sum);

}

It seems that the compiler can not vectorize the flat_map, but you can use a for loop instead.

use std::ops::Add;
use std::default::Default;

pub fn array_sums<T: Default+Copy+Add<Output=T>>(a: &[T], b: &[T]) -> Vec<T> {
    let mut sums = vec![Default::default(); a.len()*b.len()];
    for (chunk,&a) in sums.chunks_mut(b.len()).zip(a.iter()) {
        for  (sum,&b) in chunk.iter_mut().zip(b.iter()) {
            *sum = b + a;
        }
    }
    sums
}

I think my problem is that I have big arrays (like 8388608 positions), so, I have my arrays and if I create another variable I am collapsing the use of memory. I think I should use it as iterators, without the need to save space for the sums, do it as "lazy", but I do not know if it is possible.

My goal is to turn this into efficient iterators.

for k in 0..BLOCK_SIZE {
    for i in 0..BLOCK_SIZE {
        for j in 0..BLOCK_SIZE {
            let sum = kj[k * BLOCK_SIZE + j] + ik[i * BLOCK_SIZE + k];
            let cell = &mut ij[i * BLOCK_SIZE + j];
            if sum < *cell {
                *cell = sum;
            }
        }
    }
}

Have you looked at cache behavior? In your loops, ij and kj are accessed at stride 1, but ik is accessed at stride BLOCKSIZE. The compiler should compute it once in the i loop, so the large stride might not be an issue, but I'd check to make sure.

The code with 3 fors works well, but I think that it isn't rust idiomatic. I feel that this could be improved a lot using the correct iterators.

Iterators are nice, but there's no reason to force it. Quite often explicit loops are easier to understand.

2 Likes

But what about bounding checks ? With iterators disappear, or not ?

It looks like BLOCKSIZE is a const, so the compiler should be able to skip the checks. Your should verify that, of course.