It seems that the compiler cannot vectorize the flat_map, but you can use a for loop instead:
use std::ops::Add;

pub fn array_sums<T: Default + Copy + Add<Output = T>>(a: &[T], b: &[T]) -> Vec<T> {
    let mut sums = vec![T::default(); a.len() * b.len()];
    for (chunk, &a) in sums.chunks_mut(b.len()).zip(a.iter()) {
        for (sum, &b) in chunk.iter_mut().zip(b.iter()) {
            *sum = b + a;
        }
    }
    sums
}
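For example, with small hypothetical inputs, the result is laid out row-major: one row per element of a, one column per element of b:

```rust
use std::ops::Add;

// array_sums as given in the answer above.
pub fn array_sums<T: Default + Copy + Add<Output = T>>(a: &[T], b: &[T]) -> Vec<T> {
    let mut sums = vec![T::default(); a.len() * b.len()];
    for (chunk, &a) in sums.chunks_mut(b.len()).zip(a.iter()) {
        for (sum, &b) in chunk.iter_mut().zip(b.iter()) {
            *sum = b + a;
        }
    }
    sums
}

fn main() {
    // One row per element of `a` (here [1, 2]),
    // one column per element of `b` (here [10, 20, 30]).
    let sums = array_sums(&[1, 2], &[10, 20, 30]);
    assert_eq!(sums, vec![11, 21, 31, 12, 22, 32]);
}
```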
I think my problem is that I have big arrays (around 8388608 elements), so allocating another array of that size blows up my memory usage. I think I should treat them as iterators and compute the sums lazily, without reserving space for them, but I don't know if that is possible.
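A lazy version is possible; here is a minimal sketch (the name lazy_sums is mine) that yields each sum on demand with flat_map instead of collecting into a Vec. Note the earlier caveat that the compiler may not vectorize flat_map, so measure before committing to it:

```rust
use std::ops::Add;

// Yield the cross-sums of `a` and `b` lazily: nothing is allocated,
// each sum is computed only when the iterator is advanced.
fn lazy_sums<'a, T>(a: &'a [T], b: &'a [T]) -> impl Iterator<Item = T> + 'a
where
    T: Copy + Add<Output = T> + 'a,
{
    a.iter().flat_map(move |&x| b.iter().map(move |&y| x + y))
}

fn main() {
    let a = [1u32, 2];
    let b = [10u32, 20];
    // Consume the sums streaming, e.g. find the minimum without a Vec.
    let min = lazy_sums(&a, &b).min().unwrap();
    assert_eq!(min, 11);
    println!("min sum = {min}");
}
```

This only helps if you can fold or stream over the sums; if you truly need all of them materialized at once, the allocation is unavoidable.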
for k in 0..BLOCK_SIZE {
    for i in 0..BLOCK_SIZE {
        for j in 0..BLOCK_SIZE {
            let sum = kj[k * BLOCK_SIZE + j] + ik[i * BLOCK_SIZE + k];
            let cell = &mut ij[i * BLOCK_SIZE + j];
            if sum < *cell {
                *cell = sum;
            }
        }
    }
}
Have you looked at cache behavior? In your loops, ij and kj are accessed at stride 1, but ik is accessed at stride BLOCK_SIZE. Since ik[i * BLOCK_SIZE + k] doesn't depend on j, the compiler should hoist that load out of the inner j loop, so the large stride might not be an issue, but I'd check the generated code to make sure.
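If you'd rather guarantee the hoist than trust the optimizer, you can do it by hand. A sketch (the function name relax_block, the u32 element type, and the BLOCK_SIZE value are assumptions; the indexing matches your snippet):

```rust
const BLOCK_SIZE: usize = 4; // hypothetical; use your real block size

// One pass of the blocked relaxation from the question, with the strided
// `ik` load hoisted out of the inner `j` loop by hand.
fn relax_block(ij: &mut [u32], ik: &[u32], kj: &[u32]) {
    for k in 0..BLOCK_SIZE {
        for i in 0..BLOCK_SIZE {
            // ik[i * BLOCK_SIZE + k] does not depend on j: load it once here.
            let ik_val = ik[i * BLOCK_SIZE + k];
            for j in 0..BLOCK_SIZE {
                let sum = kj[k * BLOCK_SIZE + j] + ik_val;
                let cell = &mut ij[i * BLOCK_SIZE + j];
                if sum < *cell {
                    *cell = sum;
                }
            }
        }
    }
}

fn main() {
    // With every ik and kj entry equal to 1, each candidate sum is 2,
    // so every cell of ij drops from 100 to 2.
    let mut ij = vec![100u32; BLOCK_SIZE * BLOCK_SIZE];
    let ik = vec![1u32; BLOCK_SIZE * BLOCK_SIZE];
    let kj = vec![1u32; BLOCK_SIZE * BLOCK_SIZE];
    relax_block(&mut ij, &ik, &kj);
    assert_eq!(ij, vec![2u32; BLOCK_SIZE * BLOCK_SIZE]);
}
```

The inner loop then touches only stride-1 data (kj and ij), which is the access pattern the vectorizer handles best.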