Assign to an array in parallel


#1

I’d like to compute a gradient on an n-dimensional field in parallel, using rayon and ndarray. In order to assign to an element at an index, I also need access to the neighbouring indices.

A contrived 1D example of what I’m trying to achieve looks like this:

extern crate rayon;
extern crate ndarray;

use rayon::prelude::*;
use ndarray::Array;

fn main() {
    let a = vec![1,2,3,4,5,6,7,8,9,10];
    let mut b = vec![0; 10];

    (1usize..8)
        .into_par_iter()
        .map(|i|
            b[i] = a[i+1] - a[i-1]
        );

    println!("array: {:?}", b);
}

Which fails miserably with

error[E0387]: cannot borrow data mutably in a captured outer variable in an Fn closure

How can I achieve this?


#2

You can iterate mutably over b instead of iterating over the indices. That is, b.par_iter_mut().enumerate()....
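
A minimal sketch of that suggestion (my own illustration, not taken from the thread): with par_iter_mut() each closure call owns a mutable reference to exactly one element of b, so there is no shared mutable capture for rayon to reject.

extern crate rayon;

use rayon::prelude::*;

fn main() {
    let a = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let mut b = vec![0; 10];

    // enumerate() supplies the index needed to read a's neighbours;
    // skip/take restrict the pass to the interior so that i-1 and i+1
    // stay in bounds.
    b.par_iter_mut()
        .enumerate()
        .skip(1)
        .take(8)
        .for_each(|(i, bi)| *bi = a[i + 1] - a[i - 1]);

    println!("array: {:?}", b);
}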


#3

Note that there might be little benefit in carrying out this operation in parallel: as presented, the gradient computation is memory-bound and could well saturate the CPU’s memory bus with a single core.
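
To put rough numbers on that (my estimate, not from the thread): with f64 data, each output element needs only one subtraction, yet the pass streams roughly 16–24 bytes per element (one read of the input, one write of the output, plus a possible write-allocate). At a typical ~20 GB/s of DRAM bandwidth, that caps throughput around 10^9 elements per second, a rate a single core’s ALUs handle easily, so additional cores would mostly wait on memory.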


#4

If I understand correctly, .enumerate() isn’t that helpful when dealing with multi-dimensional arrays, because it only yields a flat, linearly incrementing index. I’d like to do this for an n-dimensional array, which is why I was using indices.
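
For what it’s worth, when the storage is contiguous and row-major, the flat index from enumerate() can be decomposed back into multi-dimensional indices. A 2D sketch of that workaround (my own illustration, with made-up names field and grad, and borders simply clamped):

extern crate rayon;

use rayon::prelude::*;

fn main() {
    let (ny, nx) = (4usize, 5usize);
    // A 2D field stored row-major in a flat Vec.
    let field: Vec<i32> = (0..(ny * nx) as i32).collect();
    let mut grad = vec![0; ny * nx];

    grad.par_iter_mut().enumerate().for_each(|(k, g)| {
        // Recover (y, x) from the flat index k.
        let (y, x) = (k / nx, k % nx);
        // Central difference in x, clamped at the borders.
        let xl = if x == 0 { 0 } else { x - 1 };
        let xr = if x == nx - 1 { x } else { x + 1 };
        *g = field[y * nx + xr] - field[y * nx + xl];
    });

    println!("gradient: {:?}", grad);
}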


#5

Yes, that may well be the case. That’s why I wanted to benchmark my code, to see whether parallelization helps here: most of the time in my code is spent building a gradient field for rather big 2D arrays.
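
For a first rough measurement, something like the following already shows whether the pass is worth parallelizing (a minimal std-only sketch of mine; for trustworthy numbers a dedicated benchmark harness is the better tool):

use std::time::Instant;

fn main() {
    let n = 1_000_000;
    let a: Vec<f64> = (0..n).map(|i| i as f64).collect();
    let mut b = vec![0.0; n];

    let t = Instant::now();
    for i in 1..n - 1 {
        b[i] = a[i + 1] - a[i - 1];
    }
    let dt = t.elapsed();

    // Print a result so the loop is not optimized away entirely.
    println!("b[n/2] = {}, elapsed: {:?}", b[n / 2], dt);
}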


#6

If parallelization does not help, perhaps another path to improvement would be to compute a lower-resolution gradient and interpolate it. We use that trick at work to sidestep the memory bottleneck of gradient computations in scenarios where maximal derivative precision is not actually needed.

Heck, the gap between FLOPs and memory bandwidth is so wide these days that for smooth signals, even cubic interpolation can be beneficial over linear interpolation, as long as it allows you to drop a couple million more data points from the computation :)
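
A toy 1D sketch of the idea (my own illustration, not the poster’s production code; borders left untouched): evaluate the central difference only on every other sample, then fill the skipped points by linear interpolation, roughly halving the reads of the input per derivative pass.

fn main() {
    let a: Vec<f64> = (0..16).map(|i| (i as f64 * 0.4).sin()).collect();
    let n = a.len();
    let mut grad = vec![0.0; n];

    // Central differences on the coarse grid (every other interior point).
    let mut i = 2;
    while i + 1 < n {
        grad[i] = 0.5 * (a[i + 1] - a[i - 1]);
        i += 2;
    }
    // Linear interpolation for the skipped points.
    let mut i = 3;
    while i + 2 < n {
        grad[i] = 0.5 * (grad[i - 1] + grad[i + 1]);
        i += 2;
    }

    println!("{:?}", grad);
}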


#7

There’s also ndarray-parallel built on rayon, which has custom mutable iterators.


#8

This crate does not yet implement par_indexed_iter, but I think bluss is on it: https://github.com/bluss/rust-ndarray/issues/279. I’ve posted the same question there, and he suggested that SIMD might be more helpful in this case.

SIMD is on my TODO list anyway.


#9

Hmm, interesting idea. I’ll keep that in mind. I need to think about the implications for precision, which is critical in my case.


#10

In case anybody is interested: I ended up writing my code in a single-threaded way that can easily be auto-vectorized, and gained an order of magnitude in micro-benchmarks.

The code looks something like this, for a central difference quotient (neglecting prefactors) with periodic boundaries (i.e. wrap-around) in the x-direction of a 2D array:

let mut res: Array<f64, Ix2> = Array::zeros((sx, sy));

// bulk: res[i, ..] = field[i+1, ..] - field[i-1, ..]
{
    let mut s = res.slice_mut(s![1..-1, ..]);
    s += &field.slice(s![2.., ..]);
    s -= &field.slice(s![..-2, ..]);
}
// borders, wrap around
{
    // first row: field[1, ..] - field[sx-1, ..]
    let mut s = res.slice_mut(s![..1, ..]);
    s += &field.slice(s![1..2, ..]);
    s -= &field.slice(s![-1.., ..]);
}
{
    // last row: field[0, ..] - field[sx-2, ..]
    let mut s = res.slice_mut(s![-1.., ..]);
    s += &field.slice(s![..1, ..]);
    s -= &field.slice(s![-2..-1, ..]);
}

To gain a little more, I even used unsafe code to allocate uninitialized memory instead of zeroing it, but that did not make much of a difference in my case.
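
For reference, a sketch of what that last tweak might look like (my guess, using ndarray’s old unsafe uninitialized constructor, which has since been deprecated; with uninitialized memory the first write into each slice must be an assignment rather than +=):

// Allocate without zero-filling; every element must be written before use.
let mut res: Array<f64, Ix2> = unsafe { Array::uninitialized((sx, sy)) };
{
    let mut s = res.slice_mut(s![1..-1, ..]);
    s.assign(&field.slice(s![2.., ..])); // first write: assign, not +=
    s -= &field.slice(s![..-2, ..]);
}
// ... the borders are handled the same way ...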