Parallelizing with a mutable input

Hi everyone, I'm a beginner in Rust looking for a bit of help with my code. I'm trying to write high-performance code for a specific linear algebra operation (sparse-dense matrix multiplication for a custom sparse layout), but I'm running into some issues with parallelization.

I have this rather straightforward function

use nalgebra::{DMatrix, DVector};

type Matrix = DMatrix<f32>;
type Vector = DVector<f32>;

fn diamul(ks: &[isize], xs: &[Vector], y: &Matrix, z: &mut Matrix) {
    xs.iter()
        .zip(ks.iter())
        .for_each(|(x, &k)| _single_mul(k, x, y, z));
}

fn _single_mul(k: isize, x: &Vector, y: &Matrix, z: &mut Matrix) {
    // roughly implementing z += x[:, None] * y[:, :] in NumPy-like form
}

which works as intended. For performance, I would like to convert the iteration into

xs.par_iter()
    .zip(ks.par_iter())

using rayon. However, this raises the following error:

error[E0596]: cannot borrow `*z` as mutable, as it is a captured variable in a `Fn` closure

I understand this is because multiple threads want to act on z, which is mutable, and that could create conflicts. However, here I just want to accumulate the results of independent computations into z, which should be parallelizable without problems. Is there a way I could achieve this?

I am providing the full code below in case that's useful. Thank you!

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::prelude::*;
use rayon::prelude::*;
use nalgebra::{DMatrix, DVector};

const N: usize = 1024;
const D: usize = 3;
type Matrix = DMatrix<f32>;
type Vector = DVector<f32>;

fn _get_slice(k: isize) -> (usize, usize, usize) {
    if k >= 0 {
        let idx_in = k as usize;
        let idx_out = 0;
        let size = N - (k as usize);
        (idx_in, idx_out, size)
    } else {
        let idx_in = 0;
        let idx_out = -k as usize;
        let size = N - (-k as usize);
        (idx_in, idx_out, size)
    }
}

fn _single_mul(k: isize, x: &Vector, y: &Matrix, z: &mut Matrix) {
    let (idx_in, idx_out, size) = _get_slice(k);
    let _x = x.rows(idx_in, size);
    let _y = y.columns(idx_in, size);
    let mut _z = z.columns_mut(idx_out, size);

    _z.column_iter_mut()
        .zip(_y.column_iter())
        .zip(_x.iter())
        .for_each(|((mut z_col, y_col), x_val)| {
            z_col.axpy(*x_val, &y_col, 1.0);
        });
}

fn diamul(ks: &[isize], xs: &[Vector], y: &Matrix, z: &mut Matrix) {
    xs.par_iter()
        .zip(ks.par_iter())
        .for_each(|(x, &k)| _single_mul(k, x, y, z));
}

fn benchmark_diamul(c: &mut Criterion) {
    // Generate random diagonals and offsets, and other matrix
    let mut rng = rand::thread_rng();
    let ks = vec![-1, 0, 1];
    let xs = (0..D)
        .map(|_| Vector::from_fn(N, |_, _| rng.gen::<f32>()))
        .collect::<Vec<_>>();
    let y = Matrix::from_fn(N, N, |_, _| rng.gen::<f32>());
    let mut z = Matrix::zeros(N, N);

    c.bench_function("diamul", |b| {
        b.iter(|| {
            diamul(black_box(&ks), black_box(&xs), black_box(&y), black_box(&mut z));
        })
    });
}

criterion_group!(benches, benchmark_diamul);
criterion_main!(benches);

If you want to use rayon and safe code straightforwardly, you have to express this as a parallel iteration over parts of z. That doesn't look directly possible.

You could express your accumulation as a reduction: start with xs.par_iter().zip(ks.par_iter()), and call fold() on that to construct and mutate a per-thread matrix, then reduce those matrices by summing them. The cost of this is that you have to allocate several output matrices, but in exchange, each one is owned by a single task and can be mutated freely.

If you want to take an unsafe approach, then you need to either work with raw pointers to the contents of z, or make z's element type an UnsafeCell<f32> that is accessed through &Matrix (no mut). Then, as long as the accesses to z genuinely never overlap, you can run the code you already tried. But your program has undefined behavior if the index computations contain a mistake that causes an overlap.

A brute-force but safe approach is to make the element type AtomicU32 (interpreted via f32::from_bits()). You pay the larger cost of atomic operations, but in exchange you can safely random-access the elements from any thread. This might be appropriate if the elements of z actually written to are very sparse.


By the way, your code has many names with leading _ that should not be there. Leading _ has a special meaning in Rust: it means “this item may be unused, but needs to exist anyway” — it's sort of a shorthand for an #[allow(unused)] attribute. It should not be used for any other purpose. In particular, “this is private/internal” is indicated by the lack of pub.
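A quick illustration of the lint behavior (hypothetical demo function):

```rust
fn demo() -> i32 {
    // The leading underscore only silences the "unused variable" lint;
    // it says nothing about visibility. Privacy is the absence of `pub`.
    let _scratch = 5; // no warning, even though it is never read
    let used = 7;
    used
}

fn main() {
    assert_eq!(demo(), 7);
}
```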

Thanks a lot for your rapid answer. I have tried this by modifying the function single_mul from

fn single_mul(k: isize, x: &Vector, y: &Matrix, z: &mut Matrix) {
    // ...
}

to

fn single_mul(k: isize, x: &Vector, y: &Matrix) -> Matrix {
    let mut z = Matrix::zeros(N, N);
    // ...
    z
}

but this seems to slow down the overall implementation by a lot (from ~350 µs to ~1.25 ms), probably because of the new memory allocations. Is there a way around it? Maybe the way I use fold is not correct either; I'm not sure:

xs.iter()
    .zip(ks.iter())
    .fold(z, |acc, (x, &k)| {
        *acc += single_mul(k, x, y);
        acc
    });

You're not doing any parallelism there. I meant that you should use par_iter() and Rayon's fold, and Matrix::zeros(N, N) should be the accumulator value of the fold, not allocated inside of every single_mul(). Something like this:

xs.par_iter()
    .zip(ks.par_iter())
    .fold(
        || Matrix::zeros(N, N), // this is called roughly once per thread
        |mut acc, (x, &k)| {
            acc += single_mul(k, x, y);
            acc
        })
    .reduce(|| Matrix::zeros(N, N), |z1, z2| z1 + z2);

I haven't tried to compile this code, so it probably contains an error.

(The final reduce() is needed because ParallelIterator::fold() produces multiple outputs, due to the work being split into parallel jobs without fold() knowing how to recombine them, so the reduce() does that. The documentation for fold() talks about this.)

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.