Turning mutable grid access into thread-safe grid access (stencils)

I have the following SyncedGrid which has thread-safe Cells.

use std::sync::{Arc, Mutex};

struct Cell {
    i: i64,
}

type LockedCell = Mutex<Cell>;

struct SyncedGrid {
    cells: Vec<Arc<LockedCell>>,
}

However, in my code I would like to use a Grid

struct Grid {
    cells: Vec<Cell>,
}

and first move (wrap) it efficiently, with no copies, into a SyncedGrid for use by parallel threads.
Is there an efficient way of doing something like


let grid = Grid { cells: vec![Cell { i: 1 }, Cell { i: 3 }] };

let grid_synced = wrap_to_parallel_grid(grid);
do_parallel_computing(&grid_synced);
let grid = unwrap_parallel_grid(grid_synced);

I guess wrap_to_parallel_grid needs to allocate memory again to copy the cells into the Arc<Mutex<...>>.
Or can I somehow provide another SyncedGrid (a proxy trait) which only references the moved original grid but can return locked cells in the form Arc<Mutex<&Cell>>?
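I imagine wrap_to_parallel_grid would look something like this sketch (reusing Grid, SyncedGrid and the imports from above); it allocates one Arc<Mutex<_>> per cell, which is exactly the copying I would like to avoid, and unwrapping only works when no other Arc clones remain:

fn wrap_to_parallel_grid(grid: Grid) -> SyncedGrid {
    SyncedGrid {
        // one new allocation per cell
        cells: grid
            .cells
            .into_iter()
            .map(|c| Arc::new(Mutex::new(c)))
            .collect(),
    }
}

fn unwrap_parallel_grid(synced: SyncedGrid) -> Grid {
    Grid {
        cells: synced
            .cells
            .into_iter()
            // fails if another Arc clone is still alive somewhere
            .map(|a| Arc::try_unwrap(a).ok().unwrap().into_inner().unwrap())
            .collect(),
    }
}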

If you tried to have a mutex on each individual Cell, it'd probably kill performance. There's no cheap way to convert between Vec<Mutex<Cell>> and Vec<Cell>.

vec.chunks_mut() will give you independent slices that each thread can work on. This can be combined with scoped threads to do the work without any further copying/locking/refcounting (&mut is an exclusive lock, so you don't need Mutex if you have a &mut).
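For example, with scoped threads (a sketch, reusing the Cell struct from above; the per-cell work is a placeholder):

use std::thread;

fn do_parallel(cells: &mut [Cell], n_threads: usize) {
    let chunk_len = ((cells.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        for chunk in cells.chunks_mut(chunk_len) {
            // every thread gets its own exclusive &mut slice: no Mutex, no Arc
            s.spawn(move || {
                for cell in chunk {
                    cell.i += 1; // placeholder for the real computation
                }
            });
        }
    });
}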

And rayon does this as a one-liner, so you can have a Vec<Cell> and do:

cells.par_iter_mut().for_each(|cell| do_computing_on_one(cell))

or

cells.par_chunks_mut(batch_size)
  .for_each(|some_cells| do_computing_on_a_batch(some_cells))

The latter works better if the computation is cheap and SIMD-able.

BTW: watch out for naming collision with std::cell::Cell which is inherently single-threaded.

Thanks, that's a very good answer, but I think it might not apply when my work is not embarrassingly parallel; the thread will need to lock neighbor cells etc.

But probably I can color the grid cells into independent sets (a distance-1 coloring),
put all cells with the same color in one batch, and use slice::get_many_mut(...) or something with rayon.
But that is somehow flawed because get_many_mut(...) needs a compile-time-sized array?

I mean this function: slice::get_many_mut

get_many_mut won't be useful. That's for vec[a], vec[b], vec[c] as get_many_mut([a, b, c]). There's split_at_mut and chunks_mut for variable-sized slices. Indexing individual &mut cells directly won't be possible across threads, because you can't share a single mutable object across threads - you have to split it (or lock).
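split_at_mut is the basic primitive here; it turns one exclusive borrow into two disjoint ones (a sketch):

fn halves<T>(cells: &mut [T]) -> (&mut [T], &mut [T]) {
    let mid = cells.len() / 2;
    // both halves are independent &mut slices and can go to different threads
    cells.split_at_mut(mid)
}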

If you need access to neighboring cells, can you do them e.g. in pairs or groups of 3 rows?

let mut rows = cells.chunks_mut(row_size).collect::<Vec<_>>();
rows.par_chunks_mut(3).for_each(|three_rows| …);

Locking neighboring cells is likely to cause a lot of lock contention, and a lot of cache thrashing between threads, so it's best to split data into large separate chunks exclusive to each thread. Even if you need to work on e.g. only odd ones first, then even ones in a second pass.

If the cells are numeric, and you can't split the data, use AtomicI32 instead of Mutex<i32>, and there's the windows() iterator that may be handy for accessing neighbors in an iterator.
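A sketch of the atomic variant with rayon's par_windows (the windows overlap, so the order of updates is nondeterministic; atomics make this sound, but it only fits if that's acceptable for your algorithm):

use rayon::prelude::*;
use std::sync::atomic::{AtomicI32, Ordering};

fn add_right_neighbor(cells: &[AtomicI32]) {
    cells.par_windows(2).for_each(|w| {
        // shared access is fine: atomics allow mutation without &mut
        let nb = w[1].load(Ordering::Relaxed);
        w[0].fetch_add(nb, Ordering::Relaxed);
    });
}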

Thanks, ok.

The problem is: in this 2D grid, for a parallel func(a, nb1, nb2) I need to get 3 cells: a and two neighbors (the next in x and the next in y direction).
They are all inside cells: Vec<Cell>. How to assemble these triples of (&mut Cell, &mut Cell, &mut Cell) inside a Vec<(&mut Cell, &mut Cell, &mut Cell)> in one
iterator pass over the grid? With split_at_mut etc. it is really hard; I cannot wrap my head around it.

I can think of a solution with Mutex etc. where I have two passes such that there is no lock contention (basically not needing the mutex), but the solution with &mut is so hard to pull off, even though it would be so nice.

Basically, if I could end up with a Vec like

[&mut a, &mut n1, &mut n2,     ....    &mut b, &mut n1, &mut n2,    ...]

where the same cell appears both as some a and as a neighbor of another stencil, then there definitely are duplicated mutable refs, which is kind of wrong. Would Rust's borrow checker somehow allow this?

I guess that assembly will not be possible with the slice/std machinery,
because in other words it would mean I violate Rust's borrowing rules?

It's funny, because I can actually do

let mut refs: [Vec<&mut Cell>; 2] = [vec![], vec![]];

grid.cells.iter_mut().enumerate().for_each(|(i, cell)| {
    refs[i % 2].push(cell);
});

to split the mutable refs. But somehow duplicating them like I need is kind of impossible, I guess.
It's really magic how the borrow checker prohibits this in all sorts of ways when I try.

One of the cornerstones of Rust's ownership and borrowing foundation is that &muts are not aliased -- they are exclusive borrows. Something else being able to observe what you can mutate behind a &mut is instant UB.

So two active &mut to the same data at the same time would be UB, and safe Rust has no UB. That's why the compiler keeps finding ways to keep you from doing that.
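A tiny illustration of what the compiler keeps rejecting:

fn main() {
    let mut v = vec![1, 2, 3];
    let a = &mut v[0];
    let b = &mut v[0]; // error[E0499]: cannot borrow `v` as mutable more than once at a time
    *a += *b;
}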


The choice of Cell for your custom struct is a little unfortunate, as that's a standard library type. It's one that features "interior mutability" or "shared mutability" -- the ability to mutate through a shared reference. (It's built upon a special UnsafeCell that does not have the same aliasing constraints as &mut.)

Perhaps what you need is to go from &mut to &std::cell::Cell. std::cell::Cell is not Sync though, so you'd have to do this conversion after farming out non-overlapping &mut [T] slices to different threads.

Given what you describe though, I'm not sure that will work; you may not be able to do the non-overlapping slices simultaneously.
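For reference, the std conversion is Cell::from_mut plus as_slice_of_cells (the wrapper function is just for illustration):

use std::cell::Cell;

fn as_cells(slice: &mut [i64]) -> &[Cell<i64>] {
    // &mut [T] -> &Cell<[T]> -> &[Cell<T>]
    Cell::from_mut(slice).as_slice_of_cells()
}

After the conversion, aliasing within one thread is fine: two handles to the same element can both call .get() and .set().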


You say you need to work on a, x, and y at the same time:

diag:   0   1   2   3
       /   /   /   /
    +---+---+---+---+
    |   | a | x |   | ...    a is in diagonal 1
    +---+---+---+---+        x and y are in diagonal 2
    |   | y |   |   | ...
    +---+---+---+---+
      :

Is data uniformly flowing in any particular direction, e.g. a is updated based on x and y, but (during this update) x and y don't change? Or vice versa?

If so, there may be a way to arrange your data into a sort of pipeline, where diagonal k is handed a shared reference to diagonal k+1, diagonal k updates itself, and then diagonal k is used to update diagonal k-1.

I've built such a thing with a lending iterator pattern; I'd have to ponder how to parallelize it.

Thanks, very enlightening!

Let's refer to my unfortunate choice of Cell as C for now.

Your picture is correct. Basically you can split this task of iterating over all cells into a checkerboard: the black cells can be done in one parallel sweep, and the white cells in another parallel sweep (distance-1 neighborhood). So considering this and your input, might the following work?

My computation is from GitHub - gabyx/RustoFluid: An Eulerian fluid simulation written in Rust to learn the language,

where I have to compute two things for a:

  1. the divergence: a.div = x.u - a.u + y.v - a.v
  2. the update of a.u, a.v and of x.u and y.v

Method 1

  • I convert the whole grid.cells into two distinct parallel sets of references, black and white.
  • Run 1): in parallel over the mutable black references: one black reference a uses the list of all white immutable references to get its correct neighbors, computes a.div, and also does the update of a.u and a.v from 2).
  • Run the rest of 2): one black reference a uses the whole list of white mutable references to get the neighbors and update x.u and y.v.

Not sure if that works.

That certainly seems plausible. If you go with this approach, it might make sense to store your data as two separate grids, one for the black cells and another for the white:

black                               white
+-----+-----+-----+-----+           +-----+-----+-----+-----+
| 0,0 | 0,2 | 0,4 | 0,6 | ...       | 0,1 | 0,3 | 0,5 | 0,7 | ... 
+-----+-----+-----+-----+           +-----+-----+-----+-----+
| 1,1 | 1,3 | 1,5 | 1,7 | ...       | 1,0 | 1,2 | 1,4 | 1,6 | ...       
+-----+-----+-----+-----+           +-----+-----+-----+-----+
| 2,0 | 2,2 | 2,4 | 2,6 | ...       | 2,1 | 2,3 | 2,5 | 2,7 | ... 
+-----+-----+-----+-----+           +-----+-----+-----+-----+
| 3,1 | 3,3 | 3,5 | 3,7 | ...       | 3,0 | 3,2 | 3,4 | 3,6 | ...       
+-----+-----+-----+-----+           +-----+-----+-----+-----+
   :                                   :
   :                                   :

That way, &white can be shared with concurrent updates of &mut black[...] and vice versa.
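A sketch of one half-sweep under that layout (field names taken from your description; the neighbor lookup is simplified to "same index in the other grid"):

use rayon::prelude::*;

struct C { u: f64, v: f64, div: f64 }

// update all black cells in parallel while only reading white cells
fn sweep_black(black: &mut [C], white: &[C]) {
    black.par_iter_mut().enumerate().for_each(|(i, b)| {
        let nb = &white[i]; // real code: map (row, col) to the right white index
        b.div = nb.u - b.u + nb.v - b.v;
    });
}
// ...then run the same with the roles of black and white swapped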

I did some rough tests to see how it plays out:

use std::time::Duration;

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use rayon::prelude::*;

#[derive(Clone, Debug)]
struct C {
    i: usize,
}

#[derive(Clone, Debug)]
struct G {
    cells: Vec<C>,
}

fn run_grid_single(n: usize, grid: &mut G) {
    let l = grid.cells.len() - 1;

    for _ in 0..n {
        for i in 0..grid.cells.len() {
            grid.cells[i].i += i + grid.cells[(i + 1).min(l)].i
        }
    }
}

fn run_grid_parallel(n: usize, grid: &mut G) {
    let mut black: Vec<&mut C> = vec![];
    let mut white: Vec<&mut C> = vec![];

    grid.cells.iter_mut().enumerate().for_each(|(i, c)| {
        if i % 2 == 0 {
            black.push(c);
        } else {
            white.push(c)
        }
    });

    for _ in 0..n {
        let mut l = white.len() - 1;
        black
            .par_iter_mut()
            .enumerate()
            .for_each(|(i, c)| c.i += i + white[i.min(l)].i);

        l = black.len() - 1;
        white
            .par_iter_mut()
            .enumerate()
            .for_each(|(i, c)| c.i += i + black[i.min(l)].i);
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    let mut grid = G {
        cells: vec![C { i: 3 }; 100 * 100],
    };

    let mut group = c.benchmark_group("Grid Single vs. Parallel");
    group.measurement_time(Duration::from_secs(20));

    for i in [10, 50, 100].iter() {
        group.bench_with_input(BenchmarkId::new("Single", i), i, |b, i| {
            b.iter(|| run_grid_single(*i, &mut grid))
        });

        group.bench_with_input(BenchmarkId::new("Parallel", i), i, |b, i| {
            b.iter(|| run_grid_parallel(*i, &mut grid))
        });
    }

    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

You may be interested in this article which created a custom abstraction that allows selectively accessing non-contiguous parts of memory in parallel:

https://blog.rom1v.com/2019/04/implementing-tile-encoding-in-rav1e/

Another possibility is to arrange your data in the order it will be accessed:

                     Typically stored as:
+---+---+---+---+    +---+---+---+---+---+---+
| A | B | C | D |    | A | B | C | D | E | F | ...
+---+---+---+---+    +---+---+---+---+---+---+
| E | F | G | H |    Store by diagonal instead:
+---+---+---+---+    +---+---+---+---+---+---+
| I | J | K | L |    | A | E | B | I | F | C | ...
+---+---+---+---+    +---+---+---+---+---+---+

This may enable using slice iterators and increase cache locality during calculation. (But it is a tradeoff if you need to access row- or column-wise later, it means more complicated index calculations, and it's just not what we're used to.)

...I guess implicit in what I just wrote is that you would be updating a diagonal at a time on one thread, versus calculating every cell independently, for instance. I imagine coordination curtails the benefit of parallelization at some point (but of course, always measure).
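If it helps, enumerating one anti-diagonal of an h x w row-major grid is straightforward (a sketch, assuming w >= 1):

// cells on diagonal d satisfy row + col == d, for d in 0..(h + w - 1)
fn diagonal_indices(d: usize, h: usize, w: usize) -> impl Iterator<Item = usize> {
    let r_start = d.saturating_sub(w - 1);
    let r_end = d.min(h - 1);
    (r_start..=r_end).map(move |r| r * w + (d - r))
}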

Oh, very interesting. I think I just stumbled over this issue: I try hard to somehow slice my
Vec<C> in a way that lets me then work on the items in parallel. But it's just really hard.
I am not sure why this checkerboard splitting even worked:

    let mut black: Vec<&mut C> = vec![];
    let mut white: Vec<&mut C> = vec![];

    grid.cells.iter_mut().enumerate().for_each(|(i, c)| {
        if i % 2 == 0 {
            black.push(c);
        } else {
            white.push(c)
        }
    });

My parallel work stencil looks basically like:

+---+  
| Y | 
+---+---+
| A | X | 
+---+---+

so I thought of generating lots of these (non-overlapping) items (stencils):

struct Stencil<'a> {
    cell: &'a mut C,
    nbs: Vec<&'a mut C>,
}

But the problem is just

let cells = vec![ ..... ];

cells.iter_mut().iterate_by_L_stencil_over_grid() // hypothetical iterator
.par_iter().chunks(3).map(|w| {
            let a: &mut C = w[0];
            let x: &mut C = w[1];
            let y: &mut C = w[2];
            return Stencil {
                cell: a,
                nbs: vec![x, y],
            };
})

But as a newbie I am out of luck coming up with this iterator, which would have to iterate differently through the 1-d cells that store all grid cells in column-major order. I think that just does not work.

I assume you have a 2D grid. In that case i % 2 == 0 is not a checkerboard, unless the grid happens to have an odd width. You'll need (x ^ y) & 1 == 0.
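A checkerboard split that accounts for the width could look like this (a sketch; width is the number of columns):

fn split_checkerboard<T>(cells: &mut [T], width: usize) -> (Vec<&mut T>, Vec<&mut T>) {
    let (mut black, mut white) = (vec![], vec![]);
    for (i, c) in cells.iter_mut().enumerate() {
        let (x, y) = (i % width, i / width);
        if (x ^ y) & 1 == 0 { black.push(c) } else { white.push(c) }
    }
    (black, white)
}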


            let a: &mut C = w[0];
            let x: &mut C = w[1];
            let y: &mut C = w[2];

This is where Rust is not smart enough and can't understand that 0, 1, 2 are different. The first use of w is exclusive for all of w. That's why get_many_mut exists, or:

let [a, x, y] = w else { panic!("bad length") };

Regarding

nbs: Vec<&mut C>,

this is technically valid, but extremely inefficient. If there are always 2 elements, use [&mut C; 2].

Totally agree, the example was not really correctly implemented. Just a trial & error thing.

I have now implemented the Stencil solution with the idea from https://blog.rom1v.com/2019/04/implementing-tile-encoding-in-rav1e/
(also unsafe, but that's probably OK). It compiles,
but I can't get it to work with rayon; I'm not sure what the problem is with the struct PosStencilMut:

use criterion::{
    criterion_group, criterion_main, AxisScale, BenchmarkId, Criterion, PlotConfiguration,
};
use itertools::Itertools;
use rayon::prelude::*;
use std::{marker::PhantomData, time::Duration};

type Index2 = nalgebra::Vector2<usize>;

#[macro_export]
macro_rules! idx {
    ($x:expr, $($y:expr),+ ) => {
        Index2::new($x, $($y),+)
    };
}

#[derive(Clone, Debug)]
pub struct C {
    pub i: usize,
    pub index: Index2,
}

#[derive(Clone, Debug)]
pub struct G {
    pub cells: Vec<C>,
    pub dim: Index2,
}

pub trait GetIndex {
    fn index(&self) -> Index2;
}

impl GetIndex for C {
    fn index(&self) -> Index2 {
        return self.index;
    }
}

pub struct PosStencilMut<'a, T: 'a>
where
    T: GetIndex,
{
    cells: *mut T,
    dim: Index2,
    phantom: PhantomData<&'a mut T>,
}

impl<'a, T: GetIndex> PosStencilMut<'a, T> {
    fn cell(&self) -> &'a T {
        return unsafe { &*self.cells };
    }

    fn cell_mut(&mut self) -> &'a mut T {
        return unsafe { &mut *self.cells };
    }

    fn neighbor(&mut self, dir: usize) -> Option<&'a mut T> {
        assert!(dir <= 1 && self.dim.x >= 1 && self.dim.y >= 1);

        if self.cell().index()[dir] >= self.dim[dir] - 1 {
            return None;
        }
        let offset = if dir == 0 { 1 } else { self.dim[0] };
        return Some(unsafe { &mut *self.cells.add(offset) });
    }
}

pub fn pos_stencils_mut<T: GetIndex>(
    data: &mut Vec<T>,
    dim: Index2,
    min: Option<Index2>,
    max: Option<Index2>,
) -> impl Iterator<Item = PosStencilMut<'_, T>> {
    let min = min.unwrap_or(Index2::zeros());
    let max = max.unwrap_or(dim - idx!(1, 1));

    assert!(min >= Index2::zeros() && max < dim);

    return (min[0]..max[0])
        .step_by(2)
        .cartesian_product((min[1]..max[1]).step_by(2))
        .map(move |(i, j)| PosStencilMut {
            cells: &mut data[i + j * dim[0]],
            dim,
            phantom: PhantomData,
        });
}

fn run_grid_parallel(n: usize, grid: &mut G) {
    let mut stencils: Vec<_> = pos_stencils_mut(&mut grid.cells, grid.dim, None, None).collect();

    for _ in 0..n {
        stencils.iter_mut().for_each(|stencil| {
            stencil.cell_mut().i += match stencil.neighbor(0) {
                Some(c) => c.i,
                None => 0,
            } + match stencil.neighbor(1) {
                Some(c) => c.i,
                None => 0,
            }
        })
    }
}

fn run_grid_single(_n: usize, _grid: &mut G) {
    // intentionally left empty in this rough test
}

fn criterion_benchmark(c: &mut Criterion) {
    let dim = idx!(100, 100);

    let cell_generator = (0..dim[0]).cartesian_product(0..dim[1]).map(|(i, j)| C {
        i: i + j,
        index: idx!(i, j),
    });

    let mut grid = G {
        cells: Vec::from_iter(cell_generator),
        dim,
    };

    let mut group = c.benchmark_group("Grid Single vs. Parallel");
    group.measurement_time(Duration::from_secs(20));
    group.plot_config(PlotConfiguration::default().summary_scale(AxisScale::Linear));

    for i in [10, 40, 100, 1000].iter() {
        group.bench_with_input(BenchmarkId::new("Single", i), i, |b, i| {
            b.iter(|| run_grid_single(*i, &mut grid))
        });

        group.bench_with_input(BenchmarkId::new("Parallel", i), i, |b, i| {
            b.iter(|| run_grid_parallel(*i, &mut grid))
        });
    }
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

When I add

unsafe impl<'a, T: GetIndex> Send for PosStencilMut<'a, T> {}
unsafe impl<'a, T: GetIndex> Sync for PosStencilMut<'a, T> {}

it also compiles with par_iter_mut(). Is that correct?

You're going in the right direction, but there are some issues.

In Rust there's a very important rule that there can't be, under any circumstances, even for a second, even in badly-written code, two usable &mut references to the same place.

Your implementation breaks this rule by allowing [stencil.cell_mut(), stencil.cell_mut()]. This is because you have:

fn cell_mut(&mut self) -> &'a mut T

instead of:

fn cell_mut(&mut self) -> &mut T
// same as:
fn cell_mut<'just_this_one_call>(&'just_this_one_call mut self) -> &'just_this_one_call mut T

When &'a mut T has a lifetime separate from self, it is allowed to be used independently of self, as if it borrowed directly from somewhere else denoted by 'a, bypassing your PosStencilMut abstraction. If you remove the lifetime, then it borrows from PosStencilMut and "locks" it exclusively while the returned reference is in use.

BTW the third option of fn cell_mut(&'a mut self) -> &'a mut T would be overly restrictive, meaning that cell_mut can be only called once ever, because it locks it for as long as the source of 'a exists, and PosStencilMut can't outlive it, so that single call would have to borrow PosStencilMut for its maximum possible lifetime.
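To see the difference in miniature (a self-contained toy, not your stencil type):

struct S { x: i32 }

impl S {
    // implied signature: fn get_mut<'call>(&'call mut self) -> &'call mut i32
    fn get_mut(&mut self) -> &mut i32 { &mut self.x }
}

fn main() {
    let mut s = S { x: 0 };
    let a = s.get_mut();
    *a += 1;             // `s` stays borrowed only while `a` is in use
    let b = s.get_mut(); // fine: the loan behind `a` has already ended
    *b += 1;
    // using `a` again down here would make the two loans overlap: rejected
}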

Thanks a lot.
That was really enlightening. Lifetimes are still hard on my nerves; I bluntly applied them but do not really know how they work. So the correct thing would be:


impl<'a, T: GetIndex> PosStencilMut<'a, T> {
    fn cell(&self) -> &T {}
    fn cell_mut(&mut self) -> &mut T {}
    fn neighbor(&mut self, dir: usize) -> Option<&mut T> {}
}

So I assume the 'a above is a lifetime placeholder which is bound to the lifetime of an instance of PosStencilMut (via the PhantomData?). What I don't get is:

impl<'a, T: GetIndex> PosStencilMut<'a, T> {
    fn cell_mut(&mut self) -> &'a mut T {}

The 'a is put onto PosStencilMut, but in cell_mut it acts as a different lifetime?
In other words, when we have fn cell_mut(&'a mut self) -> &'a mut T {}, what does that 'a have in common with the PosStencilMut<'a, T> in the impl line? I understand that Rust only uses these lifetime parameters to compare them against each other... (?)

You're building an abstraction for a custom type of reference here, so you've stepped into a more advanced use of lifetimes.

The 'a on PosStencilMut<'a> is required to anchor it to its original data source, so that the user can't keep it forever, past the lifetime of its data source:

let cells = vec![…];
let stencil = cells.as_stencil();
drop(cells);
make_use_of(stencil); // use of a dangling pointer is prevented by the lifetime

If it was PosStencil without mut, and only supported shared references, then fn cell(&self) -> &'a T {} would actually be fine and quite useful, because it gives you the extra flexibility to do something like this:

let cells = vec![…];
let stencil = cells.as_stencil();
let some_cell = stencil.cell();
drop(stencil);
make_use_of(some_cell);

which is analogous to the object you get from vec.iter(). With shared references [stencil.cell(), stencil.cell()] is fine.

But because of &mut's absolute exclusivity, you have to prevent access to other references while the exclusive &mut is in use. Due to aliasing rules, [stencil.cell_mut(), stencil.cell_mut()] or [stencil.cell_mut(), stencil.cell()] is not allowed. The way to prevent such use is to also borrow the stencil itself so that nothing else can be called on it.

Each call to a &mut self method creates a new loan with a new lifetime when the call is made. If you make the result of the call have this lifetime (which is the implied default when you don't write any 'a), then self remains borrowed for as long as the lifetime of the result needs to be. Then [stencil.cell_mut(), stencil.cell()] will give you an error that the first call still borrows the stencil, so the second call can't happen. But stencil.cell_mut(); stencil.cell() is valid, because the loan from the first call can end at the ; and "unlock" the stencil for use with other methods.
