Turning mutable grid access into a thread-safe grid access (stencils)

Heads up: I didn't try (or even look at yet) your code myself, so this is just spitballing.


If I understand your stencils correctly, you do four passes like so.

+---+---+---+---  1st pass: ABDE and contiguously from there
| A | B | C | ..
+---+---+---+---  2st pass: BCEF and contiguously from there
| D | E | F | ..
+---+---+---+---  3rd pass: DEGH and contiguously from there
| G | H | I | ..
+---+---+---+---  4th pass: EFHI and contiguously from there
| : | : | : | `.

Instead of a stencil per unit of (parallel) work, you could send a pair of rows as a unit of work, with two total passes. The first pass would handle ABDE and BCEF stencils (no vertical offset) and the second would handle DEGH and EFHI stencils.


Moreover, you can handle both stencils in once pass (within a given unit of work) if you use a lending iterator pattern instead of an Iterator:

+---+---+---+---+---+---+---+---+
| A | B | C | D | E | F | G | H |
+---+---+---+---+---+---+---+---+
| S | T | U | V | W | X | Y | Z |
+---+---+---+---+---+---+---+---+
  • Do AB.ST
  • Hand out BCD.TUV and do CD.UV then BC.TU
  • Hand out DEF.VWX and do EF.WX then DE.VW
  • Hand out FGH.XYZ and do GH.YZ then FG.XY
  • Do H.Z

If you label the first pass stencils 0 2 4 6 and the second pass, 1 3 5 7 (i.e. by offset), you're doing

  • 0 21 43 65 7

That is, always doing the two first-pass stencils (2, 4) before the second-pass stencil (3) that they overlap, in a pipeline fashion.

But I don't know if that would be faster or slower than two passes with a normal Iterator.

1 Like

@quinedot : Thats all correct what you said, thanks for the idea I guess I just simply must make more work available to the cores. I guess the first solution doenst require too much change and I will think about parallelizing over „n/cores“-rows (x-major storage) and then reuse the stencil iterator as a non-parallel iterator over the row slice. But I still need to do this 4 times with the same shifts as you described but the load hopefully increases because one parallel iterator encompasses work for n/cores rows.

I have to dig into lending iterator pattern, not sure what you mean with it?

An Iterator allows consuming the iterator in order to collect all the items at once (.collect()), so the items must be able to outlast the iterator. A lending iterator, on the other hand, hands out borrows of itself:

fn next<'a>(&'a mut self) -> Option<Self::Item<'a>>

So it can hand out borrows of items it contains (loans) -- but the consumer of the iterator can't hold on to more than one item at a time, due to the lifetime constraints (you have to get rid of the item before you can call next again).

It's necessary in this case because in the pattern I described, the items overlap in memory -- so having two of them at the same time would be aliasing of a &mut and instant UB. (Alternatively you can use interior mutability and the normal Iterator trait.)

Overlapping immutable borrows are ok, so we have windows, but the "overlapping &mut are UB" is why we don't have windows_mut. The pattern I described is a variation on windows_mut.

The way you do it (without making your own lending iterator traits -- there is no such thing in std yet) is to have some lending iterator struct with a next method like I showed above:

struct DoubleStencil<'a> { /* ... */ }
impl DoubleStencil<'_> {
    fn next(&mut self) -> Option<Stencil<'_>> { /* ... */ }
}

And then you use it like so:

while let Some(stencil) = double_stencil.next() { /* ... */ }

Or perhaps

impl DoubleStencil<'_> {
    // Or `&mut self` if you want to be able to return a "remainder"
    // ala `ChunksExactMut`
    // https://doc.rust-lang.org/std/slice/struct.ChunksExactMut.html#method.into_remainder
    fn for_each<F: FnMut(Stencil<'_>)>(mut self, f: F)  {
        while let Some(stencil) = self.next() {
            f(stencil);
        }
    }
}