How do I parallelise this loop with Rayon? (iteration by non-contiguous blocks)

I'm trying to iterate over pixels in an image (a flat Vec<u16>) by blocks of 2x2 pixels. I have a function block_to_indices() that returns the indices of the pixels in the i-th block:

fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
	let row_floored = 2 * i / width;
	let row_i = row_floored * width;
	[
		row_i         + 2 * i, row_i         + 2 * i + 1,
		row_i + width + 2 * i, row_i + width + 2 * i + 1,
	]
}

For example, for an image 6 pixels wide, the 0-th block gives [0, 1, 6, 7].

I'm iterating over the blocks like this, appending the pixel values to another array, result, depending on whether any of the pixels in the 2x2 block are saturated:

let mut result: Vec<Vec<f32>> = vec![Vec::with_capacity(n_images); a[0].len()];
(0..vec.len() / 4).into_iter().for_each(|block_i| {
	let block = block_to_indices(width, block_i);
	if !is_saturated(vec, block, wl) {
		block
			.iter()
			.for_each(|&i| {
				let corrected_val = vec[i].saturating_sub(bl) as f32
					/ 2f32.powi(evs[image_i]) + bl as f32;
				result[i].push(corrected_val);
			})
	}
})

Now, I'm trying to parallelise this loop using Rayon. I tried just replacing into_iter() with into_par_iter(), but of course the issue is that I then cannot write to result from multiple threads.

I tried to reshape this loop into something that I can write using a map(), but that is a bit difficult. I could, instead of looping over block_i, loop over the individual pixels, and cook up a function that calculates the block_i from the pixel index (i.e. the inverse of block_to_indices), but that means I'm calling is_saturated 4x as often as is required.

Rayon does have par_chunks(), which allows you to iterate by chunks, but the problem is that in my case the chunks are non-contiguous, so I cannot use that. My blocks/chunks are non-overlapping though, so in principle writing to result[i] should be sound (i.e. no two threads are ever writing to the same element of result), even though I don't know how to prove that to the compiler.

Another method could be to map() and collect() the results per block, and then re-order them later, but I feel like the cost of the re-ordering (which needs to be single-threaded) would negate the benefits of parallelising the loop.

Any ideas? Is it possible to keep the above approach, but convince the borrow-checker in some way that the blocks are non-overlapping? Or, maybe using unsafe{}?

There's no practical way to do this kind of non-contiguous slicing in a way the borrow checker or Rayon accepts natively.

You can use unsafe code. If you choose to do so, it is critical that you convert from &mut to *mut (raw pointer) before any parallelism starts — you must not use &mut in an overlapping fashion. Once you have that working, consider wrapping it up in a FromParallelIterator implementation — it is good to isolate unsafe code into safe wrappers, for modularity and to make it clear what expectations the unsafe code has.

You can also achieve parallel writes by writing to a vector of AtomicU32s that contain the bits of your f32s. This may be less efficient, but it requires no unsafe code.
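
As a minimal sketch of that approach (the names and sizes are illustrative, and it assumes one preallocated slot per pixel, since atomics cannot push to a growing Vec):

use rayon::prelude::*;
use std::sync::atomic::{AtomicU32, Ordering};

fn main() {
	let n_pixels: usize = 8;
	// One atomic slot per pixel, initialised to 0.0's bit pattern.
	let out: Vec<AtomicU32> = (0..n_pixels).map(|_| AtomicU32::new(0f32.to_bits())).collect();

	// Threads store f32 bit patterns through shared references; no unsafe needed.
	(0..n_pixels).into_par_iter().for_each(|i| {
		let corrected_val = i as f32 * 0.5; // stand-in for the real correction
		out[i].store(corrected_val.to_bits(), Ordering::Relaxed);
	});

	// Recover the floats after the parallel phase.
	let result: Vec<f32> = out.iter().map(|a| f32::from_bits(a.load(Ordering::Relaxed))).collect();
	println!("{result:?}");
}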

Another option is to order your output so the blocks are contiguous, then reorder it into the needed order as a separate step. Or, maybe you can just parallel iterate over pairs of rows and that will be enough parallelism? Then within each row you can write using regular indexing.
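
To make the reorder idea concrete, here is a rough sketch of the collect-then-reorder variant (it reuses block_to_indices from the question, and skips the saturation test and the real correction for brevity): the parallel phase collects fixed-size per-block results in block order, and a cheap sequential pass scatters them back into pixel order.

use rayon::prelude::*;

fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
	let row_floored = 2 * i / width;
	let row_i = row_floored * width;
	[
		row_i         + 2 * i, row_i         + 2 * i + 1,
		row_i + width + 2 * i, row_i + width + 2 * i + 1,
	]
}

fn main() {
	let width = 6;
	let pixels: Vec<u16> = (0u16..36).collect();

	// Parallel phase: one fixed-size entry per block, produced in block order.
	let per_block: Vec<[f32; 4]> = (0..pixels.len() / 4)
		.into_par_iter()
		.map(|block_i| block_to_indices(width, block_i).map(|i| pixels[i] as f32))
		.collect();

	// Sequential phase: scatter the values back into pixel order.
	let mut result = vec![0f32; pixels.len()];
	for (block_i, vals) in per_block.iter().enumerate() {
		for (&i, &v) in block_to_indices(width, block_i).iter().zip(vals) {
			result[i] = v;
		}
	}
	println!("{result:?}");
}

Since each block produces a fixed-size [f32; 4], the scatter is a single linear pass; whether it actually negates the parallel speed-up is worth measuring rather than assuming.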

Thanks. About the unsafe route, I tried it like this:

let mut result: Vec<Vec<f32>> = vec![Vec::with_capacity(n_images); a[0].len()];
let pointer = result.as_mut_ptr();

(0..vec.len() / 4).into_iter().for_each(|block_i| {
	let block = block_to_indices(width, block_i);
	if !is_saturated(vec, block, wl) {
		block
			.iter()
			.for_each(|&i| {
				let corrected_val = vec[i].saturating_sub(bl) as f32
					/ 2f32.powi(evs[image_i]) + bl as f32;
				unsafe {
					(*pointer.add(i)).push(corrected_val);
				}
			})
	}
})

That works single-threaded, but when I change into_iter() into into_par_iter(), it fails:

error[E0277]: `*mut Vec<f32>` cannot be shared between threads safely
   --> src/main.rs:45:47
    |
45  |           (0..vec.len() / 4).into_par_iter().for_each(|block_i| {
    |                                              -------- ^--------
    |                                              |        |
    |  ____________________________________________|________within this `{closure@src/main.rs:45:47: 45:56}`
    | |                                            |
    | |                                            required by a bound introduced by this call
46  | |             let block = block_to_indices(width, block_i);
47  | |             if !is_saturated(vec, block, wl) {
48  | |                 block
...   |
57  | |             }
58  | |         })
    | |_________^ `*mut Vec<f32>` cannot be shared between threads safely

It seems I cannot pass the raw pointer between threads. How do I circumvent this? (maybe this is simple, but this is my first time writing unsafe code... 😉)

  • define a struct containing the pointer as a field
  • use unsafe impl Send and Sync on that struct
  • wrap the pointer in your struct
  • in the closure, take a reference to the whole struct at least once (not just to its pointer field), so the closure-capture algorithm captures the struct (or a reference to it) as a whole; see the sketch after this list
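
Putting those steps together, a minimal sketch with a toy computation (the SendPtr name is illustrative; the safety argument rests on no two iterations ever touching the same index):

use rayon::prelude::*;

// Steps 1-2: a wrapper struct that asserts the pointer may cross threads.
// SAFETY contract: no two threads may ever write through the same index.
struct SendPtr(*mut Vec<f32>);
unsafe impl Send for SendPtr {}
unsafe impl Sync for SendPtr {}

fn main() {
	let mut result: Vec<Vec<f32>> = vec![Vec::new(); 8];
	let len = result.len();
	// Step 3: wrap the raw pointer before the parallel loop starts.
	let ptr = SendPtr(result.as_mut_ptr());

	(0..len).into_par_iter().for_each(|i| {
		// Step 4: reference the whole struct so the closure captures `&ptr`,
		// not just its (non-Send) raw-pointer field.
		let ptr = &ptr;
		// SAFETY: each index i is visited by exactly one iteration.
		unsafe { (*ptr.0.add(i)).push(i as f32) };
	});

	println!("{result:?}");
}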

You can use the rayon feature of ndarray.

Here is the ndarray code to create the iterator: Rust Playground

Sadly the playground does not have the rayon feature enabled. If you do enable it, it should be enough to put an into_par_iter() in there. I would suggest making the outer iterator parallel to avoid false sharing when writing.
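
Since the playground contents aren't reproduced here, a rough sketch of what the parallel version might look like with the rayon feature enabled (the image data is illustrative):

use ndarray::{ArrayViewMut2, Axis};
use ndarray::parallel::prelude::*;

fn main() {
	let width = 6usize;
	let mut data: Vec<u16> = (0u16..36).collect();
	let mut img = ArrayViewMut2::from_shape((data.len() / width, width), &mut data[..]).unwrap();

	// Parallelise over pairs of rows (the outer axis) to avoid false sharing.
	img.axis_chunks_iter_mut(Axis(0), 2)
		.into_par_iter()
		.for_each(|mut row_pair| {
			// Within a pair of rows, walk the 2x2 blocks sequentially.
			row_pair
				.axis_chunks_iter_mut(Axis(1), 2)
				.for_each(|block| println!("{block:?}"));
		});
}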

Note that the block_to_indices code above seems to yield repeated indices if width is odd.

Or like this, without ndarray:

data.par_chunks_exact_mut(2 * width).for_each(|chunk| {
	// `chunk` is a contiguous pair of rows; split it into top and bottom.
	let (p1, p2) = chunk.split_at_mut(width);
	// Zipping 2-wide chunks of both rows yields each 2x2 block as two
	// mutable slices, so regular indexing works with no unsafe.
	p1.chunks_exact_mut(2)
		.zip(p2.chunks_exact_mut(2))
		.for_each(|(r1, r2)| println!("[{r1:?}, {r2:?}]"));
});