Fill submatrices in parallel: what's the best practice?

Hi rust programmers!

I'm a newcomer to Rust. The background of this problem is somewhat related to scientific programming.

The task is to fill a matrix (or n-dimensional tensor) block by block from submatrices (subtensors). For example, the final matrix is

-1 | 11 | 12 12
---------------
-1 | 21 | 22 22
-1 | 21 | 22 22
---------------
-1 | -1 | -1 -1

I would like to fill this matrix 1) in parallel, 2) in place, and 3) as efficiently as possible (ndarray is probably not the best choice for efficiency in this specific task). Since a matrix (tensor) is not a one-dimensional vector, I don't think simply splitting it into contiguous segments fits this case.

A code snippet (which does not compile, because `result` cannot be borrowed as mutable inside the closure) could be

use rayon::prelude::*; // 1.10.0

fn main() {
    let mut result: Vec<i32> = vec![-1; 16];
    let stride_x = vec![0, 1, 3];  // not known in production
    let stride_y = vec![1, 2, 4];  // not known in production
    
    (0..4).into_par_iter().for_each(|ab| {  // 0..(stride_x.len()-1)*(stride_y.len()-1)
        let a = ab / 2;  // ab / (stride_y.len()-1)
        let b = ab % 2;  // ab % (stride_y.len()-1)
        let len_stride_x = stride_x[a+1] - stride_x[a];
        let len_stride_y = stride_y[b+1] - stride_y[b];
        // following is stride-dependent, computation bounded in production
        let val = vec![(10 * (a + 1) + (b + 1)) as i32; len_stride_x * len_stride_y];
        // write in place: this indexing of `result` is what the compiler
        // rejects ("cannot borrow `result` as mutable")
        for i in stride_x[a]..stride_x[a+1] {
            for j in stride_y[b]..stride_y[b+1] {
                let i_ = i - stride_x[a];
                let j_ = j - stride_y[b];
                result[4*i + j] = val[i_ * len_stride_y + j_];
            }
        }
    });
    
    // output of this demo
    for i in 0..4 {
        println!("{:?}", &result[i*4..i*4+4])
    }
}

I'm looking forward to any idiomatic ("rustic") solution that could be found. Unsafe Rust is also acceptable if that's necessary. Thanks in advance!

The simple case of concurrently filling a vector is when each of your subsections of the container is contiguous. In that case you can use slice::split_at_mut() repeatedly to obtain non-overlapping mutable references to each slice you need.
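
For a concrete illustration, here is a minimal sketch (assuming rayon as the only dependency, and using a made-up two-region split rather than your block layout): split_at_mut hands out two non-overlapping &mut slices, which can then be filled from different threads.

fn main() {
    let mut data = vec![-1i32; 16];
    // Two disjoint, contiguous regions: the first and second half of the buffer.
    let (left, right) = data.split_at_mut(8);
    // Each half is an independent `&mut [i32]`, so filling them in parallel is safe.
    rayon::join(|| left.fill(1), || right.fill(2));
    println!("{:?}", data);
}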

In your application you probably do not want to reorganize the result matrix to do that. So, that does leave you with using unsafe code for the indexing. If you do so, the thing to keep in mind is that the intermediate stages must be done with raw pointers because they contain overlap. You get a slice pointer from the destination Vec, index it from your parallel tasks (perhaps using a "submatrix" wrapper type that can provide a safe API), and once you have got a pointer to a single element (which you know doesn't overlap other elements being accessed elsewhere), you can then convert back to an &mut i32.
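
Here is a rough sketch of that shape of solution (the `MatrixPtr` type and its `write` method are names invented for this example, not an existing API; soundness rests entirely on the blocks being disjoint):

use rayon::prelude::*;

// Wrapper around a raw pointer into the matrix buffer. Raw pointers are neither
// Send nor Sync by default; we assert both because the disjoint-block invariant
// rules out data races.
struct MatrixPtr {
    ptr: *mut i32,
    cols: usize,
}
unsafe impl Send for MatrixPtr {}
unsafe impl Sync for MatrixPtr {}

impl MatrixPtr {
    // Safety: (row, col) must be in bounds, and no other thread may access the
    // same element concurrently.
    unsafe fn write(&self, row: usize, col: usize, value: i32) {
        *self.ptr.add(row * self.cols + col) = value;
    }
}

fn main() {
    let mut result: Vec<i32> = vec![-1; 16];
    let m = MatrixPtr { ptr: result.as_mut_ptr(), cols: 4 };
    let stride_x = vec![0, 1, 3];
    let stride_y = vec![1, 2, 4];

    (0..4).into_par_iter().for_each(|ab| {
        let (a, b) = (ab / 2, ab % 2);
        for i in stride_x[a]..stride_x[a + 1] {
            for j in stride_y[b]..stride_y[b + 1] {
                // Each (i, j) lies in exactly one block, so the writes never race.
                unsafe { m.write(i, j, (10 * (a + 1) + (b + 1)) as i32) };
            }
        }
    });

    for i in 0..4 {
        println!("{:?}", &result[i * 4..i * 4 + 4]);
    }
}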

However, ndarray already provides multidimensional splitting: https://docs.rs/ndarray/latest/ndarray/type.ArrayViewMut.html#method.split_at. So, you should try using it. I don't see any reason to expect that it would be slower than writing your own code.
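
A minimal sketch of what that can look like (assuming ndarray and rayon as dependencies; for more than two blocks you would split the resulting views again):

use ndarray::{Array2, Axis};

fn main() {
    let mut result = Array2::<i32>::from_elem((4, 4), -1);
    {
        // `split_at` consumes the mutable view and returns two non-overlapping
        // mutable views, which may be handed to different threads.
        let (mut top, mut bottom) = result.view_mut().split_at(Axis(0), 2);
        rayon::join(|| top.fill(11), || bottom.fill(22));
    }
    println!("{:?}", result);
}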

Thanks for the reply!

For the method slice::split_at_mut(), I guess it can be applied when only one dimension of the matrix (tensor) is parallelized over, leaving the other dimensions to be handled inside the spawned tasks. That is also a workable solution.

Thanks also for the second suggestion about using raw pointers! I'm not familiar with raw pointers in Rust, so I will give them a try.

I'm fairly familiar with numpy, so I actually wrote this kind of code with ndarray at first (with dimensionality IxDyn), but I found that writing in place took too much time, which reminded me that transposing a large tensor in numpy also takes a long time. I guess that with dimensionality IxDyn, copying a strided tensor spends a lot of time on index computation, and SIMD probably cannot be fully utilized (even with -O3). I haven't tried a statically known dimensionality such as Ix2 or Ix3 yet, so I'm currently not sure whether this efficiency loss only happens with IxDyn.

Exactly. And this may in fact be entirely adequate parallelism, if the one dimension you split has enough length to occupy all available cores.

In general, something to keep in mind: splitting as finely as possible is not necessarily more efficient. An iteration that distributes its work across cores (as rayon does) cannot take advantage of basic compiler optimizations of loops such as unrolling and SIMD, and has more synchronization overhead. So, when the work to be done per element is a small numerical computation, you should make sure that there is some innermost loop that handles a decent chunk of elements in an ordinary serial way (using std::iter::Iterator instead of parallel iterators) to ensure you're getting the maximum throughput available from each core, not just occupying all cores.
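
Concretely, a sketch of that pattern (the sizes are made up): one rayon task per row, with the per-row work left as an ordinary serial loop that the compiler can unroll and vectorize.

use rayon::prelude::*;

fn main() {
    let (rows, cols) = (1024, 1024);
    let mut data = vec![0.0_f64; rows * cols];
    // One parallel task per contiguous row; everything inside the closure is serial.
    data.par_chunks_mut(cols)
        .enumerate()
        .for_each(|(r, row)| {
            for (c, x) in row.iter_mut().enumerate() {
                *x = (r * cols + c) as f64 * 0.5;
            }
        });
    println!("{}", data[cols + 1]);
}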

When adding parallelism, it's easy to think: I have 100% CPU utilization across all cores, so I have achieved maximum parallelism, hooray. But, that measurement doesn't tell you anything about the throughput, which is just as important. Measured in consumption of CPU cycles per work done, parallel code is always less efficient than sequential code, due to synchronization overhead; it just lets you spend more cycles per second so that the final answer is available faster.
