Parallel rust - Splitting up multiple matrices


#1

Hi

I am currently playing around with parallel rust, but I fail at what looks like a very simple task. I am trying to implement a parallel jacobi solver, i.e. I have two matrices that I want to split up among multiple threads. I read from one matrix and write another one. I know that there will be no race condition if splitting up the index range of the matrix will be done correctly. However, I can’t get it to compile

Here is some sequential code https://play.rust-lang.org/?gist=9ae75a66d7ba718f2fb07d76a89d96ec&version=stable of what I am trying to do.

And here is some parallel (not compiling) code of what I had in mind: https://play.rust-lang.org/?gist=1589a49d59f40a793c514be442373cfb&version=stable This is very bare bone just showing the idea. I have tried various approaches that all did not work out.

Can anyone push me in the right direction of what I am missing?


#2

The usual way one approaches these problems in Rust is to help the borrow checker a little bit by splitting the array where you intend to write into disjoint slices. Two common algorithms for doing so are to recursively split the array in halves (via [T]::split_at_mut()) or to cut it in fixed-size chunks (via [T]::chunks()).

In your case of an index-based computation, there is a little extra twist, which is that you need to track the index information during the array split, because you will need to reach the corresponding index in the source array during the Jacobi computation.

Once you have logic for recursively splitting the array, you can go one step further and feed it to an automatic parallelization engine like Rayon so that you do not need to think deeply about how many chunks you should make (at least as many of you have CPU cores for maximal CPU utilization, more is better for load balancing, but too small a chunk and you will start paying more in overhead). Rayon is basically a cousin of Intel TBB in C++, it uses similar abstractions (fork-join, data-parallelism) and internal logic (recursive task splitting, work-stealing thread pool), and it just officially reached 1.0 which means that backwards compatibility is guaranteed in the long term, for all of these reasons I would strongly recommend it.

Nevertheless, if you are not comfortable with that level of magic and want to manage your parallel tasks yourselves, note that crates like rayon and crossbeam can also take care for you of the lifetime issues that inevitably arise when splitting work across std::threads (which are very strict about their input data being 'static, since they can outlive the main thread) by providing a “scoped thread” abstraction which guarantees that threads will be joined before the computation returns, and can accept data of finite lifetime as a result.

By putting those pieces together, you get a pretty nice setup for parallel array computations in Rust!


For more detailed discussions and some code, I will point you towards other posts I wrote on this topic, as it has been quite popular lately:


EDIT: Just saw that Rayon’s parallel iterators on indexed types like Vec have the enumerate() method, which means that you can get a 1D index into a Vec being iterated upon “for free” and convert it back to your 2D index. You may prefer this to defining a custom parallel iterator via iter::split() as I suggested in my previous posts. I’m not sure if it’s much less work or code overall, as split is already remarkably concise for what it does, but it will be higher level and easier to understand for newcomers to Rayon.


#3

Thanks a lot for the links and the detailed explanation. It took me some time, but I came up with

https://play.rust-lang.org/?gist=4415eacdfe535e2e8377ab6a6c3c6414&version=stable

It seems to produce the correct results and uses multiple threads. I am not happy with it creating threads for every iteration, but if I try to change this it looks like I need to share both matrices mutable with all threads. As a thread also needs to read data that is produced by other threads, I need this to be synchronized. A barrier as shown in

https://play.rust-lang.org/?gist=1589a49d59f40a793c514be442373cfb&version=stable

would just be fine. However, as far as I understand it:

  • I cannot split up the matrices to make the borrow checker happy, as I will read and write to both matrices and for reading there is an overlap.
  • Sharing the whole matrices would require them to have the sync trait and I don’t want to synchronize individual accesses as it is not required as long as the barrier is missing.

I guess, I am just missing something here. Any ideas? I am not using rayon on purpose for this, because I feel like I am learning more when doing it by hand.


#4

To start with a minor point, looking at your current code, I do not think you need to collect scoped thread handles in a Vec if you do not intend to use them. By itself, crossbeam will ensure that the threads are joined before the host scope is exited, so you can simplify the chunks loop into…

// loop over next chunks
for chunk in next.chunks_mut(chunk_size) {
    scope.spawn(move || {
        iteration(&current, chunk, size_x, global_index);
    });

    global_index += chunk_size;
}

…but again, this is only a nitpick with minor performance impact. Now, onto your real question: is it possible to avoid spawning and destroying threads on every iteration, while otherwise retaining the benefits of the scoped thread approach (no explicit unsafe code bypassing the borrow checker, and no synchronization of individual accesses)?


The key to answering this question in a satisfactory way is to realize that crossbeam’s scoped thread API does two things, which are only related by an API design choice:

  • It spawns threads on which work can be executed.
  • It allows for scoped work execution on these threads.

To avoid spawning and destroying threads all the time in an iterative algorithm, while otherwise retaining the usability and performance benefits of scoped threads, we will need to separate these two features by distinguishing…

  • One “outer” scope, possibly global, in which a pool of OS threads is spawned
  • One “inner” scope in which work is sent to this outer thread pool, and subjected to the usual work-scoping guarantees (wait for the end of the work before exiting the inner scope).

In this scheme, the “inner” scope sends tasks to the “outer” scope via a communication channel, and awaits a message signalling panic or task completion at the end of the scope. As with any scoped thread API, ensuring memory safety in this situation requires carefully crafted unsafe code, which is why I would not recommend that you write this code yourself when very nice people have already written it for you.


I know of two libraries which allow such scoped execution inside of an outer thread pool: rayon, via the scope() and join() APIs, and scoped_threadpool, via the Pool::scoped() API.

For learning purposes, you may want to start with scoped_threadpool, as it is a minimal implementation of the separation of thread pool and scoped execution that I discussed earlier. For production use in a larger application, I would probably favor use of rayon, as it is offers stronger API stability guarantees, has a broader feature set (e.g. can slice input work automatically if you let it do so), and has received more performance optimizations (e.g. a scalable work distribution system based on work-stealing lock-free queues).


#5

Thanks again, really appreciated (also the nitpicking). :smiley:

I have created a version using the scoped_threadpool. It is available here

https://play.rust-lang.org/?gist=2317992385aea6da068096fb49f40d86&version=stable

in case anyone else is interested. If anyone sees anything that I should have done differently, feel free to let me know. I thought a bit about if I should try to include the while loop in the Pool::scoped() closure, but the borrow checker made this complicated/impossible(?) and have decided to leave it as is for now.


#6

To give you an explanation on this, it’s because scope::execute() extends the borrow of matrix due to passing current in there and so you can’t call matrix.swap(). Because execute() doesn’t wait for the closure to finish, the borrows of anything in there cover the lifetime of the scope itself, which is the scoped() block.

This is a correct design on scoped_threadpool’s part. As far as it’s concerned (and the borrow checker), it wants to make sure that mutable borrows cannot be shared across the workers running execute().

Once the scoped() call completes, all work is done and you can borrow the matrix again. That’s why the loop outside works.

At least that’s my understanding.


#7

From a higher-level point of view, this is the borrow checker saving you from a data race :slight_smile:

The end of a scope is the synchronization barrier where a scoped task API will wait for the tasks to complete. If you included the entire while loop into the scope, then scoped_threadpool would not wait for each iteration to complete before starting the next one. And therefore, you would start iteration N+1 before iteration N is finished, and potentially read half-written data as a result…


EDIT: Beyond that, advanced nitpicking to the rescue!

There are a couple of things which you could want to encapsulate further in a “production” version of this code, and doing so might help you get more familiar with the Rust type system. One example is multi-dimensional array manipulations. In an ideal world, you might want to, for example…

  • Manage size_x and size_y together with the associated data in a single opaque object (so that e.g. the size information cannot get out of sync with changes to the underlying matrix).
  • Support 2D indexing, without user-side manual index computations.
  • Add a way to borrow (possibly mutable) 2D slices/chunks from a 2D array.
  • Provide a higher-level iteration pattern over a 2D slice where the user does not manually loop over indices (and then make it faster than a hand-written double loop by eliminating the array bounds checks from it).
  • Support cutting off the edge elements of the output matrix in a single low-cost operation (i.e. get a “view” of the output matrix where the indices are the same, but the edge elements are absent from high-level iterators and trying to access them manually is a panic-inducing error).

The ndarray crate provides native support for many of these operations, but you can experiment with implementing some of them yourself as a learning exercise if you like.

Another example of pattern which you might like to abstract away is the double buffering design that you are using for flipping between an “input” and “output” matrix. This sounds like a nice self-contained and generally useful abstraction that you may want to extract for future use, if someone has not done so already.

Finally, now that you have seen how to use scoped threads directly, you can try to make a “high-level Rayon” (parallel iterator based) version of this code, just to get a feel of how the underlying abstractions differ, and what you can and can’t do with parallel iterators.