Hi everyone. I've been working on a project called paradis that aims to drastically simplify the writing of parallel code for cases where the access pattern is very difficult to express with Rust at the moment. It integrates nicely with rayon, and lets you obtain parallel iterators for a wide variety of access patterns with safe code, or a minimal amount of unsafe code.
" In principle, most of the abstractions could be zero-cost. However, similar to iterators, they are at the mercy of compiler optimizations."
I would say the issue here is caching, if you do not organise parallel processing with processor cache-lines in mind, performance is likely to be poor ( significantly slower than a single thread ). Apologies if I did not read the whole log carefully enough, I did just read quickly, and this was my reaction to the subject matter.
The big thing I'm missing, on looking at the blog post, is a description of how you avoid cache line thrashing.
CPUs don't access memory byte-at-a-time; they access it cache line at a time, and one thing that's critical when you're trying to write parallel code for performance is ensuring that each cache line of memory is either shared read-only, or exclusive to a single CPU.
In a library like paradis, I'd expect to see some effort made to ensure that two parallel threads never touch the same cache line; your motivating example ("one thread touching even indexes, the other touching odd indexes") makes this potentially interesting, since you could talk about how you rearrange that into one thread touching - say - 16 items (one x86-64 cache line), skipping 16, then touching 16, to avoid the cache line bouncing between CPU cores if the threads run in parallel.
One thing that would be nice to support is not only "unique" indices (which implies discrete, one-by-one iteration), but also general non-overlapping ranges. (This in turn would mean that the name UniqueIndexList should probably be changed to something like NonoverlappingIndexList. I see that the name DisjointIndexLists, plural, is already used for something else, which may not be optimal.)
@geebee22, @farnz: false sharing and cache thrashing are definitely crucial concerns for parallel programming. Not paying attention to it may indeed cause massive performance issues.
The goal of paradis is to allow you to express as many different kinds of parallel patterns as possible. It's up to the user to make the judgment of whether or not false sharing is a significant concern for their application. I think a library cannot both aim to be general, while at the same time taking this into account. I've chosen to attempt to solve the general problem here.
The even-odd example is an artificial toy example, and as presented will definitely run into problems with false sharing. However, if you replace the integers with some bigger type that's perhaps padded to the size of a couple of cache lines, you might not see any false sharing. Or, alternatively, perhaps the workload per entry is so great that false sharing is not much of a concern.
It's not entirely clear to me what you mean. Do you want to iterate over, say, integers described by the union of several non-overlapping ranges? For example, iterating over union(0 .. 3, 8..10) would yield the elements 0, 1, 2, 8, 9?
If so, then this could be done if paradis implemented a combinator like (0 .. 3).index_chain_bounded(8..10) which would just check that the bounds of the ranges (index lists) don't overlap in O(1) time.
I was expecting, given that you'd chosen a toy example that's chock-full of false sharing, to see some discussion of it in the context of paradis.
At one extreme, you could explain what it is, why it makes the example of mutating every other item from two threads bad, and how you design your system to avoid it - noting that shared-for-read is usually fine, it's just shared-for-write that's always a problem.
At the other extreme, create_par_iter could guarantee no false sharing, at the expense of less parallelism.