Mutably iterating over a sparse subset of items from a Vec (via their indices) in parallel (using Rayon)

Hi! I'm new to Rust, and below is some (non-working) Rust code that tries to mutably iterate over a subset of the Vecs within vecs and push an integer onto each in parallel (playground):

use rayon::prelude::*; // 1.1.0

fn main() {

    let mut vecs = vec![
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
    ];
    
    let mut indices = vec![4, 3, 1];
    
    // Ensure all indices are unique to ensure safety:
    indices.sort_unstable();
    let len = indices.len();
    indices.dedup();
    assert!(indices.len() == len);
    
    indices.par_iter().for_each(|index| {
        unsafe { vecs[*index].push(4); }
    });
    
    println!("{:?}", vecs);
    
}

As you can see, my plan was to make sure all the indices are unique so that I can safely use unsafe code within the for_each closure which mutates the vecs. But the unsafe block doesn't seem to be working because I get this compiler error:

cannot borrow `vecs` as mutable, as it is a captured variable in a `Fn` closure

In the "real" version of this code the vecs Vec is actually a Vec of structs that I'm calling a method on which mutates the struct. The number of items in the real vec is very large (e.g. 1,000 to several million elements), and the number of indices can be very small or very large (e.g. 0.1% or 100%).

Is there a way to get this sort of thing to work? Or is there a better way to do it? I couldn't think of any way to get it to work without relying on unsafe. Thank you!

Hi! An unsafe block doesn't make any code compile that wouldn't otherwise; it only allows you to:

Dereference a raw pointer
Call an unsafe function or method
Access or modify a mutable static variable
Implement an unsafe trait

But what you want is to get a &mut T inside a Fn closure. I don't know of any safe way to make this code work without a lock, but going even more unsafe works: playground (almost certainly UB), or with one less indirection: playground (you have to make sure all the indices are valid, though).
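Those playgrounds aren't reproduced here, but a sketch of the raw-pointer idea (not the exact linked code) could look like the following. SendPtr and push_at are names made up for this sketch, std::thread::scope stands in for rayon, and soundness rests entirely on the indices being unique and in bounds:

```rust
use std::thread;

// SendPtr is a name made up for this sketch: a wrapper so the raw pointer
// may cross thread boundaries. This is only sound because every thread
// touches a *different* element (the indices must be unique and in bounds).
struct SendPtr(*mut Vec<i32>);
unsafe impl Send for SendPtr {}
unsafe impl Sync for SendPtr {}

// Hypothetical helper: push 4 onto each inner Vec selected by `indices`.
fn push_at(mut vecs: Vec<Vec<i32>>, indices: &[usize]) -> Vec<Vec<i32>> {
    let ptr = SendPtr(vecs.as_mut_ptr());
    thread::scope(|s| {
        for &i in indices {
            let ptr = &ptr;
            s.spawn(move || unsafe {
                // Each thread mutates only its own, distinct element.
                (*ptr.0.add(i)).push(4);
            });
        }
    });
    vecs
}

fn main() {
    println!("{:?}", push_at(vec![vec![1, 2, 3]; 6], &[4, 3, 1]));
}
```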

2 Likes

Thank you! I haven't had to use raw pointer stuff yet, and this is the first time I've had to use unsafe code in Rust, so thanks for your explanation and code examples!

I would suggest just adding a Mutex, like
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=8b2e7f3a9ff10b133cb79b7cf065d76e
Uncontended mutexes, and especially parking_lot's, are very fast: it'll be a single atomic instruction for the lock and another for the unlock.
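The linked playground isn't reproduced here, but a dependency-free sketch of the same idea might look like this (push_locked is a made-up name, and std::thread::scope stands in for rayon's parallel iterator):

```rust
use std::sync::Mutex;
use std::thread;

// push_locked is a name made up for this sketch: wrap each element in a
// Mutex so a shared reference is enough to mutate it from any thread.
fn push_locked(vecs: Vec<Vec<i32>>, indices: &[usize]) -> Vec<Vec<i32>> {
    let locked: Vec<Mutex<Vec<i32>>> = vecs.into_iter().map(Mutex::new).collect();
    thread::scope(|s| {
        for &i in indices {
            let m = &locked[i];
            // try_lock().unwrap(): with unique indices the locks are never
            // contended, so a failed try_lock would indicate a bug.
            s.spawn(move || m.try_lock().unwrap().push(4));
        }
    });
    // Unwrap the Mutexes again once all the threads are done.
    locked.into_iter().map(|m| m.into_inner().unwrap()).collect()
}

fn main() {
    println!("{:?}", push_locked(vec![vec![1, 2, 3]; 6], &[4, 3, 1]));
}
```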

Also, this can all be done in safe Rust without locks, but it takes more effort.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=126137cdb85dab4e934cd3f6a8841b65

3 Likes

I think the first playground I made is UB: IndexMut takes a mutable reference to the outer Vec, meaning every thread will have one.

@Cocalus The big downside I see with the Mutex approach is if the Vec is present in lots of areas of the code base where no threading is required.
The second one should be benchmarked; I tried the "automatic" parallelization of a sequential iterator and it wasn't great in my case. Rayon will try its best, but with more information it can work better. Implementing Producer and ParallelIterator would solve the issue, but that's a lot of work in total.

I collected the &mut T iterator into a Vec before using par_iter, which avoids the performance issues with plain iterators. There's the cost of allocating a Vec of references, but that's often fine.
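A sketch of that collect-the-references-first pattern in safe Rust (disjoint_refs is a made-up helper name, and std::thread::scope stands in for rayon's into_par_iter; the indices are assumed sorted, unique, and in bounds):

```rust
use std::thread;

// disjoint_refs is a name made up for this sketch. It peels off one &mut
// per index with split_at_mut, which is why the indices must be sorted,
// unique, and in bounds (i - offset would underflow otherwise).
fn disjoint_refs<'a, T>(mut slice: &'a mut [T], sorted_unique: &[usize]) -> Vec<&'a mut T> {
    let mut refs = Vec::with_capacity(sorted_unique.len());
    let mut offset = 0;
    for &i in sorted_unique {
        // Take ownership of the remaining tail, then split off element i.
        let rest = std::mem::take(&mut slice);
        let (_, rest) = rest.split_at_mut(i - offset);
        let (item, rest) = rest.split_first_mut().unwrap();
        refs.push(item);
        slice = rest;
        offset = i + 1;
    }
    refs
}

fn main() {
    let mut vecs = vec![vec![1, 2, 3]; 6];
    // Collect the disjoint &mut references first...
    let refs = disjoint_refs(&mut vecs, &[1, 3, 4]);
    // ...then hand them to the parallel part. std::thread::scope stands in
    // for rayon's refs.into_par_iter().for_each(...) here.
    thread::scope(|s| {
        for r in refs {
            s.spawn(move || r.push(4));
        }
    });
    println!("{:?}", vecs);
}
```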

Oh! Sorry I missed that :pensive:. What do you think about not making it an iterator? playground

1 Like

They'll probably be about the same speed. Iterators in Rust can be faster than classic for-loop patterns. Rayon being slower with par_bridge is due to the extra work and the cache misses from load balancing a stream. While my iterator version is more flexible, its code is harder to follow.

https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=cf9d2f42b562e05ba4099b2006669e72

I made your slice version more generic, but it has underflow issues. For example, indices of [0, 0, 0] will cause integer underflow; in release you'll get a different panic. The explicit panic in mine was to make the source of the issue consistent and more obvious if a bug happens.

Yes, the speed should be almost identical; the non-iterator version just reduces the amount of code.

You can't have indices of [0, 0, 0] since they have to be unique. If you use such indices, execution will stop at the assertion before reaching the function. The indices also have to be sorted to work in this version. The checks could be moved inside the function to be safer, or a comment could be added to explain the requirements.

I should have been clearer: I'm aware that the indices should be unique. I was pointing out that this is only checked dynamically, so in the case of a bug the caller gets a different panic depending on release vs debug builds; explicitly panicking with meaningful context would provide a nicer user experience.

In general I treat integer overflow as a particularly bad kind of panic, since it won't occur in release builds, and, more importantly, an overflow that wraps in release may not lead to any other panic. Then whether or not something panics depends on the build options, which isn't great.

@Cocalus, thanks for your thoughts on this! In your Mutex solution, is there any reason you used .try_lock().unwrap() instead of .lock()? Will the former give better performance if it is known in advance that threads won't try to take a lock at the same time? Also, what is meant by "atomic instruction" in this case? If the push() method actually just incremented a counter (I'm assuming that's an atomic instruction), then would the Mutex solution have 3 atomic instructions (lock, increment, unlock) whereas the raw pointer would have only 1 (increment)? I think the Mutex solution is good for my case anyway (since the analogue of push in my case is fairly expensive), but I'm just curious.

I was curious about the performance of the various approaches, so I timed 4 of them:

The methodology here is very bad (the playground sometimes seems to provision more or fewer threads, causing the timings to halve or double), so those results should be taken with a grain of salt, but I think they're roughly accurate relative to one another.

try_lock was there to cause a panic if the locks ever became contended; since they shouldn't be contended, it helps detect bugs. Also, if there were contention, .lock() might waste CPU capacity, since one of the rayon worker threads would be waiting on the lock instead of processing.

Atomic instructions are different from normal instructions: they are a primitive for synchronizing between multiple threads. An atomic operation makes sure all CPUs see the same value; on some platforms, different CPUs' caches of the same memory address may disagree if they both write to it. Atomics also enforce ordering of instructions, at both the compiler and CPU level, since both are allowed to reorder things as long as that doesn't affect single-threaded semantics. For example, if you build a data structure at some address and then set a flag that some other CPU is polling, without atomics the other CPU may observe the flag as set before it observes the data structure.

Most multithreading primitives are built on top of atomics. They are notoriously difficult to use correctly for anything much beyond basic counters. Keeping all the CPU caches in sync, and to a lesser extent the limits on reordering, make them slower than normal instructions; how much slower depends on the contention for their cache line. You can find more info by searching around. The atomics Rust provides can be found in https://doc.rust-lang.org/std/sync/atomic/index.html
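As an illustration of the basic-counter case, using std's atomics (parallel_count is a made-up name for this example):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// parallel_count is a name made up for this example: several threads bump
// one shared counter; each fetch_add is a single atomic read-modify-write.
fn parallel_count(threads: usize, per_thread: usize) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All increments are visible: no lost updates, unlike a plain integer.
    counter.load(Ordering::Relaxed)
}

fn main() {
    println!("{}", parallel_count(4, 1000)); // prints 4000
}
```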

The parallel parts of the MultiMutIter API and get_multi_mut are pretty much the same as the raw pointer version. But there is overhead in building mut_vecs, which is where your benchmark differences mostly come from. That can mostly be mitigated with the MultiMutIter API by using an iterator for the indexes instead of a Vec, so only one Vec is built in either case, but that may not always fit. The Mutex has the advantage of being the least restrictive, since it doesn't even need the indexes ordered. I get pretty much the same speeds for the below by not including the indexes or mut_vecs allocations in the timing.

MultiMutIter
Raw Pointer

If you want better benchmark measurements, you could use criterion locally.

6 Likes

Brilliant - thank you for the explanation! This thread has been very informative.

Hi,
I would like to throw in that you could try switching the thing you're iterating over, something like this:

use rayon::prelude::*; // 1.1.0

fn main() {
    let mut vecs = vec![
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
        vec![1, 2, 3],
    ];

    let mut indices = vec![4, 3, 1];
    indices.sort_unstable(); // or par_sort_unstable

    vecs.par_iter_mut()
        .enumerate()
        .filter(|(i, _)| indices.binary_search(i).is_ok())
        .for_each(|(_, vec)| {
            vec.push(4);
        });

    println!("{:?}", vecs);
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=46b61a279649ee683090b959374dce53

I have no idea how fast this actually is; I'd guess it wastes a lot of clock cycles if indices is small.

2 Likes

Thanks! Maybe a good solution would be to benchmark the various approaches with different numbers of indices (relative to the length of vecs) and choose an approach dynamically based on that (since the number of indices can vary widely in my particular case).