Multithreading with multiple writes to same slice without lock

I need a way to write to an array with multiple threads, but without mutexes because I need maximum performance.

My algorithm guarantees that the accessed indices are unique. I've spent the last 8 hours trying to find out how to do this, so any help would be very much appreciated.

I'm also I still don't really grok how "Cell", "Arc" and so on work because I'm new to the language.

This is what I found out, but I don't really know how to make it work:

struct UnsafeContainer<T> {
    data: UnsafeCell<T>,
}

impl<T> UnsafeContainer<T> {
    fn new(data: T) -> UnsafeContainer<T> {
        UnsafeContainer {
            data: UnsafeCell::new(data),
        }
    }
}

unsafe impl<T> Sync for UnsafeContainer<T> {}
unsafe impl<T> Send for UnsafeContainer<T> {}

pub fn caller(arr: &mut [i32]) {
    let mut buf: Vec<i32> = vec![0; arr.len()]
    do_something(&mut arr[..arr.len()/2], &mut arr[arr.len()/2..], &mut buf[..], 8);
}

fn do_something(arr1: &[i32], arr2: &[i32], arr3: &'static mut [i32], num_cores: usize) {
    let mut children = Vec::new();

    let shared = Arc::new(UnsafeContainer::new(arr3));

    for i in 0..num_cores {
        let container = shared.clone();
        children.push(thread::spawn(move || unsafe {
            println!("{:?}", container.data.get());
            // I really want to read from one of the arrays here and write to arr3
        }));
    }

    for child in children {
        child.join();
    }
}

I added the static because the compiler told me to add it, but then I have to add it to the method that calls this, and so on until I'm not able to add static anymore. :frowning: This is probably a really stupid question due to me not understanding Rust's type system, but I would be very thankful if you could help.

Edit: I also tried with Rayon, but that gives me arr1 is a & reference, so the data it refers to cannot be borrowed as mutable. Besides, by using this crate, won't it somehow use mutexes under the hood? That would defeat the purpose of my algorithm.

Cell is not a suitable abstraction for this. Cell and its relatives (RefCell, UnsafeCell, etc.) tell the compiler that "I have a shared, immutable pointer, but I would like to mutate the pointed value behind it anyway, so do not assume immutability when optimizing the code".

It is also a big red flag that you need unsafe for such a simple thing. Adding 'static blindly usually doesn't help, either, but it does seem to be necessary here, because spawn()ed threads may outlive the scope of the spawning function.

In contrast, your problem is that you want to tell the compiler that you have distinct, non-overlapping subslices. For this reason, I was initially going to suggest the following. Instead of trying to borrow the whole thing mutably multiple times, or by trying to cheat with UnsafeCell, define non-overlapping subslices, perhaps using the standard split_at_mut() or split_first_mut() methods of mutable slices.

However, due to the 'static requirements above, that won't work in this case. I think the cleanest solution would be to return the computed subarray from the thread function, and use the fact that join() returns this result. See this playground:

fn do_something(arr1: &[i32], arr2: &[i32], arr3: &mut [i32], num_cores: usize) {
    // try making subslices of a balanced size, round up
    let total_num_elements = arr3.len();
    let max_num_elements = (total_num_elements + num_cores - 1) / num_cores;

    let children = (0..num_cores).map(|i| {
        let start = i * max_num_elements;
        let num_elements = if i < num_cores - 1 {
            max_num_elements
        } else {
            // last child gets the rest of the slice
            total_num_elements - i * max_num_elements
        };
        let sub_arr1 = Vec::from(&arr1[start..][..num_elements]);
        let sub_arr2 = Vec::from(&arr2[start..][..num_elements]);
        
        let handle = thread::spawn(move || {
            println!("Thread #{} started with range {}..{}", i, start, start + num_elements);
            // perform computation and return partial results
            sub_arr1
                .into_iter()
                .zip(sub_arr2)
                .map(|(x, y)| x * y)
                .collect::<Vec<_>>()
        });
        
        (i, start..start + num_elements, handle)
    })
    .collect::<Vec<_>>();

    for (i, range, handle) in children {
        let partial_results = handle.join().unwrap();
        arr3[range.clone()].copy_from_slice(&partial_results);
        println!("Thread #{} COMPLETED with range {}..{}", i, range.start, range.end);
    }
    
    println!("{:?}", arr3);
}
1 Like

Thank you very much for your answer! I don't think this will suit my purpose though, if I understand your solution or Rust isn't somehow magically optimizing the intermediate result away. Copying the results to the array after collecting them is too slow for my purpose, I need to write them into the third array in parallel as well. I know this seems awfully contrived, but I really need this.

Edit: it's actually not just for performance reasons, I need to write into arr3 at specific indices, so collecting them and copying them won't do. The indices are calculated ad hoc by each thread, I don't know them before the parallel computation.

I'm curious for the reason. If the reason is that you need to call this in a loop, then it would be better to hoist the parallelism out of the loop, and work with larger chunks of data.

I'm trying to code some parallelized variants of sorting algorithms, for example mergesort. In that case, the parallelization actually takes place in the merging phase. The idea is to get an element from one half, do a binary search on the second half and add the computed index to the index of the original array. The resulting index is the position in the third array. So I can't parallelize larger chunks, in fact the algorithm should generalize to an arbitrary amount of processors, the more, the better. Optimally every core would write only one element.

Would rayon::scope help?

1 Like

Considering that parallelizing the merge algorithm doesn't exactly lead to mutually disjoint sub-slices, I don't think there is a way to do that in safe Rust. However each thread could provide a slice of the resulting indexes and then have a single thread perform the permutation.

1 Like

I tried, but I get this error: lifetime 'static required rustc (E0621)
Frankly, I don't really know what to do with this kind of error and simply slapping static on everything doesn't seem to help the problem. I think the scoped thread is supposed to solve exactly this problem because now the compiler can know the the lifetime is less than that of our object. But somehow it doesn't want to work.

Any idea how to do this in unsafe?

I can't use a single thread to write to the buffer (arr3). That defeats the purpose of the algorithm. The resulting merge runtime would be O(n) again, which is the original runtime of classical mergesort when it should really be a parallel runtime.

I'm also looking at potentially implementing other algorithms that don't care about data races. I picked Rust because of its great concurrency support, but is this maybe something that Rust isn't suited for? I'd be happy to just write one line of unsafe code that mutates arr3, but it seems more complex than just the equivalent C/C++ code. It would be nice if there was just an unsafe mode that let me do whatever I want and trusted my choices, as that is what I understand unsafe is for.

You can avoid the errors about 'static by using crossbeam::scope to spawn your threads. As for writing to the array without synchronization, you can do that with UnsafeCell or raw pointers.

1 Like

If Rust's safety features are incompatible with your algorithm, you can write it using raw pointers. You'll be manually responsible for maintaining all of the memory invariants that rustc expects; they are different from C's invariants, sometimes in subtle ways-- This is the ultimate 'do what I say' mode in Rust, and other unsafe approaches are usually easier to get right.

assert!(arr3.len() > 2);
let arr3_ptr = arr3.as_mut_ptr();
unsafe {
     arr3_ptr.offset(1).write(42);
}
1 Like

You can do it with UnsafeCell like this:

use std::cell::UnsafeCell;

#[derive(Copy, Clone)]
struct UnsafeSlice<'a, T> {
    slice: &'a [UnsafeCell<T>],
}
unsafe impl<'a, T: Send + Sync> Send for UnsafeSlice<'a, T> {}
unsafe impl<'a, T: Send + Sync> Sync for UnsafeSlice<'a, T> {}

impl<'a, T> UnsafeSlice<'a, T> {
    pub fn new(slice: &'a mut [T]) -> Self {
        let ptr = slice as *mut [T] as *const [UnsafeCell<T>];
        Self {
            slice: unsafe { &*ptr },
        }
    }
    
    /// SAFETY: It is UB if two threads write to the same index without
    /// synchronization.
    pub unsafe fn write(&self, i: usize, value: T) {
        let ptr = self.slice[i].get();
        *ptr = value;
    }
}

Note that unlike the approach that just uses raw pointers, this approach still has the compiler verify that you have no use-out-of-bounds or after-frees, which you can do by using crossbeam::scope. Only the data-race part is up to you.

Edit: Added running of destructors of previous values in array.

5 Likes

If you're working exclusively with i32s, I think you can also reinterpret them as atomics:

let arr3: &[AtomicI32] = unsafe { &*(arr3 as *mut [i32] as *mut [AtomicI32]) }

This gives you synchronized access, but at a much finer level of control than you'd get with a Mutex.

1 Like

I strongly suggest you read the first Rust Koan.

This feels complex because it is complex... Using unsafe Rust or C doesn't make it any less complex, it just makes the compiler warnings go away.

That said, you may want to have a look at rayon's implementation of a parallel merge sort for inspiration. It's mostly copied from the Rust standard library, with adaptations to enable parallelism.

Oh wow, I can't reiterate how much this helps me. Thank you. I didn't only ask this question here, and no one could help me. This really is great, made my day!

Am I correct that manually implementing Sync and Send is the only way, because I read between the lines that there is a more "raw" way to do it. Mainly asking out of curiosity, but other than that this finally solves my problem. :pray:

The "more raw" way is to just use raw pointers directly with things such as arr3_ptr.offset(1).write(42) like in @2e71828's post. You still need to manually implement Send and Sync on a wrapper type to do that though.

The "more raw" method has the downside of not verifying the code to be free of out-of-bounds or use-after-free either, which the UnsafeCell approach does help you with.

1 Like

Thank you, I'll check it out.

I also asked on StackOverflow but didn't get a good answer. Would it be okay if I post your code on StackOverflow as a Community Wiki, or do you want me to point you to said thread so you can answer it yourself?

If you link it here, I'll post a comprehensive answer.

1 Like

https://stackoverflow.com/questions/65178245/rust-writing-to-shared-memory-with-multiple-threads-no-mutexes

That is the post. It has one downvote, probably because my initial question wasn't so well received. I hope I improved it now so it's beneficial for others who see this in the future.