I'm trying to implement a par_swap_remove, to parallelize a Vec::swap_remove using rayon.
The idea is to copy the last n elements of the Vec into the positions of the n indices which are to be removed (assuming no duplicates among the indices).
I've tried for about 1.5 hours now and this is what I've come up with. It segfaults and doesn't do what it's supposed to.
I'm extremely inexperienced with the dark arts of unsafe Rust and have struggled to share the Vec across threads - which is why I created the SafeSync struct. How do I do this properly, without locking?
Additionally, I couldn't find a method to directly alter the length of the Vec. (see FIXME in the playground)
Furthermore, when using SafeSync to pass the Vec between threads, I couldn't avoid consuming the Vec. I'd much rather have the procedure take a mutable reference to it.
I'm worn out from battling the borrow-checker. Please help me.
Also, your implementation of SafeSync creates aliasing unique pointers, which is UB. I cleaned up the code and moved to using raw pointers instead. You still need SafeSync to pass the raw pointer. Once you validate the input to the par iter, you should be set!
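The raw-pointer approach could look something like this. This is a sketch, not the actual cleaned-up playground code, and it uses std::thread::scope instead of rayon so it runs without external crates (with rayon, the spawning loop would become indices.par_iter().enumerate()). It assumes the indices are unique and none of them falls in the last n positions, so every read source and write target is distinct:

```rust
use std::thread;

// Sketch: copy the last n elements into the n positions being removed, in
// parallel, then shrink the Vec. `par_swap_remove` is a hypothetical name.
fn par_swap_remove(v: &mut Vec<u64>, indices: &[usize]) {
    let n = indices.len();
    if n == 0 {
        return;
    }
    let new_len = v.len() - n;
    // Precondition: unique indices, none in the tail region [new_len, len),
    // so all parallel reads and writes are disjoint.
    assert!(indices.iter().all(|&i| i < new_len));

    // Wrapper so the raw pointer can cross thread boundaries (the SafeSync
    // role); sound only because the precondition makes all accesses disjoint.
    struct Ptr(*mut u64);
    unsafe impl Sync for Ptr {}
    let ptr = Ptr(v.as_mut_ptr());
    let ptr = &ptr;

    thread::scope(|s| {
        let chunk = (n + 3) / 4; // split work across up to 4 threads
        for (c, idxs) in indices.chunks(chunk).enumerate() {
            s.spawn(move || {
                for (j, &idx) in idxs.iter().enumerate() {
                    let k = c * chunk + j; // offset of the tail element to copy
                    unsafe { *ptr.0.add(idx) = *ptr.0.add(new_len + k) };
                }
            });
        }
    });

    // Re: the FIXME about altering the length — truncate() shrinks it safely.
    v.truncate(new_len);
}
```

Note that the function takes &mut Vec rather than consuming the Vec: only the raw pointer crosses the thread boundary, so the borrow checker never sees the Vec itself move.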
Actually, I think you will still have data races, for example,
input = [0, 1, 2]
indices = [0, 2]
Both of these indices will try to swap with index 2. I don't know a way around this other than not swapping on these sorts of inputs, so you will need to check that the list of indices, and the positions they would swap with, are all unique. I'm not sure if that is possible given how par_iter().enumerate() works.
It seems like unique swap targets would be implied if the indices which are to be removed in parallel don't overlap with the last n positions, which have not yet been removed.
I could guarantee unique indices if I assume the indices to be sorted, let's say in ascending order; checking uniqueness and sortedness is easy.
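That check might be sketched like this (indices_are_safe is a hypothetical helper, assuming the ascending-order requirement from above). The [0, 2] example earlier fails it because, with 3 elements and 2 removals, index 2 lies in the tail region [1, 3):

```rust
// With `len` elements and n = indices.len() removals, the tail region that
// supplies replacement elements is [len - n, len). Any removal index inside
// that region would make two tasks touch the same slot.
fn indices_are_safe(len: usize, indices: &[usize]) -> bool {
    let n = indices.len();
    if n > len {
        return false;
    }
    let new_len = len - n;
    indices.windows(2).all(|w| w[0] < w[1])     // strictly ascending => unique
        && indices.iter().all(|&i| i < new_len) // no overlap with the tail
}
```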
Putting the indices in a Vec<usize>, I can take from the iterator sequentially until the indices no longer swap into the last-n area. Then, when it's safe, I can continue in parallel.
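One way that hybrid could look (a sketch under the sorted-ascending, no-duplicates assumption; hybrid_swap_remove is a hypothetical name): peel off the largest indices with ordinary swap_remove while they fall in the tail region, then handle the rest with the copy-from-the-tail scheme. Note that v.len() - rest.len() stays constant through the sequential loop, since each swap_remove shrinks both by one:

```rust
fn hybrid_swap_remove(v: &mut Vec<u64>, indices: &[usize]) {
    let mut rest = indices;
    // Sequential prefix: while the largest remaining index lands in the tail
    // region of the remaining removals, remove it with plain swap_remove.
    while let Some((&last, head)) = rest.split_last() {
        let new_len = v.len() - rest.len();
        if last < new_len {
            break; // all remaining indices are outside the tail: safe to parallelize
        }
        v.swap_remove(last);
        rest = head;
    }
    // All remaining indices now map to distinct tail slots; this loop is the
    // part that could run in parallel (e.g. rest.par_iter().enumerate()).
    let new_len = v.len() - rest.len();
    for (k, &idx) in rest.iter().enumerate() {
        v[idx] = v[new_len + k];
    }
    v.truncate(new_len);
}
```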
Is this just for fun or are you actually doing this for performance? If it's the latter, have you benchmarked it?
I'd be surprised if parallelizing this actually speeds anything up noticeably. The amount of CPU work per item is extremely small, and I imagine the costs largely boil down to memory fetching. There is also a high level of contention for reading and writing to cache lines near the end of the vec.
Mostly for fun, though I had still hoped to see some speedup.
I just benchmarked the version I posted, and it's slower than the sequential version, even for large Vecs with 2 million elements and 1.1 million of them being removed.