Does anyone know if current_thread_index is guaranteed to return an index (unique for this OS thread) in 0..NumberOfThreadsInPool?
Put another way, if I
Create a rayon threadpool with 32 threads
Create a Vec of 32 elements (each element larger than a cache line), call it data
Run rayon::par_iter() on a set of jobs and in each closure:
a) Get the current index using let i = current_thread_index()
b) write/update data[i] += ... many times in each closure (note that this requires unsafe / raw pointers to update the Vec, since it is mutated from multiple threads)
After the par_iter is finished sum up data
Would this then be an OK/safe use of unsafe, i.e. would I be guaranteed to get the same result as if I had done it sequentially?
Provided that the thread indices I get from rayon are consistent (they are documented as such) and in 0..32 (I couldn't find this in the docs), this feels like it should be safe to me - but I have been bitten by assuming that incorrectly before..
If you're asking whether the current rayon version 1.12.1 will always have this property, it looks like yes: the indices are created by enumerate on a collection built from a range over the thread count.
If you're asking whether rayon guarantees this in future versions, then all there is to go on is the documentation, and the documentation makes no such guarantee. Luckily it's easy to just check that the index is within range and panic if it isn't.
More specifically, does that mean that the following code is safe?
My understanding of the unsafe pointer rules is that it should be, but I'm a bit uncertain whether I need to Pin<> the Vec in MyData? However, the MyData struct ensures that it is never resized etc., so it should never be moved?
Or here is maybe a better version, where I introduce a PhantomData of a shared reference to ensure that nobody accidentally .take()s the Vec out while the Shared structure is still in use by other threads:
Re false sharing (yes, it will be written often): I thought I would completely eliminate that risk by ensuring that T has a size that is a multiple of the cache-line size (hence the _padding: [i32; 64 / 4 - 1] in the example). Am I missing something in assuming that solves the problem?
The code offloads the unsafe code to spin::mutex::Mutex and uses try_lock().unwrap() to ensure unique access. Instead of manual padding I used a wrapper with align(256), which already accounts for the mutex.
I did not make spin part of the public interface, so the implementation detail doesn't leak. You can get rid of some boilerplate if you do leak it.
However, that is not going to work for the case I'm looking at (inside a performance-critical simulation, e.g. ensuring interleaving of AVX FMA instructions to saturate the backend). Also, I operate on AVX-512 values, so the mutex would effectively double the memory requirement, since an AVX-512 value happens to fill a cache line completely by itself.
(The "unsafe" playground code, with inlining and with the assert changed to debug_assert, reduces down to just a memory access inside my loop - a memory access that will execute in parallel with subsequent numeric operations.)