I need to share an AtomicUsize between threads, and was surprised to find that dereferencing Arc<AtomicUsize> was four times faster than dereferencing *const AtomicUsize. This is counterintuitive, because I expected them to have similar cost. Any explanations?
use std::{
    sync::{
        atomic::{AtomicUsize, Ordering},
        Arc,
    },
    thread::spawn,
    time::Instant,
};

struct Wrapper(*const AtomicUsize);
unsafe impl Send for Wrapper {}

// `test_ptr()` and `test_arc()` are identical except that
// the former uses *const AtomicUsize while the latter uses Arc<AtomicUsize>
fn test_ptr(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| AtomicUsize::new(0))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(v);
        handles.push(spawn(move || {
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn test_arc(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| Arc::new(AtomicUsize::new(0)))
        .collect::<Vec<_>>();
    for v in &values {
        let p = v.clone();
        handles.push(spawn(move || {
            for _ in 0..count {
                p.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    const THREADS: usize = 10;
    const COUNT: usize = 0x100000;
    let now = Instant::now();
    test_ptr(THREADS, COUNT);
    println!("Using *const AtomicUsize took {:?}", now.elapsed());
    let now = Instant::now();
    test_arc(THREADS, COUNT);
    println!("Using Arc<AtomicUsize> took {:?}", now.elapsed());
}
On release build:
Using *const AtomicUsize took 147.140065ms
Using Arc<AtomicUsize> took 32.4519ms
On debug build:
Using *const AtomicUsize took 304.676667ms
Using Arc<AtomicUsize> took 59.330349ms
You have false sharing. Atomic operations contend at cache-line granularity (a cache line is typically 64 bytes, aligned), so distinct atomics sitting right next to each other in memory can contend if they land on the same cache line. Each Arc is a separate heap allocation that also carries the two reference counts (making it bigger), so Arcs are much less likely to share a cache line.

Padding the vector of atomics so that each atomic lands on a different cache line fixes the performance and beats Arc, but you still have unsafe code. Depending on what you're trying to do, Rayon may be able to do it safely, and even faster, by avoiding fine-grained atomics entirely.
use std::{
    sync::{
        atomic::{AtomicUsize, Ordering},
        Arc,
    },
    thread::spawn,
    time::Instant,
};

struct Wrapper(*const AtomicUsize);
unsafe impl Send for Wrapper {}

// `test_ptr()` and `test_arc()` are identical except that
// the former uses *const AtomicUsize while the latter uses Arc<AtomicUsize>
fn test_ptr(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| AtomicUsize::new(0))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(v);
        handles.push(spawn(move || {
            let p = p;
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

// Same as `test_ptr()`, except each atomic is padded with seven u64s
// so that it occupies its own 64-byte cache line
fn test_ptr_padded(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| (AtomicUsize::new(0), 0u64, 0u64, 0u64, 0u64, 0u64, 0u64, 0u64))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(&v.0);
        handles.push(spawn(move || {
            let p = p;
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.0.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn test_arc(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| Arc::new(AtomicUsize::new(0)))
        .collect::<Vec<_>>();
    for v in &values {
        let p = v.clone();
        handles.push(spawn(move || {
            for _ in 0..count {
                p.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    const THREADS: usize = 10;
    const COUNT: usize = 0x100000;
    let now = Instant::now();
    test_ptr(THREADS, COUNT);
    println!("Using *const AtomicUsize took {:?}", now.elapsed());
    let now = Instant::now();
    test_ptr_padded(THREADS, COUNT);
    println!("Using *const AtomicUsize Padded took {:?}", now.elapsed());
    let now = Instant::now();
    test_arc(THREADS, COUNT);
    println!("Using Arc<AtomicUsize> took {:?}", now.elapsed());
}
Using *const AtomicUsize took 89.9593ms
Using *const AtomicUsize Padded took 16.5459ms
Using Arc<AtomicUsize> took 30.5055ms
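For what it's worth, the manual seven-u64 padding tuple can be replaced by a `#[repr(align(64))]` wrapper, and scoped threads (stable since Rust 1.63, so newer than the 1.58.1 toolchain used here) can borrow the vector directly, removing the raw-pointer `Wrapper` and the `unsafe` entirely. A sketch of that combination, assuming 64-byte cache lines:

```rust
use std::{
    sync::atomic::{AtomicUsize, Ordering},
    thread,
};

// Each counter gets its own 64-byte-aligned slot, so no two counters
// land on the same cache line (assuming 64-byte lines, typical on x86-64).
#[repr(align(64))]
struct Padded(AtomicUsize);

fn test_scoped_padded(threads: usize, count: usize) -> usize {
    let values: Vec<Padded> = (0..threads).map(|_| Padded(AtomicUsize::new(0))).collect();
    // Scoped threads may borrow `values` directly; they are all joined
    // when the scope ends, so no raw pointers or `unsafe impl Send` needed.
    thread::scope(|s| {
        for v in &values {
            s.spawn(move || {
                for _ in 0..count {
                    v.0.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    values.iter().map(|v| v.0.load(Ordering::Relaxed)).sum()
}

fn main() {
    let total = test_scoped_padded(4, 100_000);
    assert_eq!(total, 4 * 100_000);
    println!("total = {total}");
}
```

`crossbeam_utils::CachePadded` does the same alignment trick without a hand-rolled wrapper, if an external crate is acceptable.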
Here's a version with Rayon. Note that the compiler can now "cheat" the inner for loop, since it can optimize much better without atomics.
fn test_rayon(jobs: usize, count: usize) {
    use rayon::prelude::*;
    let mut values = (0..jobs as u64).collect::<Vec<_>>();
    values.par_iter_mut().for_each(|v| {
        for _ in 0..count {
            *v += 1;
        }
    });
}
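The same transformation can be sketched in plain std, without Rayon: each thread bumps an ordinary local counter (which the compiler is free to turn into a single addition) and results are only combined at join time, so there is no cross-thread cache traffic during the hot loop. This is roughly what "avoiding granular atomics" buys you:

```rust
use std::thread;

// Each worker increments a plain local counter; only the final value
// crosses threads (via the JoinHandle), so no atomics in the hot loop.
fn test_local_counters(threads: usize, count: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut local = 0usize;
                for _ in 0..count {
                    local += 1; // non-atomic, freely optimizable
                }
                local
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(test_local_counters(4, 100_000), 400_000);
}
```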
Hmm, I saw it, but the code does compile with rustc 1.58.1 (db9d1b20b 2022-01-20) on the command line. Can you try it? I wonder why the same code compiles with rustc but not on the playground.
I deleted my reply, as @Cocalus is obviously on the right track here, which the new examples prove.
I would in any case be wary of using a *const T this way. Atomics basically compile down to atomic instructions/memory barriers, which should be fine as far as thread safety goes, but I'm not 100% sure what guarantees you get, with respect to compiler optimizations, when using a *const T to a value that's both shared and changed (even if it's thread-safe to do so). It would be interesting to know the answer; I would think it's UB, but TBH I'm not sure.
It's safe here because there's no mutation through a unique reference; interior mutability is a distinct thing, and barring a compiler bug it is fine. Pointer invalidation is the big risk, but even though the `while values.iter()` loop potentially sits above the pointers' last use, it only reads (and the drop happens at the end, after the threads have finished).
If some mutation were added, you would likely get UB:
let mut a = AtomicUsize::new(0);
let p = &a as *const AtomicUsize;
let _r = &mut a;
let _ = unsafe { &*p }.fetch_add(1, Ordering::Relaxed);
Yeah, you're probably right. And Miri seems to agree that if you happen to add any mutation, it's definitely UB:
error: Undefined Behavior: trying to reborrow for SharedReadWrite at alloc993, but parent tag <untagged> does not have an appropriate item in the borrow stack
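For contrast, a sketch of the interior-mutability case discussed above, where only shared references coexist with the raw pointer: since AtomicUsize mutates through &self, creating another &a should not invalidate the pointer, and this variant should pass under Miri (the &mut in the previous snippet is what kills it):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let a = AtomicUsize::new(0);
    let p = &a as *const AtomicUsize;
    let _r = &a; // shared reborrow: fine, atomics mutate through &self
    let old = unsafe { &*p }.fetch_add(1, Ordering::Relaxed);
    assert_eq!(old, 0);
    assert_eq!(a.load(Ordering::Relaxed), 1);
}
```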