Why is dereferencing Arc<AtomicUsize> much faster than dereferencing *const AtomicUsize?

I need to share an AtomicUsize between threads, and was surprised to find that dereferencing Arc<AtomicUsize> was about 4 times faster than dereferencing *const AtomicUsize. This is counterintuitive, because I expected them to have similar cost. Any explanations?

use std::{
    sync::{
        atomic::{AtomicUsize, Ordering},
        Arc,
    },
    thread::spawn,
    time::Instant,
};

struct Wrapper(*const AtomicUsize);
unsafe impl Send for Wrapper {}

// `test_ptr()` and `test_arc()` are identical except that
// the former uses *const AtomicUsize while the latter uses Arc<AtomicUsize>
fn test_ptr(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| AtomicUsize::new(0))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(v);
        handles.push(spawn(move || {
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn test_arc(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| Arc::new(AtomicUsize::new(0)))
        .collect::<Vec<_>>();
    for v in &values {
        let p = v.clone();
        handles.push(spawn(move || {
            for _ in 0..count {
                p.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    const THREADS: usize = 10;
    const COUNT: usize = 0x100000;
    let now = Instant::now();
    test_ptr(THREADS, COUNT);
    println!("Using *const AtomicUsize took {:?}", now.elapsed());
    let now = Instant::now();
    test_arc(THREADS, COUNT);
    println!("Using Arc<AtomicUsize> took {:?}", now.elapsed());
}

On release build:

Using *const AtomicUsize took 147.140065ms
Using Arc<AtomicUsize> took 32.4519ms

On debug build:

Using *const AtomicUsize took 304.676667ms
Using Arc<AtomicUsize> took 59.330349ms

I am using rustc 1.58.1 (db9d1b20b 2022-01-20)

You have false sharing. Atomics effectively operate at cache-line granularity (typically 64 bytes, aligned), so atomics that sit right next to each other in memory contend with one another when they land on the same cache line. Each Arc is a separate heap allocation that also carries its reference counts (making it bigger), so those values are very unlikely to share a cache line.

Padding the vector of atomics so that each atomic lands on its own cache line fixes the performance and beats Arc, but you still have unsafe code. Depending on what you're trying to do, Rayon may be able to do it safely, and even faster, by avoiding fine-grained atomics entirely.

use std::{
    sync::{
        atomic::{AtomicUsize, Ordering},
        Arc,
    },
    thread::spawn,
    time::Instant,
};

struct Wrapper(*const AtomicUsize);
unsafe impl Send for Wrapper {}

// `test_ptr()` and `test_arc()` are identical except that
// the former uses *const AtomicUsize while the latter uses Arc<AtomicUsize>
fn test_ptr(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| AtomicUsize::new(0))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(v);
        handles.push(spawn(move || {
            let p = p; // capture the whole `Wrapper` (which is Send), not just the raw pointer field
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

// `test_ptr_padded()` is the same as `test_ptr()`, except each atomic is
// padded out to a full cache line so neighbouring counters don't interfere
fn test_ptr_padded(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        // pad each entry out to 64 bytes (8 + 7 * 8) so no two atomics share a cache line
        .map(|_| (AtomicUsize::new(0), 0u64, 0u64, 0u64, 0u64, 0u64, 0u64, 0u64))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(&v.0);
        handles.push(spawn(move || {
            let p = p; // capture the whole `Wrapper` (which is Send), not just the raw pointer field
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.0.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn test_arc(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .into_iter()
        .map(|_| Arc::new(AtomicUsize::new(0)))
        .collect::<Vec<_>>();
    for v in &values {
        let p = v.clone();
        handles.push(spawn(move || {
            for _ in 0..count {
                p.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    while values.iter().any(|v| v.load(Ordering::Relaxed) < count) {}
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    const THREADS: usize = 10;
    const COUNT: usize = 0x100000;
    let now = Instant::now();
    test_ptr(THREADS, COUNT);
    println!("Using *const AtomicUsize took {:?}", now.elapsed());
    let now = Instant::now();
    test_ptr_padded(THREADS, COUNT);
    println!("Using *const AtomicUsize Padded took {:?}", now.elapsed());
    let now = Instant::now();
    test_arc(THREADS, COUNT);
    println!("Using Arc<AtomicUsize> took {:?}", now.elapsed());
}

Using *const AtomicUsize took 89.9593ms
Using *const AtomicUsize Padded took 16.5459ms
Using Arc<AtomicUsize> took 30.5055ms
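
As an aside, instead of padding with dummy u64 fields you can state the intent directly with an alignment attribute. Here's a sketch of mine (not benchmarked, and it assumes a 64-byte cache line, which matches most current x86-64 CPUs); it drops into the same file and reuses `Wrapper`. The crossbeam_utils crate also provides CachePadded<T> for exactly this purpose, with the alignment chosen per target.

// Assumption: 64-byte cache lines. Each `Padded` is then 64 bytes large and
// 64-byte aligned, so no two counters can share a cache line.
#[repr(align(64))]
struct Padded(AtomicUsize);

fn test_ptr_aligned(threads: usize, count: usize) {
    let mut handles = Vec::with_capacity(threads);
    let values = (0..threads)
        .map(|_| Padded(AtomicUsize::new(0)))
        .collect::<Vec<_>>();
    for v in &values {
        let p = Wrapper(&v.0);
        handles.push(spawn(move || {
            let p = p;
            for _ in 0..count {
                (unsafe { &*p.0 }).fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}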


Here's a version with Rayon. Note that the compiler can now optimize the inner for loop away, since it can optimize much better without atomics.

fn test_rayon(jobs: usize, count: usize) {
    use rayon::prelude::*;

    let mut values = (0..jobs as u64).collect::<Vec<_>>();

    // with plain u64s the optimizer can collapse this loop into `*v += count as u64`
    values.par_iter_mut().for_each(|v| {
        for _ in 0..count {
            *v += 1;
        }
    });
}

Using Rayon took 338.7µs
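
rayon is an external crate, so this assumes something like rayon = "1" in Cargo.toml. The timing hookup isn't shown above; presumably it mirrors the other tests, roughly like this (my sketch, not from the post):

use std::time::Instant;

fn main() {
    const THREADS: usize = 10;
    const COUNT: usize = 0x100000;
    let now = Instant::now();
    test_rayon(THREADS, COUNT);
    println!("Using Rayon took {:?}", now.elapsed());
}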

Wow, I didn't think of that. Great answer!

Hmm, I saw it, but the code does compile with rustc 1.58.1 (db9d1b20b 2022-01-20) on the command line. Can you try it? I wonder why the same code compiles with rustc locally but not on the playground.

I fixed the issue in my version; I think it comes from the closure capture change in Rust 2021.

I deleted my reply, as @Cocalus is obviously on the right track here, which the new examples prove.

I would in any case be wary of using a *const T this way. Atomics basically compile down to atomic instructions/memory barriers, which is fine as far as thread safety goes, but I'm not 100% sure what guarantees you get, with respect to compiler optimizations, when going through a *const T to a value that is both shared and modified (even if it's thread-safe to do so). It would be interesting to know the answer to that (I would guess it's UB, but to be honest I'm not sure).

It's sound here because nothing is mutated through a plain shared reference: interior mutability via atomics is a distinct mechanism, and barring a compiler bug it is fine. Still, better not to risk it.
Pointer invalidation is the bigger hazard, but even though the `while values.iter()` loop potentially runs while the threads still hold raw pointers into the vector, it only reads (and the vector is dropped at the very end, after the threads have finished).

If some mutation through a unique reference were added, you would likely get UB:

    let mut a = AtomicUsize::new(0);
    let p = &a as *const AtomicUsize;
    let _r = &mut a; // creating a unique reference invalidates the raw pointer `p`
    let _ = unsafe { &*p }.fetch_add(1, Ordering::Relaxed); // UB: `p` is used after being invalidated

Indeed, in the new edition your closure captures only the pointer in the field, not the whole wrapper, because only the field is accessed.

You can select the 2018 edition in the playground and the original code works.
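
Here is a minimal, self-contained sketch of that capture difference, with a Wrapper shaped like the one above (my illustration, not from the thread):

use std::thread::spawn;

struct Wrapper(*const u8);
unsafe impl Send for Wrapper {}

fn main() {
    let x = 0u8;
    let w = Wrapper(&x);

    // Edition 2021 would capture only the used field `w.0`, a raw pointer,
    // which is not `Send`, so this fails to compile there (2018 accepts it):
    //     spawn(move || { let _ = w.0; }).join().unwrap();

    // Rebinding the whole struct forces the entire (Send) `Wrapper` to be
    // captured, so this compiles on both editions:
    spawn(move || {
        let w = w;
        let _ = w.0;
    })
    .join()
    .unwrap();
}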

This is BTW also a nice case for demonstrating how well the migration warnings and fixes work: see the output of this.


Yeah, you're probably right. And Miri seems to agree that adding any mutation makes it definitely UB:

error: Undefined Behavior: trying to reborrow for SharedReadWrite at alloc993, but parent tag <untagged> does not have an appropriate item in the borrow stack
