Question about atomics memory ordering: SeqCst needed or not?

I stumbled upon a problem with atomics memory ordering that I can't figure out: is the code actually correct, and if not, why? I tried to distill the essence in the code below. (Disclaimer: it's an artificial example and probably not the best code; I'm still on a learning journey.)

The gist of the program is: two threads modify data (just increment it) in a racy manner, but the critical section is protected by a lock consisting of two counters, one per thread, where an odd value represents the 'locked' state and an even value the 'unlocked' state. From my understanding of the memory ordering docs for atomics [std::memory_order - cppreference.com], the program should be correct and all assertions should succeed.

If I run this program, it succeeds on the playground and on my testing machine. The thread sanitizer (-Zsanitizer=thread) and the address sanitizer (-Zsanitizer=address) are also happy. Only under Miri does the program fail: the interpreter finds an order of execution in which the assertions fail.

Only if I change the lock operation (which is vec_lock[myself].fetch_add(1, Ordering::AcqRel);) and the check (which is vec_lock[other].load(Ordering::Acquire) % 2 == 1) to use SeqCst does Miri succeed as well.

However, I can't shake the feeling that the program as shown below is correct as is (independent of architecture or anything else, purely according to the spec), because I think the read/write barriers should be sufficient here to ensure correct ordering and inter-thread visibility of the lock and the check. That would mean Miri reports a false positive here, and that's where I have doubts.

Can anybody help to clarify and correct my mental model of memory ordering? Thank you in advance!

use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering;
use std::thread;
use std::time::Duration;

fn main() {
    let vec_lock = Arc::new(vec![AtomicUsize::new(0), AtomicUsize::new(0)]);
    let data = Arc::new(AtomicUsize::new(0)); // data to be modified with increments by one
    let stop = Arc::new(AtomicBool::new(false));

    let two_threads = (0usize..2)
        .map(|myself| {
            let vec_lock = vec_lock.clone();
            let data = data.clone();
            let stop = stop.clone();
            thread::spawn(move || {
                let mut success: usize = 0;
                let mut failure: usize = 0;
                let other = (myself + 1) % 2;
                while !stop.load(Ordering::Relaxed) {
                    vec_lock[myself].fetch_add(1, Ordering::AcqRel); // lock (even -> odd)

                    if vec_lock[other].load(Ordering::Acquire) % 2 == 1 {
                        // other is already locked (We count this just to see, if threads really interleave.)
                        failure += 1;
                    } else {
                        // critical section: perform a non-atomic increment (to provoke and test for race condition)
                        let val = data.load(Ordering::Relaxed);
                        // thread::sleep(Duration::from_millis(10)); // sleep, if you want more confidence
                        assert_eq!(val, data.swap(val + 1, Ordering::Relaxed));
                        success += 1;
                    }

                    vec_lock[myself].fetch_add(1, Ordering::Release); // unlock (odd -> even)
                }
                (success, failure)
            })
        })
        .collect::<Vec<_>>();

    thread::sleep(Duration::from_millis(1000));
    stop.store(true, Ordering::Relaxed);
    let sum = two_threads
        .into_iter()
        .enumerate()
        .map(|(id, t)| {
            let (success, failure) = t.join().unwrap();
            println!("{id}: {success} {failure}");
            success
        })
        .sum::<usize>();

    println!("{}", data.load(Ordering::Acquire));
    assert_eq!(sum, data.load(Ordering::Acquire));
}

(Playground)

It somehow succeeded when I changed the locked check to vec_lock[other].fetch_add(0, Ordering::Acquire) % 2 == 0. No explanation yet.

Ah, that explains it.

Those two reads can apparently be reordered.

Miri is right.

thread-0 set lock-0 to odd
thread-0 read lock-1 as even
thread-0 read val N
thread-1 set lock-1 to odd
thread-1 read lock-0 as even (there is no requirement to read the latest value)
thread-1 read val N
thread-1 swap set val to N+1
thread-0 swap N+1 and get back N+1

Note: on hardware the two locks are on the same cache line, so they will always be in sync with matched values.


Well, I think the two loads cannot be reordered due to the Acquire, see [std::memory_order - cppreference.com] > Constants > memory_order_acquire:

A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread (see Release-Acquire ordering below).

The "latest" is a bit vague. You're right that executions can be delayed, but, in my understanding, correctness comes from the ordering guarantees. Therefore, the first thread succeeds, but the second (and likewise any other thread if we run the same program with more threads) cannot, because the lock write (i.e., even to odd) is visible at the check, see [std::memory_order - cppreference.com] > Constants > memory_order_release:

A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) ...

(This is irrelevant in Rust because we don't have Ordering::Consume) ... and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).

In my understanding, this means (given x == y == 0): if thread T1 executes x.store(42, Relaxed); y.store(1, Release);, then thread T2 will always succeed with if y.load(Acquire) == 1 { assert_eq!(x.load(Relaxed), 42); }. That is, I understand "all writes" to include the x.store() before the Release.

but is allowed to see y.load(Acquire) == 0, which a read-modify-write operation prevents.

Cache-line alignment doesn't change much here; it can only make the program faster by avoiding false sharing.

Also, as I mentioned, it should work for any number of threads, so I modified the example accordingly, with the same behavior as described in the OP, i.e., Miri fails while all real executions and sanitizers succeed:

use std::ops::Deref;
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering;
use std::thread;
use std::time::Duration;

fn main() {
    #[repr(align(128))]
    struct AliAtomicUsize(AtomicUsize);

    impl AliAtomicUsize {
        fn new(value: usize) -> Self {
            AliAtomicUsize(AtomicUsize::new(value))
        }
    }

    impl Deref for AliAtomicUsize {
        type Target = AtomicUsize;
        fn deref(&self) -> &Self::Target {
            &self.0
        }
    }

    const NUM_THREADS: usize = 3;
    let vec_lock = Arc::new(
        (0..NUM_THREADS)
            .map(|_| AliAtomicUsize::new(0))
            .collect::<Vec<_>>(),
    );
    let data = Arc::new(AtomicUsize::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let threads = (0usize..NUM_THREADS)
        .map(|myself| {
            let vec_lock = vec_lock.clone();
            let data = data.clone();
            let stop = stop.clone();
            thread::spawn(move || {
                let mut success: usize = 0;
                let mut failure: usize = 0;
                while !stop.load(Ordering::Relaxed) {
                    vec_lock[myself].fetch_add(1, Ordering::AcqRel); // lock (even -> odd)

                    if (0..NUM_THREADS)
                        .filter_map(|id| match id {
                            other if other == myself => None,
                            other => Some(vec_lock[other].load(Ordering::Acquire) % 2 == 1),
                        })
                        .any(|b| b)
                    {
                        // other is already locked (We count this just to see, if threads really interleave.)
                        failure += 1;
                    } else {
                        // critical section: perform a non-atomic increment (to provoke and test for race condition)
                        let val = data.load(Ordering::Relaxed);
                        thread::sleep(Duration::from_millis(1));
                        let _val = data.swap(val + 1, Ordering::Relaxed);
                        assert_eq!(val, _val);
                        success += 1;
                    }

                    vec_lock[myself].fetch_add(1, Ordering::Release); // unlock (odd -> even)
                }
                (success, failure)
            })
        })
        .collect::<Vec<_>>();

    thread::sleep(Duration::from_millis(1000));
    stop.store(true, Ordering::Relaxed);
    let sum = threads
        .into_iter()
        .enumerate()
        .map(|(id, t)| {
            let (success, failure) = t.join().unwrap();
            println!("{id}: {success} {failure}");
            success
        })
        .sum::<usize>();

    println!("{}", data.load(Ordering::Acquire));
    assert_eq!(sum, data.load(Ordering::Acquire));
}

(Playground)

Output:

0: 42 2528117
1: 54 2605109
2: 32 3154568
128


Right, but that's not the point. The crucial part is that if thread 2 reads y == 1 it must see x == 42. If it reads y == 0 it may see x == 0 or x == 42.

The sanitizers can only uncover a subset of issues. The same goes for Miri, but in this case it does catch your code.

Apply this to your original.
The "must see" is the failure-counting branch.

Even though you didn't give an explanation, I think your answer is right, so I will mark it as such. For anybody stumbling upon a similar issue, here is my explanation of what was wrong in my mental model:

Given the following execution of threads T1 and T2 where initially x == y == 0:

  • T1: x.fetch_add(1, AcqRel); let r1 = y.load(Acquire);
  • T2: y.fetch_add(1, AcqRel); let r2 = x.load(Acquire);

I assumed we cannot observe r1 == r2 == 0, but in fact we can (on a weakly ordered architecture such as ARM). All the executions on x86, including the sanitizer runs, were correct, but only because x86 is strongly ordered: its cache coherence enforces that the load reads the result of the fetch_add operation even across threads. However, since the language definition of C++/Rust is independent of the architecture, it is undefined behavior from the language perspective, and Miri correctly triggers a failure. In conclusion, Miri is right that the Rust program is unsound (even if it could never fail on x86).

Some deeper explanation of the subtle ordering issue:

The example execution above with fetch_add differs from the typical standard example using a store/load combination (again, initially x == y == 0):

  • T1: x.store(1, Release); let r1 = y.load(Acquire);
  • T2: y.store(1, Release); let r2 = x.load(Acquire);

Here, the combination of Release and Acquire actually allows reordering of the store and the load within each thread, which therefore allows any combination of r1, r2 in {0, 1}, including r1 == r2 == 0. And for this store/load combination, even on x86! (You can test and provoke it with the code from the OP by exchanging the fetch_add for store(1, Release) to lock and store(0, Release) to unlock. It will then fail the assertions with some probability also on x86.)

In contrast, the fetch_add version from above acts as a read barrier, such that the load cannot be reordered/sequenced before the fetch_add. On a strongly ordered architecture such as x86, the read and write barriers of the fetch_add are (or at least seem to me to be) sufficient for the program to always be correct, because cache coherence ensures that subsequent loads read the latest value. I imagine it as the MESI protocol kicking in to flush the CPU's write buffer and (I)nvalidate the cache line for the other threads before the fetch_add completes, so that subsequent loads read the up-to-date value. This happens on x86 with the barrier that comes with AcqRel, but on ARM it takes an additional instruction that is emitted only for SeqCst atomics. So @jonh was right that, per the spec, both threads can see non-up-to-date values (for "the success branch of the lock") if the operations are not SeqCst, at least on a weakly ordered architecture. The important part in [std::memory_order - cppreference.com] > Release-Acquire Ordering is:

This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.

If you see anything wrong in my explanation, feel free to leave a comment/correction.

Note: I don't give SO-style answers (especially when there are multiple replies); as such, reading the prior posts is needed.

Just a pedantic nit.

It (as shown) is not UB (there is no data race), only a race condition; hence it needs no unsafe code.

An alternative explanation: the T1 x.fetch_add with Acquire synchronizes with the thread (Z) that set x to 0 with Release, such that subsequent loads will give a value for y that is at least as up to date as what thread Z observed at the time it stored the 0 to x. I.e., there is no synchronizes-with relation to T2.