Atomic memory ordering question - regarding a thread "stop" flag

So, I have an application that starts multiple "worker" threads.

Usually, each "worker" thread simply finishes after it has done all of its assigned work. And the "main" thread will join() all of the "worker" threads to ensure that everything is completed.

But, sometimes, it can happen that one "worker" thread encounters an error and returns early. In this case, we want to signal to the other "worker" threads that they should abort ASAP, because it would not make sense to wait for the other "worker" threads to actually finish their work, when we already know that something failed. So what we need is some sort of global "stop" flag.

A common way to implement this is by using an AtomicBool:

static STOP: AtomicBool = AtomicBool::new(false);

fn worker_thread() {
    while !STOP.load(...) {
        /*
        do the next chunk of work here, if not stopped yet!
        */

        if error_happened {
            STOP.store(true, ...);
            break;
        }
    }
}

Now, the big question is: which Ordering should be used here?


I have read various sources on this topic, but the information is inconclusive :exploding_head:

Some sources say that Ordering::Relaxed is sufficient for this use case, because atomic operations are always "atomic"; the memory ordering only becomes relevant if we need to synchronize the atomic operation with other loads and/or stores. Since that is not the case here, we can simply use the weakest (fastest) memory ordering, i.e., Ordering::Relaxed.

But other sources claim that if we update (store) the "stop" flag with Ordering::Relaxed, or if we test (load) the "stop" flag with Ordering::Relaxed, then the updated value that is stored by one thread may not become "visible" to other threads for an indefinite amount of time – possibly never. To ensure that the updated value becomes visible to other threads soon, we need to do the store with Ordering::Release, and we need to do the loads with Ordering::Acquire.

Which statement is correct? :confused:

I would like to stick with Ordering::Relaxed, especially for the load, because this happens at high frequency in the "worker" threads, so we really want this to be as fast as possible. And it seems to work fine in my tests. But this, of course, is not a proof that it will always work...

Best regards.


The ordering does not change how quickly the value becomes visible to other threads. Relaxed is sufficient.
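To illustrate, here is a minimal, self-contained sketch of the pattern from the original post with Relaxed filled in. The worker count and the "fails after 10 chunks" error condition are hypothetical stand-ins for real work:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static STOP: AtomicBool = AtomicBool::new(false);

// Each worker polls the flag with Relaxed; the first worker that "fails"
// sets it, also with Relaxed. Nothing else is synchronized through STOP,
// so no happens-before edge is needed.
fn worker(id: usize) -> usize {
    let mut chunks_done = 0;
    while !STOP.load(Ordering::Relaxed) {
        chunks_done += 1; // stand-in for one chunk of real work

        // hypothetical error condition: worker 0 "fails" after 10 chunks
        if id == 0 && chunks_done == 10 {
            STOP.store(true, Ordering::Relaxed);
            break;
        }
    }
    chunks_done
}

fn run_workers(n: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..n)
        .map(|id| thread::spawn(move || worker(id)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The other workers keep looping until the store becomes visible to them, which on real hardware happens almost immediately.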


Is this really true for all CPU architectures?

I have read that x86 has a "strong" memory model, and therefore Relaxed just happens to work like Release/Acquire on that architecture. But this is going to break on architectures with a "weak" memory model, specifically ARM, where the updated value may not become "visible" to other threads for an indefinite amount of time (possibly never) when using Relaxed :worried:

For the low-level correctness of a multithreaded algorithm, it's best to make minimal assumptions about how threads work, and yeah, hypothetically, Ordering::Relaxed could never become visible to other threads. If Ordering::Relaxed being too slow could cause a race condition, that's bad.

But in practice, such a situation (never becoming visible from other threads) would be absurd.

Similarly, in general, any thread could be interrupted at any time and take an indefinitely long time to be scheduled to run again. But in practice, even though a thread scheduler is unlikely to do exactly what you want, it'll probably schedule threads well enough.

In other words, you can rely on the relaxed atomic working at a high level, so long as memory safety (and whatnot) isn't at risk from the exact timing behavior.


The memory model also allows a Release write to never become visible to any other thread. Memory ordering only guarantees the order in which changes to memory locations other than the one written/read become visible.

Worrying about that is wishful thinking, not part of reality. (Rust does not deal in the obscure.) You have a better chance of generating the same random UUID twice with known-good randomness, yet you should treat the random UUIDs your code generates as guaranteed to be unique. Errors, by contrast, are common (or uncommon but to be expected) occurrences, so good code always deals with them.

Acquire/Release makes no difference for what you describe. You can use atomic operations other than load/store to make the update more likely to become visible quickly, but in reality this will have no impact, or will just slow down your program.

p.s. It may be better not to use a global if the code is substantial and long-lived.

What is the conclusion of all of this? :thinking:


Again: In this use case (i.e., a simple thread "stop" flag), I do not need the "atomic" operation to have a specific order with respect to other load/store operations. In other words, I do not need the synchronize-with relationship, as it is depicted here:


(Source)

This would indicate that Ordering::Release/Ordering::Acquire is not required here, and therefore Ordering::Relaxed would be sufficient, right?
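For contrast, here is a sketch of the situation where the synchronize-with edge from the picture actually matters: publishing non-atomic data through the flag. All names here are made up for illustration; the unsafe cell access is sound only because the Release store and Acquire load order it:

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// A non-atomic payload, published to the reader through the READY flag.
struct Payload(UnsafeCell<i32>);
unsafe impl Sync for Payload {} // sound only because access is ordered by READY

static DATA: Payload = Payload(UnsafeCell::new(0));
static READY: AtomicBool = AtomicBool::new(false);

fn publish_and_read() -> i32 {
    let writer = thread::spawn(|| {
        unsafe { *DATA.0.get() = 42 };        // 1. write the payload
        READY.store(true, Ordering::Release); // 2. publish it: Release
    });
    // Acquire pairs with the Release store: once this load sees `true`,
    // the payload write is guaranteed to be visible as well.
    while !READY.load(Ordering::Acquire) {
        spin_loop();
    }
    let value = unsafe { *DATA.0.get() };
    writer.join().unwrap();
    value
}
```

With a bare stop flag there is no such payload, which is exactly why the edge is not needed.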


On the other hand, what I do need is that the updated STOP value that was stored by one thread becomes "visible" to the other threads soon™.

It does not need to happen immediately – a certain smallish delay is acceptable, especially if this allows the frequent checks (loads) to be implemented in a more efficient (faster) way. Nonetheless, the "visibility" of the new value to other threads must not be delayed for "too long" or even indefinitely, because this would clearly contradict the purpose of a STOP flag :grimacing:


What is still unclear to me:

  • Is delayed "visibility" of the updated STOP value (to other threads) even a real issue that I need to worry about? If it is not an issue on the x86 architecture, because of its "strong" memory model, can it still be a real issue on other architectures, such as ARM?

  • Does using Ordering::Release/Ordering::Acquire provide any benefit, compared to using Ordering::Relaxed, with respect to visibility delays? If it does not help (is not required) on the x86 architecture, does it maybe help on the ARM architecture?

  • If there is no proper way to avoid possible indefinite visibility delays with AtomicBool, then what would be the better alternative? Would RwLock<bool> be reasonable?

(I can only test on the x86 architecture, because I do not currently have an ARM PC at hand)

Regards.


Mutex and RwLock (and every other possible synchronization data structure) use atomics internally, so they can't be faster than just using atomics directly.

I just found GitHub - nviennot/core-to-core-latency: Measures the latency between CPU cores, and while I don't know whether their methodology is sound, the results of a couple of nanoseconds seem realistic to me. They also include results for Arm processors, and those don't differ much.

Sure, a Mutex<bool> or RwLock<bool> is slower (less efficient) than an AtomicBool, because they acquire an explicit lock on the "wrapped" variable before they let me access it. But a Mutex<bool> or RwLock<bool> should guarantee that, once I acquire the lock and read the value, I see the "latest" value. Meanwhile, reading (loading) an AtomicBool always happens immediately (lock free), but it may see an "old" value, e.g., from the CPU core's local cache. It is not clear how long it may take for the "atomic" load to actually see the "latest" (updated) value. Some sources say visibility of the update could be delayed "indefinitely" – possibly forever.

This doesn't matter here. They want a thread stop flag. This is set once and then the threads stop, so it doesn't change again.

A thread stop flag is the perfect use case for AtomicBool.

AFAIK, no. All ISAs guarantee that a relaxed write from one core will be eventually seen by relaxed reads from all other cores. The max delay is dependent on low level implementation details of CPU/memory (and in the case of threads on OS scheduler implementation), but for most practical purposes it can be considered small enough.

For example, from this SO answer:

A precise statement of this property is somewhat hard to find in the ARMv8 spec, but we do have in B2.7.1 "A write to a memory location with the Normal attribute completes in finite time", and further down, "Each Inner Shareability domain contains a set of observers that are data coherent for each member of that set for data accesses with the Inner Shareable attribute made by any member of that set.". Unfortunately they do not seem to give a precise definition of "data coherent", but it surely must include "loads eventually observe stores".


So, the bottom line is that the memory Ordering is completely unrelated to how quickly the update to the atomic variable will become "visible" to other threads – because the memory Ordering only concerns which "happens-before" guarantees we get, or do not get, once the update has become visible; it says nothing about the case where the update has not become visible yet. Therefore, if we do not need to create a synchronize-with edge, as in the picture above, then there is no reason to use anything other than Ordering::Relaxed. Correct?

Furthermore, in any case, and unrelated to the memory Ordering, there is no guarantee how quickly an update to the atomic variable will become "visible" to other threads. There is nothing in the standard that defines an upper bound to the delay. Hypothetically, this means that the "visibility" of the updated value could be delayed indefinitely – which means that, in the worst case, the update could never become "visible" to some threads.

Nonetheless, there appears to be a common understanding that, at least on all "modern" ISAs, there is some sort of "best effort" policy to make updates "visible" to all threads within a reasonable time frame. So, while we must not assume a fixed upper bound for the delay, it is safe to assume that the update will become "visible" reasonably soon. D'accord? :thinking:


Today, I have implemented a little benchmark program to measure the delay, and here are the results from my "x86" (64-bit) Linux system:

[relaxed][00000000] min: 00000111, mean: 00000479.9, stddev: 00029376.5, median: 00000161, max: 04304723
[relaxed][00000001] min: 00000110, mean: 00000433.6, stddev: 00031207.3, median: 00000160, max: 07186827
[relaxed][00000002] min: 00000113, mean: 00000567.5, stddev: 00013137.5, median: 00000160, max: 01317570
[relaxed][00000003] min: 00000113, mean: 00000367.4, stddev: 00020531.0, median: 00000160, max: 03129108
[relaxed][00000004] min: 00000111, mean: 00000483.9, stddev: 00010068.5, median: 00000160, max: 01417599
[acq+rel][00000000] min: 00000128, mean: 00000499.8, stddev: 00009869.3, median: 00000164, max: 01971176
[acq+rel][00000001] min: 00000127, mean: 00000471.5, stddev: 00008719.4, median: 00000165, max: 01433011
[acq+rel][00000002] min: 00000128, mean: 00000514.0, stddev: 00024493.9, median: 00000164, max: 05176405
[acq+rel][00000003] min: 00000126, mean: 00000337.8, stddev: 00005475.4, median: 00000164, max: 01142486
[acq+rel][00000004] min: 00000128, mean: 00000310.0, stddev: 00003482.5, median: 00000164, max: 00281860

And here are the results from a macOS (Apple M2) system that I had the chance to run the benchmark program on:

[relaxed][00000000] min: 00000000, mean: 00000124.8, stddev: 00000653.7, median: 00000083, max: 00030084
[relaxed][00000001] min: 00000000, mean: 00000088.3, stddev: 00000122.2, median: 00000083, max: 00015791
[relaxed][00000002] min: 00000000, mean: 00000086.1, stddev: 00000086.1, median: 00000083, max: 00011750
[relaxed][00000003] min: 00000041, mean: 00000087.2, stddev: 00000155.6, median: 00000083, max: 00020833
[relaxed][00000004] min: 00000000, mean: 00000092.4, stddev: 00000238.4, median: 00000083, max: 00015083
[acq+rel][00000000] min: 00000000, mean: 00000098.7, stddev: 00000189.5, median: 00000083, max: 00013334
[acq+rel][00000001] min: 00000000, mean: 00000088.8, stddev: 00000149.3, median: 00000083, max: 00010958
[acq+rel][00000002] min: 00000000, mean: 00000101.7, stddev: 00000308.4, median: 00000083, max: 00017042
[acq+rel][00000003] min: 00000000, mean: 00000088.4, stddev: 00000127.1, median: 00000083, max: 00013041
[acq+rel][00000004] min: 00000000, mean: 00000087.6, stddev: 00000131.3, median: 00000083, max: 00016625
Source
/* SPDX-License-Identifier: CC0-1.0 */

use rand_chacha::{ChaCha20Rng, rand_core::SeedableRng};
use scrypt::{
    Params, Scrypt,
    password_hash::{CustomizedPasswordHasher, phc::SaltString},
};
use std::{
    hint::{black_box, spin_loop},
    sync::{Arc, Barrier, LazyLock, Mutex, atomic::AtomicUsize},
    thread,
    time::Instant,
};

#[cfg(windows)]
use windows::Win32::Media::{timeBeginPeriod, timeEndPeriod};

#[cfg(feature = "acqrel")]
mod memory_ordering {
    use std::sync::atomic::Ordering;
    pub const STR: &str = "acq+rel";
    pub const ORDERING_LD: Ordering = Ordering::Acquire;
    pub const ORDERING_ST: Ordering = Ordering::Release;
}

#[cfg(not(feature = "acqrel"))]
mod memory_ordering {
    use std::sync::atomic::Ordering;
    pub const STR: &str = "relaxed";
    pub const ORDERING_LD: Ordering = Ordering::Relaxed;
    pub const ORDERING_ST: Ordering = Ordering::Relaxed;
}

#[inline]
fn do_some_work() {
    static PARAM: LazyLock<Params> = LazyLock::new(|| Params::new(5u8, 8u32, 1u32).unwrap());
    static RAND: LazyLock<Mutex<ChaCha20Rng>> = LazyLock::new(|| Mutex::new(ChaCha20Rng::seed_from_u64(42u64)));
    let salt_string = SaltString::from_rng(&mut RAND.lock().unwrap());
    let digest = Scrypt.hash_password_customized(b"my_secret", salt_string.as_bytes(), None, None, *PARAM).unwrap();
    black_box(digest);
}

fn mean_and_variance(values: &[u128]) -> (f64, f64) {
    let (mut count, mut mean, mut m2) = (0u64, 0.0f64, 0.0f64);
    for value in values.iter().map(|x| *x as f64) {
        count += 1u64;
        let delta = value - mean;
        mean += delta / (count as f64);
        m2 += delta * (value - mean);
    }
    (mean, m2 / ((count - 1u64) as f64))
}

fn thread_entry(ping: &AtomicUsize, pong: &AtomicUsize, barrier: &Barrier) {
    barrier.wait();
    for step in 0..LOOP_COUNT {
        while ping.load(memory_ordering::ORDERING_LD) != step {
            spin_loop();
        }
        pong.store(step, memory_ordering::ORDERING_ST);
    }
}

const LOOP_COUNT: usize = 100003usize;

fn main() {
    #[cfg(windows)]
    unsafe {
        timeBeginPeriod(1u32);
    }

    for counter in 0..=usize::MAX {
        let barrier = Arc::new(Barrier::new(2usize));
        let barrier_cloned = Arc::clone(&barrier);

        let ping = Arc::new(AtomicUsize::new(usize::MAX));
        let ping_cloned = Arc::clone(&ping);

        let pong = Arc::new(AtomicUsize::new(usize::MAX));
        let pong_cloned = Arc::clone(&pong);

        let thread = thread::spawn(move || thread_entry(&ping_cloned, &pong_cloned, &barrier_cloned));
        barrier.wait();

        let mut delays = Vec::with_capacity(LOOP_COUNT);
        for step in 0..LOOP_COUNT {
            do_some_work();
            let instant_started = Instant::now();
            ping.store(step, memory_ordering::ORDERING_ST);
            while pong.load(memory_ordering::ORDERING_LD) != step {
                spin_loop();
            }
            let elapsed = instant_started.elapsed();
            delays.push(elapsed.as_nanos());
        }

        thread.join().expect("Failed to join() the thread!");
        delays.sort();

        let (mean, variance) = mean_and_variance(delays.as_slice());
        let stddev = variance.sqrt();

        println!(
            "[{}][{:08}] min: {:08}, mean: {:010.1}, stddev: {:010.1}, median: {:08}, max: {:08}",
            memory_ordering::STR,
            counter,
            delays[0usize],
            mean,
            stddev,
            delays[delays.len() / 2usize],
            delays[delays.len() - 1usize]
        );
    }

    #[cfg(windows)]
    unsafe {
        timeEndPeriod(1u32);
    }
}

Memory ordering isn't about how fast other threads see changes, but about correctness. I will try to explain my (somewhat limited) understanding.

Performance:
Hardware (x86)
When you store something to memory (with a MOV, for example), it first goes to the CPU's store buffer, then the CPU pushes it to the shared cache, and the cache coherence protocol (e.g., MESI, MOESI) spreads it to the other cores' caches (the first moment other cores "see" the change). Eventually, the data hits RAM. It takes time.

Software (non real-time OS)
It is up to a scheduler to decide when to give a quantum of time to any thread. There are no guarantees.

Correctness (this is what a memory model about)
Hardware (x86)
CPU might reorder memory access (both load and store) as long as it has no observable effect.
(pseudocode)

a = 0
a = 42
b = 100
print(a, b)

CPU might set b first. Who cares, as long as both variables are set at the point print gets called?

At the same time, another core (our CPU isn't aware of) accesses b and sees b == 100 and a == 0 because of reordering. This is incorrect.

To fix it, x86 introduced memory barrier instructions:

  1. MFENCE: Everything that has already happened at this moment must be visible to all other cores. If another core sees b set, it must also see a == 42. No reordering, please. This is called "serialization" because all instructions are serialized.
  2. LFENCE: The same, but for loads from memory only. All loads are serialized, but not stores.
  3. SFENCE: "S" stands for "store", so you can probably guess the meaning.
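These hardware fences have a portable counterpart in std::sync::atomic::fence. As a sketch (names hypothetical), the same publish pattern as Release/Acquire stores and loads can be built from Relaxed accesses plus explicit fences:

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{fence, AtomicBool, Ordering};
use std::thread;

struct Slot(UnsafeCell<u64>);
unsafe impl Sync for Slot {} // sound only because access is ordered by FLAG + fences

static SLOT: Slot = Slot(UnsafeCell::new(0));
static FLAG: AtomicBool = AtomicBool::new(false);

fn publish_with_fences() -> u64 {
    let writer = thread::spawn(|| {
        unsafe { *SLOT.0.get() = 7 };        // write the payload
        fence(Ordering::Release);            // order the write before the flag
        FLAG.store(true, Ordering::Relaxed);
    });
    while !FLAG.load(Ordering::Relaxed) {
        spin_loop();
    }
    fence(Ordering::Acquire);                // order the flag before the read
    let v = unsafe { *SLOT.0.get() };
    writer.join().unwrap();
    v
}
```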

Rust atomic memory ordering (based heavily on C++)
The compiler also reorders, so we need a tool that constrains both compiler and CPU reordering. This is where the memory ordering of atomic operations comes into play.

  1. Relaxed: No fences, just the atomic instruction. Other threads see your atomic value, but might not see other things that happened before.
  2. Release and Acquire: The atomic variable acts like an SFENCE/LFENCE pair. Everything that happened before you wrote a new value with Release will be visible to a thread that reads your variable with Acquire.
  3. SeqCst is like MFENCE: everything is serialized.

In your particular case you do not care about other variables, so Relaxed should be ok.
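Another classic Relaxed-is-enough situation, sketched here with made-up numbers, is a plain shared event counter: atomicity guarantees no increments are lost, and no other memory is synchronized through the counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// A hypothetical event counter: Relaxed fetch_add never loses increments,
// because the read-modify-write itself is atomic.
fn count_events(threads: usize, per_thread: usize) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    }); // scope joins all threads before returning
    counter.into_inner()
}
```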


Thanks! That largely agrees with what I have come to understand.

Performance:
Hardware (x86)
When you store something to memory (with a MOV, for example), it first goes to the CPU's store buffer, then the CPU pushes it to the shared cache, and the cache coherence protocol (e.g., MESI, MOESI) spreads it to the other cores' caches (the first moment other cores "see" the change). Eventually, the data hits RAM. It takes time.

The problem, IMO, is that the standard does not mandate an upper bound on how long it may take until the update to the atomic variable becomes visible to other threads. As you say, memory ordering can only provide certain "happened-before" guarantees that apply once the update has become visible. So, if the standard does not mandate an upper limit on the delay, then it would be perfectly "allowable" for the hardware to indefinitely delay the visibility of the update.

Hypothetically, the update may never become visible.

There seems to be a common belief that, on real hardware, the update actually does become visible within some "reasonable" time frame. But, with regard to writing "portable" code, it would be great to have some sort of assertion (preferably in the standard) for this :slight_smile:

My tests seem to support that the delay is actually very small on real hardware – with some expected variance. But I can only test a few systems, which certainly is not a proper "proof".


You’re not guaranteed that your program will reach the next statement, either. How much effort do you spend addressing that?


That would be pretty hard to do. Consider a multi-CPU server system. Now it literally depends on the motherboard layout since that affects signal propagation delay. It will also depend on how much cache traffic there currently is between different parts.

But this delay is going to be a lot lower than the delays that OS thread scheduling introduces when it decides your thread shouldn't be run right now. You could be looking at something like several milliseconds there easily before you get to run again, depending on system load and configuration. So you are worrying about the wrong thing here.

Yes, this is true. However, hardware is designed not to annoy software developers, but to run software. High core-to-core latency is bad for performance, so hardware companies try very hard to avoid it. An infinite delay would break every piece of multithreaded software (which includes the OS), so you can be sure that no CPU being sold does that.
