So, the bottom line is that the memory Ordering is completely unrelated to how quickly an update to the atomic variable becomes "visible" to other threads, because the memory Ordering only concerns which "happens-before" guarantees we get, or do not get, once the update has become visible; it says nothing about the case where the update has not become visible yet. Therefore, if we do not need to create a synchronizes-with edge, as in the picture above, then there is no reason to use anything other than Ordering::Relaxed. Correct?
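To make the distinction concrete, here is a minimal sketch (the function names are made up for illustration). The first function needs a synchronizes-with edge, because the reader must see the payload that was written before the flag. The second one only needs the final count, so Relaxed is enough there.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Publishes a payload through a flag. The Release store to `ready`
/// synchronizes-with the Acquire load that observes `true`, so the earlier
/// store to `payload` happens-before the reader's load of `payload`.
fn publish_with_release_acquire() -> u64 {
    let payload = AtomicU64::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        s.spawn(|| {
            payload.store(42, Ordering::Relaxed);
            ready.store(true, Ordering::Release);
        });
        while !ready.load(Ordering::Acquire) {
            spin_loop();
        }
        // Guaranteed to read 42 because of the edge created above.
        payload.load(Ordering::Relaxed)
    })
}

/// Counts events from several threads. No other memory depends on the
/// counter, so no synchronizes-with edge is needed and Relaxed suffices.
fn count_with_relaxed(threads: usize, increments: u64) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..increments {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // Leaving the scope joins all threads, which orders their writes before this load.
    counter.load(Ordering::Relaxed)
}

fn main() {
    println!("payload seen by reader: {}", publish_with_release_acquire());
    println!("final counter value: {}", count_with_relaxed(4, 1000));
}
```

If the flag in the first function used Relaxed for both the store and the load, the memory model would permit the reader to see `ready == true` and still read a stale `payload`. How soon the flag itself becomes visible is a separate question in both variants.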
Furthermore, in any case, and unrelated to the memory Ordering, there is no guarantee how quickly an update to the atomic variable will become "visible" to other threads. Nothing in the memory model (Rust follows the C++ one here) defines an upper bound on the delay. Hypothetically, this means that the "visibility" of the updated value could be delayed indefinitely, which means that, in the worst case, the update could never become "visible" to some threads.
Nonetheless, there appears to be a common understanding that, at least on all "modern" ISAs, there is some sort of "best effort" policy to make updates "visible" to all threads within a reasonable time frame. So, while we must not assume a fixed upper bound for the delay, it is safe to assume that the update will become "visible" reasonably soon. D'accord? 
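As far as I can tell, the C++ standard only recommends (with "should", not "shall") that atomic stores become visible to atomic loads within a reasonable amount of time. The sketch below relies on exactly that best-effort behavior: a worker polls a Relaxed stop flag and the main thread waits for it to notice. The function name and the 10 ms warm-up sleep are arbitrary choices for illustration.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Sets a Relaxed stop flag and returns how often the worker polled it,
/// plus the time from the store until the worker thread has been joined.
fn stop_with_relaxed_flag() -> (u64, Duration) {
    let stop = AtomicBool::new(false);
    thread::scope(|s| {
        let worker = s.spawn(|| {
            let mut polls = 0u64;
            // No bound on when `true` shows up is guaranteed; in practice it is prompt.
            while !stop.load(Ordering::Relaxed) {
                polls += 1;
                spin_loop();
            }
            polls
        });
        // Give the worker some time to start spinning (arbitrary value).
        thread::sleep(Duration::from_millis(10));
        let started = Instant::now();
        stop.store(true, Ordering::Relaxed);
        let polls = worker.join().expect("worker panicked");
        (polls, started.elapsed())
    })
}

fn main() {
    let (polls, elapsed) = stop_with_relaxed_flag();
    println!("worker polled {polls} times, stopped {elapsed:?} after the store");
}
```

The reported time includes the cost of the thread exiting and being joined, so it is only an upper estimate of the visibility delay, not a measurement of it.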
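Today, I implemented a little benchmark program to measure the delay, and here are the results from my "x86" (64-bit) Linux system. Each value is a round-trip time in nanoseconds: the main thread stores to one atomic and spins until the second thread echoes the value back through another atomic: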
[relaxed][00000000] min: 00000111, mean: 00000479.9, stddev: 00029376.5, median: 00000161, max: 04304723
[relaxed][00000001] min: 00000110, mean: 00000433.6, stddev: 00031207.3, median: 00000160, max: 07186827
[relaxed][00000002] min: 00000113, mean: 00000567.5, stddev: 00013137.5, median: 00000160, max: 01317570
[relaxed][00000003] min: 00000113, mean: 00000367.4, stddev: 00020531.0, median: 00000160, max: 03129108
[relaxed][00000004] min: 00000111, mean: 00000483.9, stddev: 00010068.5, median: 00000160, max: 01417599
[acq+rel][00000000] min: 00000128, mean: 00000499.8, stddev: 00009869.3, median: 00000164, max: 01971176
[acq+rel][00000001] min: 00000127, mean: 00000471.5, stddev: 00008719.4, median: 00000165, max: 01433011
[acq+rel][00000002] min: 00000128, mean: 00000514.0, stddev: 00024493.9, median: 00000164, max: 05176405
[acq+rel][00000003] min: 00000126, mean: 00000337.8, stddev: 00005475.4, median: 00000164, max: 01142486
[acq+rel][00000004] min: 00000128, mean: 00000310.0, stddev: 00003482.5, median: 00000164, max: 00281860
And here are the results from a macOS (Apple M2) system that I had the chance to run the benchmark program on:
[relaxed][00000000] min: 00000000, mean: 00000124.8, stddev: 00000653.7, median: 00000083, max: 00030084
[relaxed][00000001] min: 00000000, mean: 00000088.3, stddev: 00000122.2, median: 00000083, max: 00015791
[relaxed][00000002] min: 00000000, mean: 00000086.1, stddev: 00000086.1, median: 00000083, max: 00011750
[relaxed][00000003] min: 00000041, mean: 00000087.2, stddev: 00000155.6, median: 00000083, max: 00020833
[relaxed][00000004] min: 00000000, mean: 00000092.4, stddev: 00000238.4, median: 00000083, max: 00015083
[acq+rel][00000000] min: 00000000, mean: 00000098.7, stddev: 00000189.5, median: 00000083, max: 00013334
[acq+rel][00000001] min: 00000000, mean: 00000088.8, stddev: 00000149.3, median: 00000083, max: 00010958
[acq+rel][00000002] min: 00000000, mean: 00000101.7, stddev: 00000308.4, median: 00000083, max: 00017042
[acq+rel][00000003] min: 00000000, mean: 00000088.4, stddev: 00000127.1, median: 00000083, max: 00013041
[acq+rel][00000004] min: 00000000, mean: 00000087.6, stddev: 00000131.3, median: 00000083, max: 00016625
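One possible reading of the `min: 00000000` entries and the recurring values 41 and 83 in the macOS numbers is that the clock behind `Instant` advances in steps of roughly 41 to 42 ns on that machine. That is an assumption drawn from the output, not something verified. A minimal sketch to check the smallest step that `Instant` can resolve on a given system (the function name is hypothetical):

```rust
use std::time::Instant;

/// Returns the smallest non-zero difference, in nanoseconds, between two
/// consecutive `Instant::now()` readings, or `None` if none was observed.
fn smallest_observable_tick(samples: usize) -> Option<u128> {
    let mut smallest: Option<u128> = None;
    let mut previous = Instant::now();
    for _ in 0..samples {
        let current = Instant::now();
        let delta = current.duration_since(previous).as_nanos();
        // Zero deltas mean both readings fell into the same clock tick.
        if delta > 0 {
            smallest = Some(smallest.map_or(delta, |s| s.min(delta)));
        }
        previous = current;
    }
    smallest
}

fn main() {
    match smallest_observable_tick(1_000_000) {
        Some(tick) => println!("smallest non-zero tick: {tick} ns"),
        None => println!("no non-zero tick observed"),
    }
}
```

If the reported tick is of the same order as the measured medians, differences between the relaxed and acq+rel runs on that system may be below what the timer can resolve.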
Source
/* SPDX-License-Identifier: CC0-1.0 */
use rand_chacha::{ChaCha20Rng, rand_core::SeedableRng};
use scrypt::{
    Params, Scrypt,
    password_hash::{CustomizedPasswordHasher, phc::SaltString},
};
use std::{
    hint::{black_box, spin_loop},
    sync::{Arc, Barrier, LazyLock, Mutex, atomic::AtomicUsize},
    thread,
    time::Instant,
};
#[cfg(windows)]
use windows::Win32::Media::{timeBeginPeriod, timeEndPeriod};
// The orderings are selected at compile time via the "acqrel" feature.
#[cfg(feature = "acqrel")]
mod memory_ordering {
    use std::sync::atomic::Ordering;
    pub const STR: &str = "acq+rel";
    pub const ORDERING_LD: Ordering = Ordering::Acquire;
    pub const ORDERING_ST: Ordering = Ordering::Release;
}
// Default build: every load and store uses Relaxed.
#[cfg(not(feature = "acqrel"))]
mod memory_ordering {
    use std::sync::atomic::Ordering;
    pub const STR: &str = "relaxed";
    pub const ORDERING_LD: Ordering = Ordering::Relaxed;
    pub const ORDERING_ST: Ordering = Ordering::Relaxed;
}
/// Keeps the main thread busy between two measurements by computing one scrypt hash.
#[inline]
fn do_some_work() {
    static PARAM: LazyLock<Params> = LazyLock::new(|| Params::new(5u8, 8u32, 1u32).unwrap());
    static RAND: LazyLock<Mutex<ChaCha20Rng>> =
        LazyLock::new(|| Mutex::new(ChaCha20Rng::seed_from_u64(42u64)));
    let salt_string = SaltString::from_rng(&mut RAND.lock().unwrap());
    let digest = Scrypt
        .hash_password_customized(b"my_secret", salt_string.as_bytes(), None, None, *PARAM)
        .unwrap();
    // Prevent the compiler from optimizing the hash computation away.
    black_box(digest);
}
/// Welford's online algorithm: returns (mean, sample variance).
/// Expects at least two values, because the variance divides by `count - 1`.
fn mean_and_variance(values: &[u128]) -> (f64, f64) {
    let (mut count, mut mean, mut m2) = (0u64, 0.0f64, 0.0f64);
    for value in values.iter().map(|x| *x as f64) {
        count += 1u64;
        let delta = value - mean;
        mean += delta / (count as f64);
        m2 += delta * (value - mean);
    }
    (mean, m2 / ((count - 1u64) as f64))
}
/// Responder thread: spins until `ping` holds the current step, then echoes it via `pong`.
fn thread_entry(ping: &AtomicUsize, pong: &AtomicUsize, barrier: &Barrier) {
    barrier.wait();
    for step in 0..LOOP_COUNT {
        while ping.load(memory_ordering::ORDERING_LD) != step {
            spin_loop();
        }
        pong.store(step, memory_ordering::ORDERING_ST);
    }
}
const LOOP_COUNT: usize = 100003usize;
fn main() {
    #[cfg(windows)]
    unsafe {
        timeBeginPeriod(1u32);
    }
    // Each iteration is one independent run and prints one result line.
    for counter in 0..=usize::MAX {
        let barrier = Arc::new(Barrier::new(2usize));
        let barrier_cloned = Arc::clone(&barrier);
        let ping = Arc::new(AtomicUsize::new(usize::MAX));
        let ping_cloned = Arc::clone(&ping);
        let pong = Arc::new(AtomicUsize::new(usize::MAX));
        let pong_cloned = Arc::clone(&pong);
        let thread =
            thread::spawn(move || thread_entry(&ping_cloned, &pong_cloned, &barrier_cloned));
        barrier.wait();
        let mut delays = Vec::with_capacity(LOOP_COUNT);
        for step in 0..LOOP_COUNT {
            do_some_work();
            // Round trip: store to `ping`, then spin until `pong` echoes the same step.
            let instant_started = Instant::now();
            ping.store(step, memory_ordering::ORDERING_ST);
            while pong.load(memory_ordering::ORDERING_LD) != step {
                spin_loop();
            }
            let elapsed = instant_started.elapsed();
            delays.push(elapsed.as_nanos());
        }
        thread.join().expect("Failed to join() the thread!");
        // Sort so that min, median, and max can be read by index.
        delays.sort();
        let (mean, variance) = mean_and_variance(delays.as_slice());
        let stddev = variance.sqrt();
        println!(
            "[{}][{:08}] min: {:08}, mean: {:010.1}, stddev: {:010.1}, median: {:08}, max: {:08}",
            memory_ordering::STR,
            counter,
            delays[0usize],
            mean,
            stddev,
            delays[delays.len() / 2usize],
            delays[delays.len() - 1usize]
        );
    }
    #[cfg(windows)]
    unsafe {
        timeEndPeriod(1u32);
    }
}
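The statistics in the program come from a single-pass (Welford-style) update. As a sanity check, the sketch below compares that update against a textbook two-pass computation. The `welford` function mirrors `mean_and_variance` from the listing, `two_pass` is a helper added only for this comparison, and the sample values are made up, not measured.

```rust
/// Single-pass (Welford) mean and sample variance, same update rule as
/// `mean_and_variance` in the benchmark source.
fn welford(values: &[u128]) -> (f64, f64) {
    let (mut count, mut mean, mut m2) = (0u64, 0.0f64, 0.0f64);
    for value in values.iter().map(|x| *x as f64) {
        count += 1;
        let delta = value - mean;
        mean += delta / (count as f64);
        m2 += delta * (value - mean);
    }
    (mean, m2 / ((count - 1) as f64))
}

/// Textbook two-pass mean and sample variance, used as the reference.
fn two_pass(values: &[u128]) -> (f64, f64) {
    let n = values.len() as f64;
    let mean = values.iter().map(|x| *x as f64).sum::<f64>() / n;
    let squared: f64 = values.iter().map(|x| (*x as f64 - mean).powi(2)).sum();
    (mean, squared / (n - 1.0))
}

fn main() {
    // Made-up sample (not measured data): four fast round trips and one outlier.
    let sample: [u128; 5] = [110, 160, 160, 165, 5000];
    let (mean_a, var_a) = welford(&sample);
    let (mean_b, var_b) = two_pass(&sample);
    println!("welford:  mean = {mean_a:.1}, stddev = {:.1}", var_a.sqrt());
    println!("two-pass: mean = {mean_b:.1}, stddev = {:.1}", var_b.sqrt());
}
```

Even in this tiny sample, a single outlier pulls the mean far above the median and dominates the standard deviation, which is the same pattern as in the results above, where the mean is several times the median.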