When should one use `Relaxed` memory ordering

When I first started learning about atomics, I used to think it was fine to blindly use Relaxed memory ordering if only a single variable was involved. But after thinking about it more, I feel that's not actually true... it also depends on what I want to achieve with that atomic variable.
For example, if we need some information (e.g., "please stop") to reach another thread, Relaxed ordering might not be a good choice, right? There would be no happens-before relationship between the threads (assuming all the joins happen after spawning all the threads), so in that case we might want to use Release and Acquire.

The only guarantee we sort of get with Relaxed is that, once all the atomic operations on a variable are done, we will see the value we were expecting, correct?

The memory ordering doesn't affect how quickly a change will be visible to other threads. Rather, it affects whether writes to other memory locations will be visible to another thread once it has seen the atomic write. As such, for a please-stop flag the memory ordering only has a performance effect, not a correctness effect. You need Release + Acquire if, for example, you are sending memory to another thread with unsafe code, or if you have a bunch of atomic variables where the actual order those atomic variables get updated in matters. If you are sending memory to another thread using e.g. Mutex, this will already establish a happens-before relationship internally.
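To make that concrete, here's a minimal sketch of a please-stop flag where Relaxed is all you need, because nothing besides the flag itself is communicated through it (names and the spin loop are just for illustration):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// The flag is the only shared state, so Relaxed is enough: no other
// memory needs to become visible through it.
static STOP: AtomicBool = AtomicBool::new(false);

fn main() {
    let worker = thread::spawn(|| {
        let mut iterations: u64 = 0;
        while !STOP.load(Ordering::Relaxed) {
            iterations += 1; // simulated work
        }
        iterations
    });

    // This store will eventually be seen by the worker's Relaxed load;
    // orderings affect what *else* becomes visible, not *when*.
    STOP.store(true, Ordering::Relaxed);

    // join() itself establishes a happens-before, so reading the
    // returned count here is fine.
    let iterations = worker.join().unwrap();
    println!("worker spun {iterations} times before stopping");
}
```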

6 Likes

The usual example of relaxed ordering is for metrics, counters, etc. For instance, if you need to process 100 items across several threads, each thread can share an atomic counter and increment it atomically with Relaxed ordering to learn which item to process next. Relaxed ordering is enough to ensure that no two threads ever see the same value when they increment the counter, because fetch_add is a single atomic read-modify-write. But there are more subtle places where it is appropriate. Look at the implementation of synchronization primitives in std and grep for Relaxed.
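A sketch of that counter pattern, assuming 100 items and 4 worker threads (both numbers arbitrary):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Distribute `items` work items across `workers` threads.
/// fetch_add is a single atomic read-modify-write, so every thread
/// receives a distinct index even with Relaxed ordering.
fn process_all(items: usize, workers: usize) -> usize {
    let next = AtomicUsize::new(0);
    let done = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items {
                    break; // nothing left to claim
                }
                // "process" item i, then count it
                done.fetch_add(1, Ordering::Relaxed);
            });
        }
    });
    // The scope joins all threads, which establishes the
    // happens-before needed to read the final count here.
    done.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(process_all(100, 4), 100);
}
```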

3 Likes

Also see loom which can help with finding cases where your specified ordering causes trouble.

1 Like

Obligatory mention: Rust Atomics and Locks by Mara Bos

10 Likes

One gotcha to look out for here is that on x86 a lot of "incorrect" choices of memory ordering (for example, only ever using Relaxed) will be "fine" because the ISA is strongly ordered, but if you ran the same code on a weakly ordered ISA (like ARM, for example) it would suddenly no longer work. It is very easy to naively write atomic code on x86 that seems to work but is, strictly speaking, not actually correct.

Keep in mind as well that the Rust memory model (which is not so well defined and mostly based on the C++ memory model atm) is orthogonal to the actual underlying architecture you're targeting. You can have code that is unsound according to the Rust abstract machine but "happens" to work on a more strongly ordered ISA. And indeed, memory ordering is really two distinct things, because both the compiler and the actual CPU can reorder operations.

This is a good explanation, in addition to the Mara Bos book linked by @scottmcm above which is also great.

2 Likes

hello, can you elaborate a bit more on this: why does introducing unsafe code require stronger guarantees (Acquire + Release)? thanks!

so an increment by a thread will be visible to other threads under Relaxed memory ordering?

Oo, thank you for this detail!

@zugzwanged You should really check out this book, it's so, so good. Mara has done an awesome job of condensing this very complex topic. It's a really great book.

1 Like

Yes, that is correct. However, if you have two distinct memory locations (var1 & var2) and update them both in, say, thread-A like:

var1.fetch_add(1, Ordering::Relaxed);
var2.fetch_add(1, Ordering::Relaxed);

then Relaxed is insufficient to ensure something like "if thread-B sees that var2 is updated, then it will definitely see that var1 is updated as well". There's no super-concise explanation though; it's a bit of a complex topic.

1 Like

yeah, right. For this we'd require Acquire and Release to form a happens-before relationship, right?

Relaxed memory ordering guarantees a total modification order of each individual atomic variable. This means that all modifications of the same atomic variable happen in an order that is the same from the perspective of every single thread.

~ from the atomics and locks book

But how does agreeing on a single modification order for every thread help with correctness?
what if the order was different?

I think the important part of your quote is "each individual atomic variable" :laughing:
This is what people usually expect relaxed atomic accesses to do:
a relaxed write to a place (say, an AtomicU8's location) becomes visible through atomic reads from that place (that AtomicU8's location) of any ordering.
This differs from a nonatomic access (simply called a memory access): a nonatomic write to a place is only guaranteed to be visible through reads from that place in the same thread.
Relaxed access in turn differs from stronger atomic accesses (i.e. Acquire, Release, AcqRel, SeqCst): between any two threads, memory writes sequenced before a Release write to a place become visible to memory reads sequenced after an Acquire read from that place, whereas Relaxed accesses give no such guarantee about surrounding memory accesses. This relationship between accesses in the Releasing thread and accesses in the Acquiring thread is called "happens-before": memory accesses before the Release write happen-before memory accesses after the Acquire read.

1 Like

It is not that the unsafe code itself requires stronger guarantees. Rather, all safe methods of moving memory to another thread (Mutex, RwLock, thread::spawn, JoinHandle::join, channels) already establish a happens-before relationship. Any unsafe code that moves memory also has to establish a happens-before relationship one way or another for soundness: either by using those safe methods of moving memory, or by using atomics with the right orderings.
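For instance (a hypothetical sketch with made-up data), sending plain non-atomic data through an mpsc channel needs no explicit orderings at all, because the channel synchronizes internally:

```rust
use std::sync::mpsc;
use std::thread;

fn transfer() -> Vec<i32> {
    let (tx, rx) = mpsc::channel();

    let producer = thread::spawn(move || {
        // Plain, non-atomic writes...
        let mut data = Vec::new();
        for i in 1..=3 {
            data.push(i);
        }
        // ...are guaranteed visible to whoever receives `data`,
        // because send() happens-before the matching recv().
        tx.send(data).unwrap();
    });

    let received = rx.recv().unwrap();
    producer.join().unwrap();
    received
}

fn main() {
    assert_eq!(transfer(), vec![1, 2, 3]);
}
```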

1 Like

This is valid only if I run the following as a single atomic instruction, i.e. an RMW such as fetch_add, correct?

Yeah, understandable

As usual, a good question contains half of the answer, and a bad question contains half of a good question.

You are asking almost the right question. But not quite.

Because the correct question would be:

Just one letter… yet such a difference. Think about physical reality: each CPU core has its own memory, its own variables, and so on.

The majority of these copies agree (each CPU core normally has only a few kilobytes of its own memory, while the memory of modern systems is measured in gigabytes), but the disagreements are enough to cause problems.

And that's why it's important to, sometimes, establish a relationship with other variables. The CPU uses a special protocol to ensure that all CPUs agree on the value of one atomic variable, but it doesn't guarantee that writes to one variable (multiple copies, because there are many cores!) are synchronized with writes to other variables (again: multiple copies, because there are multiple cores!).

And for many algorithms you kinda need additional guarantees about the ordering of different operations.

E.g. even with just a simple, trivial mutex: without "happens before", one thread may "take the lock" (a mutex is one atomic variable plus a certain protocol), read something from your object, then "release" it, and another thread would do the same… but this slow and unwieldy special dance takes dozens or, sometimes, hundreds of CPU ticks… without "happens before", the content protected by the mutex may well be processed in the wrong order, as if the lock hadn't existed at all! And it may even be another mutex, without another atomic variable.

People are different, but for me it was always easier to imagine how things happen in reality, how physics dictates our decisions, than to read "arbitrary rules in a book". The trick is not to read about concrete details of a concrete CPU, but only to think about what physical limitations dictate. Otherwise you risk building a mental model that holds only for one kind of CPU (and remember that x86 is a weirdo; many things that we think of as "normal" are only true for one very old architecture).

1 Like

Instead of talking in the abstract about the formal properties of memory orderings, to really learn this stuff you have to read and write implementations of concurrency primitives. Other people have recommended Mara Bos's book. That's good; so are C++ Concurrency in Action (Rust is substantially similar), the book by Michael Scott on shared-memory synchronization, The Art of Multiprocessor Programming, and Dmitry Vyukov's 1024cores. In terms of implementations look at std, parking_lot, and the crossbeam channels.

Try to write an SPSC queue. There's a good CppCon talk on YouTube about this.
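For a taste of the exercise, here's a toy sketch of what such a queue can look like (not production code; the power-of-two capacity and the demo's item count are arbitrary choices). Producer and consumer each own one index, and the only cross-thread edges are the Release store / Acquire load pairs on head and tail:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

const CAP: usize = 8; // power of two, so `% CAP` stays cheap

// A toy bounded SPSC queue with free-running head/tail counters:
// the producer owns `tail`, the consumer owns `head`.
struct Spsc {
    buf: [UnsafeCell<usize>; CAP],
    head: AtomicUsize, // next slot to read
    tail: AtomicUsize, // next slot to write
}

// SAFETY: only sound with exactly one producer and one consumer thread.
unsafe impl Sync for Spsc {}

impl Spsc {
    fn new() -> Self {
        Spsc {
            buf: std::array::from_fn(|_| UnsafeCell::new(0)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer only. Returns false if the queue is full.
    fn push(&self, v: usize) -> bool {
        let tail = self.tail.load(Ordering::Relaxed); // we own tail
        let head = self.head.load(Ordering::Acquire); // pairs with pop's Release
        if tail - head == CAP {
            return false; // full
        }
        // The Acquire above guarantees the consumer is done with this slot.
        unsafe { *self.buf[tail % CAP].get() = v };
        // Release: the slot write must be visible before the new tail is.
        self.tail.store(tail + 1, Ordering::Release);
        true
    }

    /// Consumer only. Returns None if the queue is empty.
    fn pop(&self) -> Option<usize> {
        let head = self.head.load(Ordering::Relaxed); // we own head
        let tail = self.tail.load(Ordering::Acquire); // pairs with push's Release
        if head == tail {
            return None; // empty
        }
        let v = unsafe { *self.buf[head % CAP].get() };
        // Release: our read of the slot must finish before it can be reused.
        self.head.store(head + 1, Ordering::Release);
        Some(v)
    }
}

fn run_demo() -> usize {
    let q = Spsc::new();
    thread::scope(|s| {
        s.spawn(|| {
            for i in 0..100 {
                while !q.push(i) {} // spin while full
            }
        });
        // Consume on this thread: sum everything the producer sent.
        let mut sum = 0;
        let mut received = 0;
        while received < 100 {
            if let Some(v) = q.pop() {
                sum += v;
                received += 1;
            }
        }
        sum
    })
}

fn main() {
    assert_eq!(run_demo(), 4950); // 0 + 1 + … + 99
}
```

Convincing yourself why each of those four orderings is necessary (and what breaks if any of them is weakened to Relaxed) is most of the learning value.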

3 Likes

I'm not exactly sure what you mean but let me assume you meant one of these two meanings:

  1. Is the relaxed write only immediately visible to atomic reads within a single atomic instruction?
  2. Is the relaxed write only immediately visible to other atomic reads?

So to answer both:

  1. This is a bit of a weird question. Wouldn't the relevant kind of instruction be write-modify-read? :smiley: Anyway, not only.
  2. When you read non-atomically from a place that is written to by other threads, without synchronization by a happens-before relationship, it is a data race, so yes.

I am so sorry for not looking at your initial query. Let me try to actually answer your question. That said, it might be a swell idea to go straight to reading Ch. 3..=6 of Rust Atomics and Locks by Mara Bos instead.
It is indeed fine to blindly use Relaxed memory ordering if only a single variable is involved. But when is only a single variable involved? Let's take your example: we need to deliver "please stop" to another thread.

trivial version:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

let at_var = AtomicBool::new(true);
thread::scope(|s|
{
  s.spawn(|| while at_var.load(Ordering::Relaxed) { /* stuff */ });
  /* stuff */
  at_var.store(false, Ordering::Relaxed);
});

And let's say both /* stuff */ blocks do not trigger undefined behavior and halt :grinning_face:; then this is totally fine, as at_var.load(Ordering::Relaxed) would eventually observe the at_var.store(false, Ordering::Relaxed).
But does the code in question really look like this? Often we write:

less trivial version:

let at_var = AtomicBool::new(true);
/* stuff */
thread::scope(|s|
{
  s.spawn(||
  {
    while at_var.load(Ordering::Relaxed) { /* stuff */ }
    /* stuff */ 
  });
  /* stuff */
  at_var.store(false, Ordering::Relaxed);
});

Taking from yarn's example, this could look like:

let at_var = AtomicBool::new(true);
let ot_var = AtomicUsize::new(0);
thread::scope(|s|
{
  s.spawn(||
  {
    while at_var.load(Ordering::Relaxed) {}
    assert_eq!(ot_var.load(Ordering::Relaxed), 42);
  });
  ot_var.store(42, Ordering::Relaxed);
  at_var.store(false, Ordering::Relaxed);
});

The assertion can actually fail! This is because the compiler is free to reorder the two relaxed loads (and the two relaxed stores) when it can prove they touch separate locations, and out-of-order execution done by one processor may be observed by another. So we need to ask both the compiler and the hardware to behave.
In Rust, we do this by:

let at_var = AtomicBool::new(true);
let ot_var = AtomicUsize::new(0);
thread::scope(|s|
{
  s.spawn(||
  {
    /* reads after this acq-load stay after it */
    while at_var.load(Ordering::Acquire) {}
    assert_eq!(ot_var.load(Ordering::Relaxed), 42);
  });
  ot_var.store(42, Ordering::Relaxed);
  /* writes before this rel-store stay before it */
  at_var.store(false, Ordering::Release);
});

Thus in this example, there's not only one variable involved in our delivery of the "please stop" signal. There's a separate payload :person_facepalming:
So why would we go through the trouble of figuring out whether there's more than one variable involved, rather than just going Acquire for reads, Release for writes, and AcqRel for read-modify-writes? Practically, because Relaxed accesses leave the compiler and hardware free to keep data in core-local caches and reorder around them, so hopefully fewer accesses have to go all the way out to shared memory than if everything were as AcqRel as possible.

big edit: apparently C11 (where Rust's atomics are from) may have restrictions on what "signal handlers" are [7.14.1..=2]: it might as well have nothing to do with preemption. The Rustonomicon also has an open issue about misconceptions regarding hardware causes of data races, so don't read it I guess :frowning:

1 Like