Yeah, that's what I went looking for more info on.

r/rust - A new diagram-based explanation of Rust/C++20 atomic orderings
I'm considering opening a documentation issue. These are my reasons: the std::sync::atomic docs claim that "Rust atomics currently follow the same rules as C++20 atomics, […]", while the documentation of std::sync::atomic::Ordering doesn't give the same guarantees as the C++ reference (it gives fewer guarantees, because read-modify-write operations being part of the release sequence are ignored in the Rust docs).

Overall, it's very difficult for someone who starts with the Rust documentation to get an overview of what Rust atomics really do. I think this should/could be improved.
I think it would be nice to improve the documentation, though I would make sure to think about which particular changes you would make, especially re your second point.
I don't think I really have enough knowledge about C++ to make a PR; I simply thought of opening an issue (and letting someone else fix this).
But I wonder if maybe the Rust documentation deliberately ignores the release sequences due to having more flexibility in future? Maybe at some point the second rule might be dropped too?
Release sequence
[…]
- […]
- Atomic read-modify-write operations made to M by any thread
I don't really know if this is relevant in practice or used by a lot of code. It certainly would be a breaking change to remove this (but removing rule 1 was a breaking change from C++11 to C++20 too, I guess).
In either case, it should be clarified whether the C++ rules really take precedence, and if they do, the Rust docs should be adjusted accordingly (or extended with a warning that the behavior of Rust regarding release sequences may change at any point in the future and cannot be relied upon).
More to the point, Rust effectively includes the C++ (or rather, LLVM) atomic model by reference. Rust isn't trying to define its own atomic memory model; it's just exposing the one that already (somewhat) exists.
In general, use of atomics should be limited (within a potentially parallel unit not externally synchronized) to only touching any given atomic with Relaxed, AcqRel, or SeqCst orderings, and not mixing them. Any algorithm mixing the different orderings on a single atomic likely requires a PhD-level understanding of weak memory effects.
It's also worth noting that the std::sync::atomic docs aren't the only documentation for atomics Rust provides; the API documentation in fact directly links to the Nomicon's section on atomics as well. The Nomicon's section on atomics is fairly short, and its author, Gankra, has pointed to Mara Bos's O'Reilly book on low-level concurrency and said that it is essentially everything she would've wanted to say -- and since it's a whole book of its own, it makes sense that it's not directly present in the Rust documentation or the Nomicon.
Atomic synchronization is complicated and really can't be done justice in anything less than dedicated reference material just for atomics. I've not read the book I've linked here personally, but there have been plenty of testimonials that it's a very good resource for explaining and understanding atomics.
A possible solution for the contradictions/incompleteness(?) in regard to the documented behavior of atomics could be to simply mark the std::sync::atomic documentation explicitly as non-normative. I don't think that's a nice way to go, but it would be the least effort.
Even if you might say this is stuff only experts use, I think having good and freely (and easily) accessible documentation is a good thing to have.
Also, I think there are several use cases where high(er)-level code might use atomics, e.g. a Relaxed AtomicU64 counter to create unique IDs, and maybe AtomicBools for signalling (though here things already get complicated, because the synchronizes-with relation only goes from the Release-store to the Acquire-load, which may not be sufficient in some cases).
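The unique-ID counter case can be sketched like this (the `next_id` helper is a hypothetical name, illustration only); Relaxed suffices because we only need the increment itself to be atomic, not any ordering of surrounding memory operations:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Global counter handing out unique IDs.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

// Relaxed is enough: each fetch_add is atomic, so no two callers
// (on any threads) can ever observe the same value.
fn next_id() -> u64 {
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    let a = next_id();
    let b = next_id();
    assert_ne!(a, b); // IDs are unique even across threads
}
```

Note that Relaxed gives no ordering guarantees about other data; it only makes this one counter race-free.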
While I agree it's complicated, I feel like the way it's currently explained makes it unnecessarily complicated to understand. But maybe I'm still missing some things.
Unfortunately, explaining atomics in a way that is both accurate even in edge cases and simpler than the full standardese (or, if you're both unlucky and picky, a CPU diagram) is a difficult problem, and people who put the effort into writing such learning material generally want to get paid for the work.
Now we have two hard problems
Typically, the preferred answer is to provide wrappers that implement the patterns at the higher level, for which the Rust API docs are sufficient, and for which a wrapper that doesn't expose the complexity of atomic ordering is fundamentally trivial (which is probably why nobody actually provides this as a specialized type).
You don't want to use purely an AtomicBool, because then the waiting thread has no way to wait other than spinning, and spinlocks are bad (same link as previous). But there's a library for that:
This allows you to do condvar/futex wait/notify on an AtomicU32 rather than a Mutex.
In general, you only have synchronization from stores to loads anyway. The other three options fundamentally don't make any sense, because they have no way to tell whether the previous operation that they'd theoretically be synchronized with happened. (fetch-op/compare-and-swap/etc. style operations are logically both a load and a store.)
At least from my understanding of atomics, anything that isn't satisfied by AcqRel is in the PhD-level complexity range of showing that the used synchronization is actually sufficient. (That said, SeqCst is still a safer default, because it means the reordering model of atomics actually works, whereas that isn't the case for AcqRel, since threads can observe synchronized events in different orders.)
Mixing Relaxed with AcqRel is probably achievable, but I'd stay away from it without a specific reason, not least because x86_64 concretely doesn't offer memory ordering weaker than AcqRel.
I didn't mean something like a Condvar here, but rather use cases where you want to inform a thread to exit at the next suitable moment, for example.
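That "tell a thread to exit at the next suitable moment" use case can be sketched as a tiny wrapper (the `StopFlag` type and its method names are hypothetical, not from any library); Release/Acquire is chosen so that anything written before requesting the stop is also visible to the exiting thread:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal stop-signal wrapper that hides the ordering parameter.
pub struct StopFlag(AtomicBool);

impl StopFlag {
    pub const fn new() -> Self {
        StopFlag(AtomicBool::new(false))
    }

    /// Request a stop. The Release store pairs with the Acquire
    /// load in `should_stop`, so writes made before this call are
    /// visible to the thread that observes the flag.
    pub fn request_stop(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Checked periodically by the worker loop.
    pub fn should_stop(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}
```

If the flag is purely a "please exit" signal and no other data rides along with it, Relaxed would also suffice here; the Acquire/Release pairing is the conservative choice.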
At least from my understanding of atomics, anything that isn't satisfied by AcqRel is in the PhD-level complexity to show that the used synchronization is actually sufficient. (That said, SeqCst is still a safer default, because it means the reordering model of atomics actually works, whereas that isn't the case for AcqRel since threads can observe synchronized events in different orders.)
If I understood right, then SeqCst may be leading to a wrong perception of safety, though: SeqCst guarantees much less than people would expect it to. Essentially, it guarantees that all threads will agree on a single total order for all SeqCst writes, as long as they read with SeqCst - that's it. In particular, what it doesn't do is give loads Release semantics or writes Acquire semantics.
So, following that reasoning, the problem often isn't using AcqRel instead of SeqCst, but using atomics in the first place (instead of mutexes).
Sorry if this has already been mentioned, but I found this to be a pretty interesting and helpful approach to teaching atomics:
Looking into the rendered PR for the Nomicon, I see that the issue with the release sequences (which I pointed out in issue #104814) seems to be correctly described in this part.
But even with the Nomicon being fixed by that PR (if it gets approved), the std reference still shouldn't make wrong or misleading statements.
The drawings seem to be nice too. I will certainly look further into this. Thanks for sharing.
As a rule of thumb, there are three positions you can be in with atomics:

- Relaxed is good enough - you just need accesses to this variable to be atomic, and you don't need a guarantee that there is one globally visible ordering for accesses to this variable. This solves, for example, the "I need a unique u64" case, and the "I need a guarantee that multiple threads reading this bool always see either true or false, even when I have concurrent writes to the field" case.
- You can build what you need using Acquire, Release, and AcqRel, and you are completely confident that an Acquire load pairs up with a Release store to get you the semantics you need.
- If you're not in one of the first two positions, then you should probably use a Mutex, RwLock, Barrier or a channel, and let someone else care about orderings - in particular, SeqCst is very, very rarely what you want, since the only guarantee it adds beyond AcqRel is that there exists a single total order of operations on the atomic itself that all threads agree on. Making this guarantee is expensive on most CPUs, and I've only ever seen it needed in the case of building a dynamically sized MPMC channel with a guaranteed bound on memory consumption (computed based on the speeds of the producers and consumers).
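The Acquire/Release pairing described here can be sketched as the classic message-passing pattern (the `demo` function is a hypothetical name for illustration): a non-atomic write is "published" by a Release store, and the Acquire load that observes the flag is guaranteed to also see that write.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static READY: AtomicBool = AtomicBool::new(false);
static mut DATA: u64 = 0;

fn demo() -> u64 {
    let producer = thread::spawn(|| {
        unsafe { DATA = 42 }; // plain, non-atomic write
        // Release store pairs with the Acquire load below:
        // everything before it happens-before anything after
        // an Acquire load that reads `true`.
        READY.store(true, Ordering::Release);
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // Happens-after the Release store, so this reads 42.
        unsafe { DATA }
    });
    producer.join().unwrap();
    consumer.join().unwrap()
}

fn main() {
    assert_eq!(demo(), 42);
}
```

With Relaxed on both sides instead, the consumer could legally observe `READY == true` while still reading a stale `DATA`, which is exactly the kind of bug the pairing prevents.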
It's also worth noting that the rise in cost as you go from non-atomic through Relaxed through to SeqCst is significant. On x86, any atomic is more expensive than non-atomic, but the cost of Relaxed, Acquire and Release is all the same (because that's the x86 memory model). SeqCst then adds more cost, since you now need system-wide barriers to ensure the total ordering is met.
On ARM, even going from Relaxed to Acquire and Release needs a barrier - and going to SeqCst requires heavier barriers to ensure that the total ordering is met. As the point of using an atomic directly is to get something lighter-weight than the atomics implied by the higher-level primitives like Barrier and Mutex, it's often a bad tradeoff to go from a high-level primitive to a SeqCst atomic.
Relaxed is good enough - you just need accesses to this variable to be atomic, and you don't need a guarantee that there is one globally visible ordering for accesses to this variable.
I think this might be the most common case where higher-level code would use atomics instead of mutexes.
- You can build what you need using Acquire, Release, and AcqRel, and you are completely confident that an Acquire load pairs up with a Release store to get you the semantics you need.
Question: Can't I always build what I need with Acquire and Release? After all, mutexes can be built with those (they don't need SeqCst). And with mutexes I can ensure sequential ordering of accesses where needed.
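The claim that a mutex needs only Acquire/Release can be sketched as a minimal spinlock (the `SpinLock` type here is a hypothetical illustration - real mutexes park the thread instead of spinning): the Acquire in `lock` synchronizes-with the Release in the previous holder's unlock, which is what makes the critical sections appear sequential.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safe because access to `value` is serialized by `locked`.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        SpinLock {
            locked: AtomicBool::new(false),
            value: UnsafeCell::new(value),
        }
    }

    /// Run `f` with exclusive access to the protected value.
    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Acquire on success: we see every write made before the
        // previous holder's Release store below.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        let result = f(unsafe { &mut *self.value.get() });
        // Release: publish our writes to the next locker.
        self.locked.store(false, Ordering::Release);
        result
    }
}
```

No SeqCst appears anywhere, which supports the point: per-lock Acquire/Release is enough for mutual exclusion; SeqCst only matters when you need all threads to agree on a total order across several atomics.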
This makes me believe that using atomics with SeqCst
(instead of mutexes) indeed is an optimization for certain corner cases that are very rare.
I would say a reasonable approach in practice is this: where Relaxed is sufficient, use atomics. In either case, SeqCst is never needed, as sequential consistency can be emulated with mutexes, which, in turn, can be implemented with Acquire/Release ordering, right? Or do I misunderstand something?
Assuming that you don't care about anything other than correctness, then Acquire and Release are all you need. There are certain very weird algorithms where the extra guarantee from SeqCst enables you to put bounds on memory consumption, CPU time, and similar non-correctness things.
The reason I say that Acquire and Release are OK if you can justify them is that they're sufficiently scary to put off people who don't know what they're doing, while still good enough to write lock-free algorithms if you know what you're doing. Because the names are a bit strange, and the description is technical and complex, people tend to shy away from them.
The trouble with SeqCst is that it's an attractive nuisance; it "looks" to the naïve eye like a magic "do what I want" ordering, but most of the time (all but one time I've seen it during code review - the other time had a reference to a peer-reviewed paper showing why SeqCst was worthwhile there) you either needed Relaxed (which, on the systems we were going to run that code on, cost around 10x a non-atomic access) while paying the price of SeqCst, or you're not actually getting the guarantees you expected (notably around the behaviour of non-atomic reads and writes between SeqCst reads and writes).
One possibly important difference between Mutex and an atomic is that the former may (if you’re unlucky) block your thread while the latter will never do that. If what you want can be expressed by a single atomic operation, then your code is not only lock-free, it is wait-free (meaning: your thread is guaranteed to make progress if there is a CPU to run it). If you’re spinning on an atomic then chances are that you should either implement a slow path with kernel support, or simply use a Mutex instead — because otherwise you’re implementing a simple spin lock yourself.
One remark on x86: The only difference between a “release” and a relaxed write is that the core’s store buffer is drained for the release. This can be noticeable in workloads that are heavy on memory writes, but here we’re definitely in “thou shalt benchmark!” terrain.
As a rule of thumb, there are three positions you can be in with atomics:
One interesting and quite common case that isn't in your first two buckets is reference counting, where Relaxed is usually used for cloning. From the standard library:
```rust
fn clone(&self) -> Arc<T> {
    // Using a relaxed ordering is alright here, as knowledge of the
    // original reference prevents other threads from erroneously deleting
    // the object.
    //
    // As explained in the [Boost documentation][1], Increasing the
    // reference counter can always be done with memory_order_relaxed: New
    // references to an object can only be formed from an existing
    // reference, and passing an existing reference from one thread to
    // another must already provide any required synchronization.
    //
    // [1]: https://www.boost.org/doc/libs/1_55_0/doc/html/atomic/usage_examples.html
    let old_size = self.inner().strong.fetch_add(1, Relaxed);
```
Isn't that suggesting that the access mode should be encoded in the type, and not as a flag on every access function of a universal type? The current API seems immensely accident-prone as well as unnecessarily unapproachable.
Note that it's possible to mix access modes, i.e. the same atomic can be accessed as Relaxed from one thread, for example, and as AcqRel from another.
A key point is that even if you follow the principle of not mixing AcqRel with Relaxed, this only applies within a single concurrent section. It is perfectly valid even under this practice to have a single atomic which is used Relaxed at one point, but then after a sync point (e.g. the join in a fork-join workload; the end of a rayon par_iter) gets used in an AcqRel fashion.
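A sketch of that pattern, using std's scoped threads as the fork-join sync point instead of rayon (the `count_matches` helper is a hypothetical name for illustration): the counter is bumped Relaxed inside the parallel section, and the join performed by `thread::scope` is the sync point, after which the result can be read without any atomic ordering at all.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn count_matches(chunks: &[&[u64]]) -> u64 {
    let hits = AtomicU64::new(0);
    thread::scope(|s| {
        for chunk in chunks {
            let hits = &hits;
            s.spawn(move || {
                for &x in *chunk {
                    if x % 2 == 0 {
                        // Relaxed is fine inside the parallel section:
                        // we only need atomicity, not ordering.
                        hits.fetch_add(1, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    // thread::scope joins every spawned thread before returning,
    // which is the sync point; afterwards we own the counter again
    // and can read it as a plain value.
    hits.into_inner()
}
```

The happens-before edge here comes from the join itself, not from the atomic orderings, which is exactly why Relaxed inside the section is sound.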
The Atomic* types are the building blocks, so they expose the underlying model fairly transparently.