Better understanding atomics

Yeah, that's what I went looking for more info on.

1 Like

I'm considering to open a documentation issue. These are my reasons:

  • std::sync::atomic docs claim that "Rust atomics currently follow the same rules as C++20 atomics, […]" while the documentation of std::sync::atomic::Ordering doesn't give the same guarantees as given in the C++ reference (but less guarantees, because read-modify-write operations as being part of the release sequence are ignored in the Rust docs).
  • The linked C++ reference isn't describing the important "synchronizes-with" relation, so it's really hard to understand/follow the explanations there.

Overall, it's very difficult for someone who starts with the Rust documentation to get an overview on what the Rust atomics really do. I think this should/could be improved.

2 Likes

I think it would be nice to improve the documentation, though I would make sure to think about which particular changes you would make, especially re your second point.

1 Like

I think I don't really have enough knowledge about C++ to make a PR, I simply thought of an issue (and let someone else fix this :sweat_smile:).

But I wonder if maybe the Rust documentation deliberately ignores the release sequences due to having more flexibility in future? Maybe at some point the second rule might be dropped too?

Release sequence

[…]

  1. […]
  2. Atomic read-modify-write operations made to M by any thread

I don't really know if this is relevant in practice or used by a lot of code. It certainly would be a breaking change to remove this (but removing rule 1 was a breaking change from C++11 to C++20 too, I guess).

In either case, it should be clarified whether the C++ rules really take precedence, and if they do, the Rust docs should be adjusted in that matter (or extended with a warning that the behavior of Rust regarding release sequences may change at any point in future and cannot be relied upon).

1 Like

Moreso, it's that Rust effectively includes the C++ (or rather, LLVM) atomic model by reference. Rust isn't trying to define its own atomic memory model; it's just exposing the one that already (somewhat) exists.

In general, use of atomics should be limited (within a potentially parallel unit not externally synchronized) to only touching any given atomic with Relaxed, AcqRel, or SeqCst orderings, and not mixing them. Any algorithm mixing the different orderings on a single atomic likely requires a PhD-level understanding of weak memory effects.

3 Likes

It's also worth noting that the std::atomic docs aren't the only documentation for atomics Rust provides; the API documentation in fact directly links to the nomicon's section on atomics as well. It's worth noting that the nomicon's sections on atomics is fairly short, and the nomicon's author, Gankra, has pointed to Mara Bos's O'Reilly book on low-level concurrency and said that it is essentially everything she would've wanted to say -- and that due to it being a whole book of its own, it makes sense that it's not directly present in the rust documentation or the nomicon.

Atomic synchronization is complicated and really can't be done justice in anything less than dedicated reference material just for atomics. I've not read the book I've linked here personally, but there have been plenty of testimonials that it's a very good resource for explaining and understanding atomics.

3 Likes

A possible solution for contradictions incompleteness(?) in regard to the documented behavior of atomics could be to simply mark std::sync::atomic documentation explicitly as non-normative. But I don't think that's a nice way to go. But it would be least effort.

Even if you might say this is stuff only experts use, I think having good and freely (and easily) accessible documentation is a good thing to have.

Also, I think there are several use cases where high(er) level code might use atomics, e.g. a Relaxed AtomicU64 counter to create unique IDs and maybe AtomicBools for signalling (though here things already get complicated because the synchronizes-with relation only goes from the Release-store to the Acquire-load, which may not be sufficient in some cases).


While I agree it's complicated, I feel like the way it's currently explained makes it unnecessarily complicated to understand. But maybe I'm still missing some things.

1 Like

Unfortunately, explaining atomics in a way that is both accurate even in edge cases and simpler than the full standardeeze (or if you're both unlucky and picky, a CPU diagram) is a difficult question, and people who put the effort into writing such learning material generally want to get paid for the work.

Now we have two hard problems :slightly_smiling_face:

Typically, the preferred answer is to provide wrappers that implement the patterns at the higher level,

for which the Rust API docs are sufficient, and for which a wrapper that doesn't expose the complexity atomic ordering is fundamentally trivial (which is probably why nobody actually provides this as a specialized type)

You don't want to use purely AtomicBool, because then the waiting thread has no way to wait other than spinning, and spinlocks are bad (same link as previous). But there's a library for that:

This allows you to do condvar/futex wait/notify on an AtomicU32 rather than a Mutex.

In general, you only have synchronization from stores to loads anyway. The other three options fundamentally don't make any sense, because they have no way to tell whether the previous operation that they'd theoretically by synchronized with happened. (fetch-op/compare-and-swap/etc style operations are logically both a load and a store.)

At least from my understanding of atomics, anything that isn't satisfied by AcqRel is in the PhD-level complexity to show that the used synchronization is actually sufficient. (That said, SeqCst is still a safer default, because it means the reordering model of atomics actually works, whereas that isn't the case for AcqRel since threads can observe synchronized events in different orders.)

Mixing Relaxed with AcqRel is probably achievable, but I'd stay away without specific reason, not the least because x86_64 concretely doesn't offer weaker memory ordering that AcqRel.

2 Likes

I didn't mean something like a CondVar here, but more use-cases where you want to inform a thread to exit in the next suitable moment, for example.

If I understood right, then SeqCst may be leading to a wrong perception of safety though:

So following that reasoning, the problem often isn't to use AcqRel instead of SeqCst but to use atomics in the first place (instead of mutexes).

2 Likes

I opened an issue #104814 in regard to the documentation of std::sync::atomic::Ordering.

2 Likes

Sorry if this has already been mentioned, but I found this to be a pretty interesting and helpful approach to teaching atomics:

5 Likes

Looking into the rendered PR for the Nomicon, I see that the issue with the release sequences (which I pointed out in issue #104814) seems to be correctly described in this part.

But even with the Nomicon being fixed by that PR (if it gets approved), the std reference still shouldn't make wrong or misleading statements.

The drawings seem to be nice too. I will certainly look further into this. Thanks for sharing.

1 Like

As a rule of thumb, there are three positions you can be in with atomics:

  1. Relaxed is good enough - you just need accesses to this variable to be atomic, and you don't need a guarantee that there is one globally visible ordering for accesses to this variable. This solves, for example, the "I need a unique u64" case, and the "I need a guarantee that multiple threads reading this bool always see either true or false, even when I have concurrent writes to the field".
  2. You can build what you need using Acquire, Release, and AcqRel, and you are completely confident that an Acquire load pairs up with a Release store to get you the semantics you need.
  3. There's a peer-reviewed paper, a PhD thesis, or other similar grade of theoretical work, describing your exact use of atomic operations, and you're only after the guarantees that the academic work says you can get.

If you're not in one of those three cases, then you should probably use a Mutex, RwLock, Barrier or a channel, and let someone else care about orderings - in particular, SeqCst is very, very rarely what you want, since the only guarantee it adds beyond AcqRel is that there exists a single total order of operations on the atomic itself that all threads agree on. Making this guarantee is expensive on most CPUs, and I've only ever seen it needed in the case of building a dynamically sized MPMC channel with a guaranteed bound on memory consumption (computed based on the speeds of the producers and consumers).

It's also worth noting that the rise in cost as you go from non-atomic through Relaxed through to SeqCst is significant. On x86, any atomic is more expensive than non-atomic, but the cost of Relaxed, Acquire and Release is all the same (because that's the x86 memory model). SeqCst then adds more cost, since you now need system-wide barriers to ensure the total ordering is met.

On ARM, even going from Relaxed to Acquire and Release needs a barrier - and going to SeqCst requires heavier barriers to ensure that the total ordering is met. As the point of using an atomic directly is to get something lighter-weight than the atomics implied by the higher-level primitives like Barrier and Mutex, it's often a bad tradeoff to go from a high-level primitive to a SeqCst atomic.

6 Likes

I think this might be the most common case where higher-level code would use atomics instead of mutexes.

Question: Can't I always build what I need with Acquire and Release? Afterall, mutexes can be built with those (they don't need SeqCst). And with mutexes I can ensure sequential ordering of accesses where needed.

This makes me believe that using atomics with SeqCst (instead of mutexes) indeed is an optimization for certain corner cases that are very rare.

I would say a reasonable approach in practice is this:

  • If Relaxed is sufficient, use atomics.
  • For anything else use mutexes.
  • If performance is an issue, you may optimize some problems with atomics. But in this case you need to be very careful anyway and exactly follow which guarantees are needed and which synchronizations take place.

In either case, SeqCst is never needed as sequential consistency can be emulated with mutexes, which, in turn, can be implemented with Acquire/Release ordering, right? Or do I misunderstand something?

2 Likes

Assuming that you don't care about anything other than correctness then Acquire and Release are all you need. There are certain very weird algorithms where the extra guarantee from SeqCst enables you to put bounds on memory consumption, CPU time, and similar non-correctness things.

The reason I say that Acquire and Release are OK if you can justify them is that they're sufficiently scary to put off people who don't know what they're doing, while still good enough to write lock-free algorithms if you know what you're doing. Because the names are a bit strange, and the description is technical and complex, people tend to shy away from them.

The trouble with SeqCst is that it's an attractive nuisance; it "looks" to the naïve eye like a magic "do what I want" ordering, but most of the time (all but one time I've seen it during code review - the other time had a reference to a peer-reviewed paper showing why SeqCst was worthwhile here) you either needed Relaxed (on the systems we were going to run that code on, cost around 10x a non-atomic access) and are paying the price of SeqCst or you're not actually getting the guarantees you expected (notably around the behaviour of non-atomic reads and writes between SeqCst reads and writes).

1 Like

One possibly important difference between Mutex and an atomic is that the former may (if you’re unlucky) block your thread while the latter will never do that. If what you want can be expressed by a single atomic operation, then your code is not only lock-free, it is wait-free (meaning: your thread is guaranteed to make progress if there is a CPU to run it). If you’re spinning on an atomic then chances are that you should either implement a slow path with kernel support, or simply use a Mutex instead — because otherwise you’re implementing a simple spin lock yourself.


One remark on x86: The only difference between a “release” and a relaxed write is that the core’s store buffer is drained for the release. This can be noticeable in workloads that are heavy on memory writes, but here we’re definitely in “thou shalt benchmark!” terrain.

2 Likes

One interesting and quite common case that isn't in your two first buckets is reference counting, where relaxed is usually used for cloning. From the standard library:

fn clone(&self) -> Arc<T> {
    // Using a relaxed ordering is alright here, as knowledge of the
    // original reference prevents other threads from erroneously deleting
    // the object.
    //
    // As explained in the [Boost documentation][1], Increasing the
    // reference counter can always be done with memory_order_relaxed: New
    // references to an object can only be formed from an existing
    // reference, and passing an existing reference from one thread to
    // another must already provide any required synchronization.
    //
    // [1]: (www.boost.org/doc/libs/1_55_0/doc/html/atomic/usage_examples.html)
    let old_size = self.inner().strong.fetch_add(1, Relaxed);
5 Likes

Isn't that suggesting that the access mode should be encoded in the type and not as a flag on every access functions on a universal type? The current API seems immensely accident-prone as well as unnecessarily unapproachable.

Note that it's possible to mix access modes, i.e. the same atomic can be accessed as Relaxed from one thread, for example, and as AcqRel from another.

A key point is that even if you follow the principle of not mixing AcqRel with Relaxed, this only applies within a single concurrent section. It is perfectly valid even under this practice to have a single atomic which is used Relaxed at one point, but then after a sync point (e.g. the join in a fork-join workload; the end of a rayon par_iter) gets used in a AcqRel fashion.

The Atomic* types are the building blocks, so they expose the underlying model fairly transparently.

1 Like