Better understanding atomics

I opened issue #104814 regarding the documentation of std::sync::atomic::Ordering.

2 Likes

Sorry if this has already been mentioned, but I found this to be a pretty interesting and helpful approach to teaching atomics:

5 Likes

Looking into the rendered PR for the Nomicon, I see that the issue with the release sequences (which I pointed out in issue #104814) seems to be correctly described in this part.

But even with the Nomicon being fixed by that PR (if it gets approved), the std reference still shouldn't make wrong or misleading statements.

The drawings seem to be nice too. I will certainly look further into this. Thanks for sharing.

1 Like

As a rule of thumb, there are three positions you can be in with atomics:

  1. Relaxed is good enough - you just need accesses to this variable to be atomic, and you don't need a guarantee that there is one globally visible ordering of accesses to this variable. This covers, for example, the "I need a unique u64" case, and the "I need a guarantee that multiple threads reading this bool always see either true or false, even when there are concurrent writes to the field" case.
  2. You can build what you need using Acquire, Release, and AcqRel, and you are completely confident that an Acquire load pairs up with a Release store to get you the semantics you need.
  3. There's a peer-reviewed paper, a PhD thesis, or other similar grade of theoretical work, describing your exact use of atomic operations, and you're only after the guarantees that the academic work says you can get.
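To make bucket 1 concrete, here's a minimal sketch of the "I need a unique u64" case (the names are mine, not from any particular codebase):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A process-wide counter; Relaxed is enough because we only need each
// fetch_add to be atomic, not to order any other memory accesses.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

fn unique_id() -> u64 {
    // fetch_add returns the previous value, so every caller gets a
    // distinct id even under contention.
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}
```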

If you're not in one of those three cases, then you should probably use a Mutex, RwLock, Barrier or a channel, and let someone else care about orderings - in particular, SeqCst is very, very rarely what you want, since the only guarantee it adds beyond AcqRel is that there exists a single total order of operations on the atomic itself that all threads agree on. Making this guarantee is expensive on most CPUs, and I've only ever seen it needed in the case of building a dynamically sized MPMC channel with a guaranteed bound on memory consumption (computed based on the speeds of the producers and consumers).

It's also worth noting that the rise in cost as you go from non-atomic through Relaxed through to SeqCst is significant. On x86, any atomic is more expensive than non-atomic, but the cost of Relaxed, Acquire and Release is all the same (because that's the x86 memory model). SeqCst then adds more cost, since you now need system-wide barriers to ensure the total ordering is met.

On ARM, even going from Relaxed to Acquire and Release needs a barrier - and going to SeqCst requires heavier barriers to ensure that the total ordering is met. As the point of using an atomic directly is to get something lighter-weight than the atomics implied by the higher-level primitives like Barrier and Mutex, it's often a bad tradeoff to go from a high-level primitive to a SeqCst atomic.

6 Likes

I think this might be the most common case where higher-level code would use atomics instead of mutexes.

Question: Can't I always build what I need with Acquire and Release? After all, mutexes can be built with those (they don't need SeqCst). And with mutexes I can ensure sequential ordering of accesses where needed.

This makes me believe that using atomics with SeqCst (instead of mutexes) indeed is an optimization for certain corner cases that are very rare.

I would say a reasonable approach in practice is this:

  • If Relaxed is sufficient, use atomics.
  • For anything else use mutexes.
  • If performance is an issue, you may optimize some problems with atomics. But in this case you need to be very careful anyway and exactly follow which guarantees are needed and which synchronizations take place.

Either way, SeqCst is never needed, as sequential consistency can be emulated with mutexes, which, in turn, can be implemented with Acquire/Release ordering, right? Or do I misunderstand something?

2 Likes

Assuming that you don't care about anything other than correctness then Acquire and Release are all you need. There are certain very weird algorithms where the extra guarantee from SeqCst enables you to put bounds on memory consumption, CPU time, and similar non-correctness things.

The reason I say that Acquire and Release are OK if you can justify them is that they're sufficiently scary to put off people who don't know what they're doing, while still good enough to write lock-free algorithms if you know what you're doing. Because the names are a bit strange, and the description is technical and complex, people tend to shy away from them.

The trouble with SeqCst is that it's an attractive nuisance; to the naïve eye it looks like a magic "do what I want" ordering. But most of the time (all but once in my experience of code review - the one exception came with a reference to a peer-reviewed paper showing why SeqCst was worthwhile there), either you only needed Relaxed (on the systems we were going to run that code on, costing around 10x a non-atomic access) and are paying the price of SeqCst, or you're not actually getting the guarantees you expected (notably around the behaviour of non-atomic reads and writes between SeqCst reads and writes).

1 Like

One possibly important difference between Mutex and an atomic is that the former may (if you’re unlucky) block your thread while the latter will never do that. If what you want can be expressed by a single atomic operation, then your code is not only lock-free, it is wait-free (meaning: your thread is guaranteed to make progress if there is a CPU to run it). If you’re spinning on an atomic then chances are that you should either implement a slow path with kernel support, or simply use a Mutex instead — because otherwise you’re implementing a simple spin lock yourself.
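To illustrate that last point, a toy spin lock is only a few lines - which is exactly why it's easy to end up writing one by accident (this sketch is mine; a real Mutex additionally parks the thread on contention instead of burning CPU):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// A toy spin lock: the Acquire on lock pairs with the Release on unlock,
// so writes made inside the critical section are visible to the next
// thread that takes the lock.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // swap returns the old value; keep retrying until we are the one
        // who flipped it from false to true.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop(); // hint to the CPU; still burns cycles
        }
    }

    pub fn try_lock(&self) -> bool {
        // Succeeds only if the lock was previously free.
        !self.locked.swap(true, Ordering::Acquire)
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```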


One remark on x86: The only difference between a “release” and a relaxed write is that the core’s store buffer is drained for the release. This can be noticeable in workloads that are heavy on memory writes, but here we’re definitely in “thou shalt benchmark!” terrain.

2 Likes

One interesting and quite common case that isn't in your first two buckets is reference counting, where Relaxed is usually used for cloning. From the standard library:

fn clone(&self) -> Arc<T> {
    // Using a relaxed ordering is alright here, as knowledge of the
    // original reference prevents other threads from erroneously deleting
    // the object.
    //
    // As explained in the [Boost documentation][1], increasing the
    // reference counter can always be done with memory_order_relaxed: new
    // references to an object can only be formed from an existing
    // reference, and passing an existing reference from one thread to
    // another must already provide any required synchronization.
    //
    // [1]: https://www.boost.org/doc/libs/1_55_0/doc/html/atomic/usage_examples.html
    let old_size = self.inner().strong.fetch_add(1, Relaxed);
    // ...
}
5 Likes

Isn't that suggesting that the access mode should be encoded in the type, rather than passed as a flag to every access function on a universal type? The current API seems immensely accident-prone as well as unnecessarily unapproachable.

Note that it's possible to mix access modes, i.e. the same atomic can be accessed as Relaxed from one thread, for example, and as AcqRel from another.

A key point is that even if you follow the principle of not mixing AcqRel with Relaxed, this only applies within a single concurrent section. It is perfectly valid, even under this practice, to have a single atomic which is used Relaxed at one point, but then after a sync point (e.g. the join in a fork-join workload; the end of a rayon par_iter) gets used in an AcqRel fashion.
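For example (my sketch, using std::thread::scope), a counter can be bumped with Relaxed inside the parallel phase and then read afterwards, because the scope's implicit join is the sync point:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Relaxed is fine during the parallel phase: the counter itself is atomic,
// and thread::scope's implicit join makes the final value visible after.
fn count_in_parallel() -> u64 {
    let hits = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1_000 {
                    hits.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    }); // all spawned threads are joined here
    hits.load(Ordering::Relaxed) // happens-after the joins, so no sync needed
}
```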

The Atomic* types are the building blocks, so they expose the underlying model fairly transparently.

1 Like

One example where access modes are mixed is Arc, which I posted as an example before. Reference decrements use Release in Arc.
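The decrement side can be sketched like this (my simplification of the pattern Arc uses; the real code has more to it):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// The decrement uses Release so that this thread's last uses of the object
// happen-before the count reaching zero; the thread that observes zero then
// issues an Acquire fence before freeing, pairing with all those Releases.
fn release_ref(count: &AtomicUsize) -> bool {
    if count.fetch_sub(1, Ordering::Release) != 1 {
        return false; // someone else still holds a reference
    }
    fence(Ordering::Acquire); // synchronize with every earlier Release decrement
    true // caller may now drop/free the shared object
}
```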

1 Like

I wonder if the docs should/could be extended with a warning, explaining that a SeqCst atomic access isn't giving the same guarantees as accessing a variable through a mutex?

If the docs are being changed, then I think a better warning would be that if you're considering any ordering other than Relaxed, you should prefer one of the higher-level primitives in std::sync unless you're confident that you're getting your atomic handling right.

But phrased by someone who knows how to write clear warnings.

My reasoning behind this is that there are only three types of people who'll read the docs to begin with:

  1. People who don't really know what they're doing, who currently pick SeqCst because it's the "safe" option. These people need to be guided to use Relaxed or a lock instead.
  2. People porting an algorithm from an academic paper to Rust. These people are almost certainly just looking up how Rust spells memory_order_acq_rel or whatever the paper uses, and will, at most, be triggered into looking to see if the thing they're building already exists in the standard library.
  3. People who've read and understood one of the excellent books on the topic, such as Mara Bos's "Rust Atomics and Locks", which was recommended earlier in the thread, and won't care about such a warning because they know what they're doing.
2 Likes

I personally prefer explanations why I shouldn't or should do something, rather than "keep your hands off unless you know what you're doing". The first makes it easier to learn, the second might keep me from doing the wrong thing, but it is also discouraging from learning more (or at least not encouraging to do so).

Alternatively, a combination of both would do.

If you feel like it, maybe you like to create a PR. I'm saying this because your point regarding SeqCst really surprised me, and it might surprise other readers of the std docs as well. But maybe the problem will be solved/improved anyway with the Nomicon PR #378 (which has been indirectly linked above already).

1 Like

If I were to open a PR (I'm at the point where I consider what I said to be "obvious", which generally means I learnt it so long ago that I have no real clue how I reached this understanding, not that other people will come to the same conclusion), I'd need significant help with the language used.

The core point is that SeqCst only gives you extra guarantees on other SeqCst atomic operations. It does not give you anything extra on any other operations.

If I were to try and document this, I'd want to say a few things:

  1. Relaxed should be your "default" choice of ordering - you only need the other choices if you want accesses to this atomic by other threads to imply something about other loads and stores.
  2. If you're going beyond Relaxed, check std::sync for higher-level primitives that provide the semantics you want.
  3. SeqCst's extra guarantees over and above Acquire/Release/AcqRel only apply to SeqCst atomic accesses, and not to all other memory accesses in the program. It is thus rarely what you want, as it is expensive compared to AcqRel and the guarantees are only of use when considering the atomic operations in the program in isolation.
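Point 3 can be demonstrated with the classic store-buffering litmus test (this sketch is mine): with SeqCst, the single total order forbids both loads returning 0, while Acquire/Release or Relaxed alone would allow that outcome.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;

static X: AtomicI32 = AtomicI32::new(0);
static Y: AtomicI32 = AtomicI32::new(0);

// Store-buffering litmus test: under SeqCst there is a single total order
// over these four operations, so at least one load must observe the other
// thread's store; (r1, r2) == (0, 0) is impossible.
fn store_buffering() -> (i32, i32) {
    X.store(0, Ordering::SeqCst);
    Y.store(0, Ordering::SeqCst);
    let t1 = thread::spawn(|| {
        X.store(1, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(1, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    (t1.join().unwrap(), t2.join().unwrap())
}
```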
2 Likes

What (I think) I have learned from this thread is that when I use a mutex, it effectively acts as a memory barrier: anything that has been observed by thread A within or before a critical section where the mutex is locked will also be visible to thread B within or after a later critical section where the same mutex is locked.

This particularly also affects data that is not stored in the mutex. See also @alice's comment to my other question here.

Atomics, in contrast, will not necessarily ensure that. They may ensure it if you use Release and Acquire (or AcqRel or SeqCst), but that depends on which value was actually read by the Acquire load (i.e. whether it was the value written by the Release store, or a value written by its release sequence).
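A minimal sketch of that Release/Acquire visibility guarantee (illustrative names, my code rather than anything from the thread):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn producer() {
    DATA.store(42, Ordering::Relaxed);    // plain data write
    READY.store(true, Ordering::Release); // publish: everything before this
                                          // store is made visible...
}

fn consumer() -> u32 {
    // ...to any thread whose Acquire load observes that store.
    while !READY.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    DATA.load(Ordering::Relaxed) // guaranteed to see 42
}
```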

This is the information that might be helpful for a beginner, I think.

(Not sure if I explained/reflected everything correctly.)

1 Like

Just to make it explicit: the reason that locking a mutex behaves this way is because it internally introduces a Release/Acquire edge between the two locks. Just like atomics, you only get the synchronization if the same Mutex (atomic) is used from both threads.

Rust's Mutex could theoretically only synchronize the contents, since Rust's Mutex is dataful, but Rust has opted to stick with the standard semantics where Mutex synchronization is sufficient for a synchronization barrier for all memory accesses in the thread. (Plus, there's no known hardware where the weaker guarantee would actually be beneficial.)

In C++ this is a clearer relation, since C++ only provides a "raw mutex" and the user is in charge of pairing locks around accesses to the logically protected data.

(Things get wacky when you thread state through multiple atomic locations and try to track causality through them, especially when mixing in Relaxed accesses... so, generally, don't. Stick to reasonably encapsulated chunks of synchronization if at all possible.)

2 Likes

I stumbled on atomics trying to share a number between tasks, thinking a Mutex was overkill. Didn't expect to land in a booby-trapped house :grin:.

This raises the question: are atomics "here be dragons" territory that should not be used for regular application programming? The crate and type naming does not help here: they look like innocent types, yet it's very easy to write wrong code (including code that works on x86 and fails elsewhere, which is a class A footgun).

1 Like

As with all things: it depends.

First of all: you're not going to run into problems with atomics unless you use unsafe. While atomics are themselves not unsafe, using them to synchronize other data is fundamentally unsafe.

I'm fairly confident in saying that any use of an atomic which does not use unsafe only wants Relaxed ordering. Synchronization can only matter if you're using unsafe to pass non-Send/Sync data between threads, as anything Sync doesn't need external synchronization.

If what you want to express is expressible with just the safe API of atomics (e.g. a monotonic counter), then atomics potentially are what you want.

If you're reaching for unsafe, though, it's the same as all other times you feel the call of unsafe: you'd quite potentially be better off finding an existing safe abstraction[1] that does what you want instead.

So in short, yeah, I'd recommend that application code use an existing safe synchronization approach rather than rolling its own. For the Relaxed counter case, that would mean that instead of everyone incrementing the same counter, you use a fork-join structure where each worker thread keeps its own counter and you aggregate them when joining the results together.


  1. And I say this as someone perhaps too confident in reaching for unsafe to bludgeon exotic patterns into Rust which can't be expressed with a pure safe library API. ↩︎
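That fork-join alternative can be sketched like this (my code, illustrative): no shared atomic at all, because each worker counts locally and the totals are combined at the join.

```rust
use std::thread;

// Each worker keeps a plain, thread-local counter; the joins are the only
// synchronization, and they carry the results back.
fn count_without_atomics() -> u64 {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                let mut local = 0u64; // not atomic, not shared
                for _ in 0..1_000 {
                    local += 1;
                }
                local
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```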

3 Likes