Better understanding atomics

I feel like the more you learn about atomics, the more you conclude that using SeqCst everywhere but in the most simple cases (or in well-reviewed and performance-sensitive library code[1]) might be the most sane approach to avoid mistakes.
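A toy sketch of that default (made-up counter; every access just uses SeqCst, so there is no ordering to reason about):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

// Default every access to SeqCst; only revisit the orderings once the
// code is well-reviewed and actually performance sensitive.
fn record_event() {
    COUNTER.fetch_add(1, Ordering::SeqCst);
}

fn events_so_far() -> usize {
    COUNTER.load(Ordering::SeqCst)
}
```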


  1. whose API abstracts away the fact that atomics were used ↩︎

13 Likes

Well, I would like to be able to write such library code one day, e.g. to improve things like radiorust::sync::broadcast_bp, which doesn't expose how it works internally and currently uses mutexes. (Not sure if it's easily possible to get rid of the mutexes anyway; I didn't pursue it further because atomics scared me.)

In that particular example, performance isn't that important though (as usually larger chunks are being sent), but I'm still interested in learning more about atomics (mostly for the sake of self-study and simply expanding my knowledge, not because I really need them).


1 Like

You could give the Crust of Rust episode on atomics a go:

12 Likes

So I guess it's similar to unsafe (though unsafe comes with the direct risk of UB): Sometimes it can make sense to not use these optimizations, just for the sake of making the code easier to read and verify.

So perhaps where I use a Mutex to be "safe", I could instead use atomics with SeqCst, even if that comes with a little overhead.

But this raises a question: what's worse, performance-wise: a Mutex (which internally uses Acquire/Release) or an atomic which uses SeqCst? I.e. is it even worth moving from mutexes to atomics when you only use SeqCst?
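To make the comparison concrete, here is a toy sketch of the two alternatives I mean (made-up flag, nothing else protected by it):

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicBool, Ordering};

// Variant 1: a Mutex-guarded flag; the lock provides all the ordering.
static READY_LOCKED: Mutex<bool> = Mutex::new(false);

fn set_ready_locked() {
    *READY_LOCKED.lock().unwrap() = true;
}

// Variant 2: an atomic flag with SeqCst; no lock, strongest atomic ordering.
static READY_ATOMIC: AtomicBool = AtomicBool::new(false);

fn set_ready_atomic() {
    READY_ATOMIC.store(true, Ordering::SeqCst);
}
```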

1 Like

Your post inspired me to research what a mutex even guarantees and when you actually need SeqCst, and I found an interesting example in this issue: AtomicCell: Why do you use SeqCst? · Issue #317 · crossbeam-rs/crossbeam · GitHub

1 Like

I think that's related to what I asked here:

I'm still confused overall, but will try to continue watching the video to finally get a better understanding.

1 Like

He points out something at 1:14:43, which is that what's found in the C++ reference

memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.

isn't found in the Rust documentation:

Release: When coupled with a store, all previous operations become ordered before any load of this value with Acquire (or stronger) ordering.

The C++ reference seems to be stricter than the Rust guarantees.

This is surprising, because the Rust documentation says in a different place that

Rust atomics currently follow the same rules as C++20 atomics, specifically atomic_ref.

1 Like

Rust uses the C++ model for atomics, so if they disagree, the C++ rules take precedence.

That said, the rules you mention here are phrased in terms of "so and so optimization can't happen", but this kind of language can never be a precise way to describe the rules that atomics follow. It is only used for summaries. Read the actual rules to see what the actual rules are.

4 Likes

So the normative source is this, then?

https://en.cppreference.com/w/cpp/atomic/memory_order

1 Like

Yes. In particular, it is the section titled "Formal description".

2 Likes

One article on atomics I quite like is this one: The Synchronizes-With Relation

4 Likes

I'll be sure to read the "Formal description" (from the C++ reference) more thoroughly again. However, it would be nice if documentation wouldn't disagree. :sweat_smile:

Even though the Rust documentation doesn't explicitly say that the given guarantees are the only guarantees, it is confusing, and I think this should be fixed. Unless it can be shown that both guarantees are equivalent?

However, before suggesting any improvement, I want to keep watching and re-reading until I have a better understanding myself.

1 Like

I think both summaries are attempts at explaining this image from the link I just posted:

[image: diagram from the linked article showing a release store in one thread synchronizing-with an acquire load in another]

Ultimately, due to the as-if rule, you cannot ever guarantee that some reordering doesn't happen in the final binary, as long as the program behaves as if nothing was reordered. The only way to tell from the program's behavior whether something was reordered after a store is to try to access it from after an acquire load in another thread that synchronizes with the release store. This makes the two descriptions equivalent.
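As a concrete sketch of that last point (made-up data/flag names): the only experiment that can detect a "reordering" is an acquire load in another thread pairing with the release store.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // write "before" the store below
        READY.store(true, Ordering::Release); // release store publishes DATA
    });
    let t2 = thread::spawn(|| {
        // Acquire load: once it reads `true` it synchronizes-with the
        // release store, so the earlier write to DATA must be visible.
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42); // guaranteed to pass
    });
    t1.join().unwrap();
    t2.join().unwrap();
}
```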

7 Likes

So if I understand it right, then because I cannot observe whether the stronger guarantee for Release holds without actually using an Acquire in the other thread, the two descriptions are equivalent.

I don't have a good enough overview of this to really understand it (yet). In any case, I would like to note that Rust's documentation says:

Release: When coupled with a store, all previous operations become ordered before any load of this value with Acquire (or stronger) ordering.

Note the "this value" constraint. I.e. if I have a Release store of one variable in one thread, and a Acquire load of another variable in the other thread, then there would be no guarantees about orderings at all (following the Rust documentation), while the C++ reference does give guarantees.

Maybe this is still not observable in theory, but like I said, I don't have a good enough grasp of it to reason about that yet.

1 Like

Maybe I can elaborate on why talking about reorderings is not a good way to talk about atomics. In physical reality, you might have thread 1 perform two writes to two different atomics, and have them become visible in the order A,B on thread 2, but in the order B,A on thread 3. Were A and B reordered or not? The question is not meaningful.
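As a sketch of the shape of that situation (it's the classic "independent reads of independent writes" (IRIW) litmus test, with made-up names):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static A: AtomicBool = AtomicBool::new(false);
static B: AtomicBool = AtomicBool::new(false);

fn thread_1() { A.store(true, Ordering::Release); }
fn thread_2() { B.store(true, Ordering::Release); }

// Thread 3 may observe (true, false): A became visible first...
fn thread_3() -> (bool, bool) {
    (A.load(Ordering::Acquire), B.load(Ordering::Acquire))
}
// ...while thread 4 observes (true, false) the other way around: B first.
fn thread_4() -> (bool, bool) {
    (B.load(Ordering::Acquire), A.load(Ordering::Acquire))
}
// With Acquire/Release, both outcomes are allowed at the same time; only
// making all six accesses SeqCst would force one global order of A and B.
```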

This is why the formal description talks about whether side effects are visible or not, and states its guarantees in the form "in such-and-such situation, X happens-before Y", where "X happens-before Y" means that the side effects of X are visible to Y.

2 Likes

In other words, "reality" (as in observable behavior) is different for each thread? So it doesn't really matter what really happens (as in physical reality or in the underlying machine), but only what's observable.

The cited part of the C++ reference isn't part of the "Formal description" anyway. I will finish the video, and later try to understand the formal description and the implications better.

1 Like

Yes, what is important is making sure that your program has the correct happens-before relationships in the formal model. As long as you ensure that this is the case, then your program is guaranteed to work correctly on any kind of CPU no matter how unspeakably weird the physical reality might be.

3 Likes

I recommend watching Herb Sutter's atomic<> Weapons talks. They'll likely give you at least some insight, although that insight might just be "this shit is deep". It doesn't help that x86/64 is a famously strongly ordered architecture, which essentially causes two things:

  • spending time reasoning about acq/rel is unlikely to net you performance gains over seq-cst
  • if you do use acq/rel incorrectly it's unlikely to manifest as a visible bug

These days, of course, ARM is a fairly big deal, and the situation is different there. But if you develop on x86/64, you're likely to miss acq/rel bugs that manifest fairly often on ARM.
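A made-up sketch of such a bug, where a Release/Acquire pair was weakened to Relaxed:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn publish() {
    DATA.store(42, Ordering::Relaxed);
    READY.store(true, Ordering::Relaxed); // BUG: should be Release
}

fn consume() -> Option<usize> {
    if READY.load(Ordering::Relaxed) { // BUG: should be Acquire
        // x86/64's strong hardware ordering usually masks this (though the
        // compiler may still reorder), so tests tend to pass there; on ARM
        // this load may legitimately return 0.
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}
```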

8 Likes

A Mutex that isn't under high contention should generally be quite fast, as a lot of the overhead of a Mutex doesn't come into play when it isn't currently locked. It will obviously vary across Mutex implementations, though. There's some more analysis of the performance of a Windows library Mutex under various degrees of contention here.

There are quite a few posts on that blog you may want to read if you're interested in wrapping your head around atomics in general; Alice already linked to one.

3 Likes

Just as an initial note -- on x86_64 processors, using SeqCst will very rarely actually behave any differently than AcqRel, as x86_64 processors have strong ordering guarantees that (at the current time, anyway) cannot be relaxed, so the difference only matters to the compiler. With correct usage of AcqRel, the impact on what freedom the compiler has is small as well.

If you are replacing Mutex<usize> with AtomicUsize and you are not using compare_exchange loops (fetch_update), then using the atomic will generally perform marginally better. Any improvement will likely stay marginal even with Relaxed, though, as the "critical section" where you hold the lock is so small that the chance of one thread trying to lock the mutex while another thread holds it is very low.
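For concreteness, a sketch of that kind of replacement (made-up counter):

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicUsize, Ordering};

// Before: the entire critical section is one increment.
static HITS_LOCKED: Mutex<usize> = Mutex::new(0);

fn bump_locked() {
    *HITS_LOCKED.lock().unwrap() += 1;
}

// After: a single atomic read-modify-write, no lock to contend on.
static HITS: AtomicUsize = AtomicUsize::new(0);

fn bump() {
    HITS.fetch_add(1, Ordering::SeqCst);
}
```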

If you require using fetch_update, it becomes much more of a toss-up, and depends a lot on the use case. If the update is "fast," then an atomic (even with SeqCst) would likely be marginally better. If the update is "slow" enough that multiple updates "race" with each other, using a Mutex will typically perform better, especially under high load, because in general spinning isn't great.
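And a sketch of the fetch_update case (made-up saturating counter; under contention the closure reruns until the underlying compare-exchange succeeds):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static VALUE: AtomicUsize = AtomicUsize::new(0);

// A "fast" update: the closure is cheap, so occasional retries are cheap too.
// Returns Ok(previous) on success, Err(current) if the limit was reached.
fn saturating_bump(limit: usize) -> Result<usize, usize> {
    VALUE.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| {
        if v < limit { Some(v + 1) } else { None }
    })
}
```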

If you're using the Mutex to guard more than just what can be expressed solely by Atomic's API surface, you'll be hard pressed to beat using a Mutex no matter what ordering you use. What's much more important is structuring your cross-thread communication to minimize lock contention, and comparatively it doesn't matter whether the communication channels use mutual exclusion or atomics to communicate.

8 Likes