I would like to better understand when to use which memory ordering. Of course, I could always use SeqCst, but I would like to get a better understanding of which orderings are really needed.
Maybe the best advice is to re-read the C++ reference on that matter over and over and go through (self-made) examples to get a better understanding? Or is there a good website that explains this in a simpler way, ideally with a graphical visualization of what happens?
I feel like the more you learn about atomics, the more you conclude that using SeqCst everywhere except in the simplest cases (or in well-reviewed, performance-sensitive library code) might be the sanest approach to avoid mistakes.
whose API abstracts away the fact that atomics were used ↩︎
Well, I would like to be able to write such library code one day. E.g. to improve things like radiorust::sync::broadcast_bp, which will not expose how it internally works and currently uses mutexes. (Not sure if it's easily possible to get rid of the mutexes anyway, as I didn't pursue it further because atomics scared me.)
In that particular example, performance isn't that important though (as usually larger chunks are being sent), but I'm still interested in learning more about atomics (mostly for the sake of self-study and simply expanding my knowledge, and not because I really need them).
So I guess it's similar to unsafe (though unsafe comes with the direct risk of UB): Sometimes it can make sense to not use these optimizations, just for the sake of making the code easier to read and verify.
So perhaps where I use a Mutex to be "safe", I could instead use atomics with SeqCst, even if there's a little overhead coming with that.
But this raises a question: What's worse, performance-wise: a Mutex (which internally uses Acquire/Release) or an atomic which uses SeqCst? I.e.: Is it even worth moving from mutexes to atomics when you only use SeqCst?
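To make the comparison concrete, here is a minimal sketch of the same shared counter written both ways. The function names `count_with_mutex` and `count_with_atomic` are made up for illustration, and the code makes no claim about which variant is faster; that would need to be measured on the target platform and workload.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Increments a Mutex-protected counter from several threads.
/// (Hypothetical helper, only for this comparison.)
fn count_with_mutex(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Locking and unlocking the Mutex provides the synchronization.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

/// Same counter, but as a single atomic with SeqCst everywhere.
/// (Hypothetical helper, only for this comparison.)
fn count_with_atomic(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // One atomic read-modify-write; no lock is taken.
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::SeqCst)
}

fn main() {
    println!("mutex:  {}", count_with_mutex(4, 10_000));
    println!("atomic: {}", count_with_atomic(4, 10_000));
}
```

Both versions produce the same total. The atomic version only works this simply because the protected state is a single integer; a Mutex can guard arbitrary data.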
Rust uses the C++ model for atomics, so if they disagree, the C++ rules take precedence.
That said, the rules you mention here are phrased in terms of "so and so optimization can't happen", but this kind of language can never be a precise way to describe the rules that atomics follow. It is only used for summaries. Read the actual rules to see what the actual rules are.
I'll be sure to read the "Formal description" (from the C++ reference) more thoroughly again. However, it would be nice if the two documentations didn't disagree.
Even though the Rust documentation doesn't explicitly say that the given guarantees are the only guarantees, it is confusing, and I think this should be fixed. Unless it can be shown that both guarantees are equivalent?
However, before suggesting any improvement, I want to keep watching and re-reading until I have a better understanding myself.
I think both summaries are attempts at explaining this image from the link I just posted:
Ultimately, due to the as-if rule, you cannot ever guarantee that some reordering doesn't happen in the final binary, as long as it behaves as-if they weren't reordered. The only way to tell from the program behavior whether something is reordered after a store is to try to access it from after an acquire load in another thread that synchronizes with the release store. This makes the two descriptions equivalent.
So if I understand it right, then because I cannot observe whether the stronger guarantee for Release holds without actually using an Acquire in the other thread, the two descriptions are equivalent.
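A minimal sketch of the Release/Acquire pairing described above, assuming a hypothetical `message_passing` function with one producer and one consumer thread. The payload is itself an atomic accessed with Relaxed, so the only thing ordering it is the Release store and Acquire load on the `ready` flag.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical example: passes a value from a producer to a consumer.
fn message_passing() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Written before the Release store below.
            data.store(42, Ordering::Relaxed);
            // Release: everything above happens-before any Acquire load
            // that reads `true` from this flag.
            ready.store(true, Ordering::Release);
        })
    };

    let consumer = thread::spawn(move || {
        // Spin until the Acquire load observes the Release store.
        while !ready.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // The store of 42 is now guaranteed to be visible.
        data.load(Ordering::Relaxed)
    });

    producer.join().unwrap();
    consumer.join().unwrap()
}

fn main() {
    println!("consumer read {}", message_passing());
}
```

The guarantee exists only because the consumer's Acquire load reads the value written by the producer's Release store. Without that pairing there is nothing in the program that could observe where the store to `data` ended up relative to the store to `ready`.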
I don't have a good enough overview of this to really understand it (yet). In any case, I would like to note that Rust's documentation says:
Release: When coupled with a store, all previous operations become ordered before any load of this value with Acquire (or stronger) ordering.
Note the "this value" constraint. I.e. if I have a Release store of one variable in one thread, and an Acquire load of another variable in the other thread, then there would be no guarantees about orderings at all (following the Rust documentation), while the C++ reference does give guarantees.
Maybe this is still not observable in theory, but like I said, I don't have a good enough overview yet to reason about that.
Maybe I can elaborate on why talking about reorderings is not a good way to talk about atomics. In the physical reality, you might have thread 1 perform two writes to two different atomics, and have them become visible in the order A,B on thread 2, but become visible in the order B,A on thread 3. Were A and B reordered or not? The question is not meaningful.
This is why the formal description talks about whether the side effects are visible or not. All of its guarantees have the form "in so and so situation, X happens-before Y", meaning that the side effects of X are visible to Y.
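Here is a rough sketch of what "visible" means in practice, assuming a hypothetical `observe` function with one writer and two readers. Each reader loads `b` once and then `a`. The two readers may return different snapshots, and the only thing the model promises is: a reader whose Acquire load of `b` saw the Release store must also see the earlier store to `a`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// What one reader observed: the value of `b`, then the value of `a`.
type Snapshot = (u32, u32);

/// Hypothetical example: one writer, two readers that each look once.
fn observe() -> [Snapshot; 2] {
    let a = Arc::new(AtomicU32::new(0));
    let b = Arc::new(AtomicU32::new(0));

    let spawn_reader = |a: Arc<AtomicU32>, b: Arc<AtomicU32>| {
        thread::spawn(move || {
            // If this load reads 1, it synchronizes with the writer.
            let seen_b = b.load(Ordering::Acquire);
            let seen_a = a.load(Ordering::Relaxed);
            (seen_b, seen_a)
        })
    };
    let reader1 = spawn_reader(Arc::clone(&a), Arc::clone(&b));
    let reader2 = spawn_reader(Arc::clone(&a), Arc::clone(&b));

    let writer = thread::spawn(move || {
        a.store(1, Ordering::Relaxed);
        // Release: the store to `a` happens-before any Acquire load
        // of `b` that reads this 1.
        b.store(1, Ordering::Release);
    });

    writer.join().unwrap();
    [reader1.join().unwrap(), reader2.join().unwrap()]
}

fn main() {
    for (i, (seen_b, seen_a)) in observe().into_iter().enumerate() {
        println!("reader {}: b = {}, a = {}", i + 1, seen_b, seen_a);
    }
}
```

A snapshot with `b == 1` and `a == 0` is ruled out. Everything else is allowed, including the two readers disagreeing with each other, which is why asking about the one true order of the writes is not a useful question.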
In other words, "reality" (as in observable behavior) is different for each thread? So it doesn't really matter what really happens (as in physical reality or in the underlying machine), but only what's observable.
The cited part of the C++ reference isn't part of the "Formal description" anyway. I will finish the video, and later try to understand the formal description and the implications better.
Yes, what is important is making sure that your program has the correct happens-before relationships in the formal model. As long as you ensure that this is the case, then your program is guaranteed to work correctly on any kind of CPU no matter how unspeakably weird the physical reality might be.
I recommend watching Herb Sutter's <atomic> Weapons talks. They'll likely give you at least some insight, although that insight might just be "this shit is deep". It doesn't help that x86/64 is a famously strongly ordered architecture, essentially causing two things:
- spending time reasoning about acq/rel is unlikely to net you performance gains over seq-cst
- if you do use acq/rel incorrectly it's unlikely to manifest as a visible bug
These days, of course, ARM is a fairly big deal, and the situation is different there. But if you develop on x86/64, you're likely to miss acq/rel bugs that manifest fairly often on ARM.
A Mutex that isn't under high contention should generally be quite fast, as a lot of the overhead of a Mutex doesn't get involved when it's not currently locked. It will obviously vary across Mutex implementations though. There's some more analysis of performance of a Windows library Mutex under various degrees of contention here
There are quite a few posts on that blog you may want to read if you're interested in wrapping your head around atomics in general; Alice already linked to one.