Why does Rust not permit CAS operations with Release/Acquire ordering?

I've opened a thread in the internals forum and was advised that here would be place to discuss, so I'll start the same discussion here:

Using compare_exchange on atomic types in Rust with release as success and acquire on failure ordering is not permitted and will fail because "a failure ordering can't be stronger than a success ordering".

Digging around in the LLVM references I found this quote:

The success and failure ordering arguments specify how this cmpxchg synchronizes with other atomic operations. Both ordering parameters must be at least monotonic , the ordering constraint on failure must be no stronger than that on success, and the failure ordering cannot be either release or acq_rel .

However, I don't see why this should include release/acquire ordering. Neither of those is stronger than the other and I've come across some cases where this would be the correct ordering in my assessment. In these cases I would either use acquire-release/acquire ordering or release/relaxed and an acquire fence in the failure branch.

Perhaps not as relevant to Rust, but here is what GCC has to say about this issue (also no mention of a separate load and store part):

If desired is written into *ptr then true is returned and memory is affected according to the memory order specified by success_memorder. There are no restrictions on what memory order can be used here.

Otherwise, false is returned and memory is affected according to failure_memorder. This memory order cannot be __ATOMIC_RELEASE nor __ATOMIC_ACQ_REL . It also cannot be a stronger order than that specified by success_memorder.

To summarize my arguments from the other thread, I don't think that applying terms like stronger or weaker does not have any meaning when talking about acquire and release and so the current restriction should not apply for this specific combination and should be allowed.

You're right that an ordering is not defined between Acquire and Release, but as I've been saying in the other thread, that's because these orderings can't be specified on the same operation anyway. Acquire ordering is only defined for loads, and Release ordering is only defined for stores. So there is never a case where you need to care about the "stronger than" relationship between the two orderings, because that never comes up.

I'd say this is an issue with the terminology: usually, it is simpler to say "not greater" rather than "lesser or equal", and I think that here the authors of the initial spec have been baited into "wrongly" using such terminology. Indeed, the two are only equivalent when there is a total order, which is not the case for atomic memory orderings: one indeed cannot compare Acquire with Release!

Here is my understanding of how the different orderings compare to each other:

So, imho, the spec, or at least the Rust documentation, should be saying someting like:

For failure_memorder, the ordering cannot "contain" Release semantics (as they would make no sense in that context), hence ruling out Release and AcqRel, and must also be weaker or equal to the ordering used for success_memorder.

So that the fact that success: Release, failure: Acquire is rejected fits that description.


You also seem to suggest that this "restriction" is not necessary for a CAS operation, since you think that there are use cases for success: Release, failure: Acquire.

I cannot answer to that part of your claim, as I am not knowledgeable enough regarding the different implementations of atomic CAS operations. I will just say that upgrading success to be AcqRel would fit your use case, although maybe in a suboptimal way?

I mostly agree with you, Acquire and Release should be considered to be of equal "strength", thus when the LLVM docs say " the ordering constraint on failure must be no stronger than that on success", this should not be applied to the case of release/acquire ordering.
I think I also get @jethrogb's argument now, but I don't think you should necessarily read the that specs that way, since they're too ambiguous in that regard.

I guess I will have to try to get an answer from some LLVM mailing list for clarification.

I believe that requiring acquire-release/acquire is unnecessarily strict in certain cases. Also, I think it would be trivially possible to emulate release/acquire ordering by using release/relaxed instead and placing an explicit acquire fence in the failure branch only. But this obfuscates the code in my view and also leads to false thread-sanitizer positives.

1 Like

What I understand from the docs is that on success both a load and a store happen. It specifies that if you only set one, the other is Relaxed.

Thus, success: Release equals success store Release, load: Relaxed. Now on failure there is only a load. It cannot be stronger than the load on success, hence only allowed here is Relaxed.

I think what you are looking for is success: AcqRel, failure: Release.

Looks like the API design is rather confusing. Maybe letting people setting the levels for load and store would make more sense than success/failure, but then I haven't looked in to deeply about why it is this way. I Just quickly read the docs, so don't take my word for it.

A little opinion, note that it is seldomly correct to choose AcqRel over SeqCst. An excellent explanation for this is Herb Sutter's talk. Warning, it's three hours and you might need to watch it several times to fully grasp it:

1 Like

I think the reason why compiler backends have this "ordering must be strictly stronger" rule is that it is needed if you want atomics implementation to be straightforward.

No matter if hardware has an actual CAS instruction or emulates it using load-linked/store-conditional pairs, the idea is always that when there is a CAS in the code, the hardware will perform a load, check if it meets expectations, and if so perform a store (with some memory bus magic in the middle to prevent/detect racey writes). Therefore, if you want to ensure a certain load ordering on failure (where only the load is performed) without adding any other instruction, you must ensure at least this level of load ordering on success (where the load is followed by a store).

Sure, if you really wanted to have an Acquire load on failure and a Relaxed load on success, you could do a Relaxed CAS, check if it failed, and if so insert an Acquire fence... but atomics are supposed to translate into a very simple and efficient stream of hardware instructions, and silently adding branches to the code depending on which ordering was specified definitely doesn't match this expectation. Besides, a Relaxed load followed by an Acquire fence imposes stronger synchronization than a true Acquire load would (it prevents moving previous loads after the Acquire fence). For all these reasons, I'm personally glad that one needs to explicitly write the code down if this behavior is wanted.

I agree that the wording used by these compiler documentations is confusing however, and I would personally phrase it as "the load ordering on failure cannot be stronger than the one on success", clarifying that this is solely about load ordering, not store ordering.

2 Likes

What I don't understand, failure can only be detected at runtime, after the load has already taken place.

So the load ordering must be exactly the same for success and failure....?

You're right, now that I think about it. In the picture that I painted above, if a weaker load ordering is specified on failure, the compiler should be forced to strengthen it to match the load ordering that was specified in the successful case. So why, then, would we be allowed to specify a weaker load ordering on failure than on success?

Cross-checking a mapping of C++11 atomics to hardware instructions gives us a possible explanation. In this table, I see that "cmpxchg" can translate into a loop on architectures with a LL/SC model, which initially surprised me, but makes sense as soon as one remembers that store-conditional failure can happen in more cases than atomic compare-exchange transaction failure alone (which means that here we'd be looking at the "strong" flavor of C++11 compare-exchange, not the "weak" one).

As soon as you have a loop (and thus branches) in the generated hardware instruction stream, compilers can choose to use different memory barriers depending on which sides of the loop's inner branches are taken. Perhaps this could actually allow the compiler to make the "success" load ordering stronger than the "failure" one?

I was wondering along those lines, but since I haven't looked into it deeply enough, I just turned it into a question. The problem is that it's a load, so acquire semantics mean you can move instructions from before to after the load, not the other way around. The branching in any case happens after the load has taken place, so I have a hard time imagining how it's possible to use a different ordering, but I'm by no expert. I would argue that the documentation by no means allows users untrained in the arcane arts of atomics to have any clue about what's really happening and how to use the API correctly, that's why I would propose to default to SeqCst, especially as the performance benefit of AcqRel is questionable at best and even guaranteeing that the code stays correct is rather hard.

Note that the CPU can also reorder, not only the compiler, but even then it would have to check for failure conditions before the load and I doubt that's being done, since the result normally depends on instructions in other threads, so it cannot be done CPU local or cache local.

Consider the AArch64 implementation of relaxed cmpxchg in those mappings, for example:

_loop:
    ldxr roldval, [rptr]
    cmp roldval, rold
    b.ne _exit
    stxr rres, rnewval, [rptr]
    cbnz rres, _loop
_exit

If you really wanted to have Relaxed loads on failure and Acquire loads on success, you could add an Acquire fence (which according to those mappings can be implemented as dmb ish ld) on the successful path, after the cbnz rres, _loop instruction.

This fence would not be executed if the comparison fails and the b.ne _exit branch in the middle is taken.

Now, it is dubious to me whether there is a realistic situation where this would actually be beneficial to performance, given that acquire fences are more heavyweight than acquire loads, and that lock-free algorithms are usually only efficient in low-contention scenarios where their inner cmpxchg succeeds most of the time. But hey, if there's an obscure edge case where this could possibly be beneficial, the C++11 model (which Rust inherits) obviously must account for it :wink:

There are two broad viewpoints on this question, which periodically clash on the internet.

One viewpoint states that as the strongest atomic ordering, SeqCst is most likely to be correct in a given situation, and should therefore always be used unless there is a proven performance benefit in using a weaker ordering.

Another viewpoint, which I personally adhere to, states that this policy of using SeqCst ordering as a default does more harm than good:

  • If you don't have a performance problem, you don't need atomics (just use a lock), hence being worried about the present and future performance of atomics is justified.
  • If you don't understand the synchronization implications of atomics well enough to see what the various memory orderings are useful for, you probably aren't ready to use atomics to synchronize other data than their own payload (which is what non-Relaxed orderings are about).
  • The set of algorithms for which Acquire/Release message passing is insufficient but SeqCst global program order is correct is small, and the examples which are most commonly cited (e.g. Peterson's lock) are toy algorithms that no one ever uses.
  • The algorithms that do rely on SeqCst's global program order for correctness are very hard to reason about, as verifying them requires enumerating all possible SeqCst instruction interleavings in program execution (remember that for N instructions, there are N! possible interleavings).
  • Using SeqCst as a "catch-all" ordering makes code ambiguous, and obscures the author's synchronization intent. In a program built according to this policy, seeing a SeqCst read-modify-write instruction does not tell you whether its intent is to merely modify the atomic's inner value (Relaxed), send data to other threads (Release), receive data from other threads (Acquire), exchange data with other threads (AcqRel), or establish some kind of global program execution order (the only thing that's truly specific to SeqCst). Hence the correctness of code that uses SeqCst everywhere is harder to verify.
4 Likes

Thanks for the very clear explanation.

As for the SeqCst question, I think I lean towards the first view. It's quite easy to know that you can use Relaxed when all you need is to prevent torn reads/writes, eg on a counter that has to produce a unique value with fetch_add, so I do use that.

From what I remember from Herb Sutter's talk, at least on x86_64 the only difference between AcqRel and SeqCst is that you can inverse the order of a release followed by an acquire. That's why I concluded that while there surely are situations in which this can enable some optimisations, they're probably slim.

That doesn't mean I think people shouldn't use AcqRel, but yes I think some benchmarks are in order and clear comments so everyone reading the code can verify why it's correct. Chances are probably slim as well that AcqRel introduces bugs (but pairing them correctly is important now if you sync using more than one memory location), but for us mere mortals the obviously vague "If you use SeqCst your program will do what you expect it to" is attractive.

2 Likes

I guess what I find most interesting when comparing those two worldviews is the different locations where people put the "atomics danger zone".

In the "SeqCst considered safest" view, as I understand it, atomics seem to be considered a relatively harmless language feature per se, and use of weak memory orderings delimitates the expert-only region where things start becoming scary.

In the "SeqCst considered harmful" view, as I express it, the danger zone is instead entered as soon as one uses atomics to synchronize data other than the atomic value itself, and once you got there, weak memory ordering doesn't add that much extra risk.


In any case, as Sutter points out in his talk, the main (but not only) reason why we have explicit atomic memory orderings are "relaxed" memory architectures like ARM or Power or, if you're feeling masochistic, that old DEC Alpha CPU whose crazy cache architecture defined the lowest common denominator of the Linux memory model :upside_down_face: .

On x86, Acquire and Release are free at the hardware level, only SeqCst comes at a cost, so weaker memory orders are only useful for compiler optimizations (many of which AFAIK aren't implemented yet in LLVM, which is currently super-paranoid about atomics but someday, hopefully, that will improve) and keeping your hardware portability options open (which in those days where lots of interesting ARM experiments like ThunderX and A64FX are being carried out may not be a bad idea).

2 Likes

As I have mentioned, I asked this question on the llvm-dev mailing list, the thread can be found here. From what I can gather the same question regarding the relative "strenghts" of release and acquire has come up before and has even be addressed by an proposal to adjust the C++ standard by removing the requirement that the failure ordering may be no stronger than the success ordering.

I am not 100% sure what to make of the linked LLVM source code, though. The isStrongerThan table lists both release -> acquire and acquire -> release as false, i.e., neither is stronger than the other, but the subsequent isAtLeastOrStrongerThan table lists the same relationships. Given this information, however, I don't think the argument that release/acquire should be prohibited because on failure it's actually a relaxed store does not hold any merit.

Thanks for linking the C++ proposal. That's an interesting read that explains it quite well. As I interpret it, the current API is mainly due to specifics for certain platforms that provide a CPU instruction for compare-and-exchange and not because of platforms where it's constructed with loads and stores.

Is there even a point in restricting success / failure orderings? On architectures with load-linked/store-conditional instructions the load and store are distinct instructions which can each have their own memory ordering (with appropriate leading/trailing fences if required), whereas architectures with compare-and-exchange already have a limited set of instructions to choose from. The current limitation (assuming [LWG2445] is resolved) only seems to restrict compilers on load-linked/store-conditional architectures.

I imagine the API was chosen to create a consistent story between both types of architectures, limiting the more flexible to the more restrictive.

However, this doesn't make sense. On failure, there is no store! There's only a load, so you can only compare it to other loads. So when you specify success: Release, failure: Aquire, the load for success is Relaxed, which is weaker than the load for failure, thus rejected by the current API.

Sorry I phrased that wrong, I was just trying to recap the argument I've read here and in the previous internals thread for justifying the behavior as it is currently implemented (i.e., panic when using compare_exchange with Release and Acquire) as briefly as possible.
What I am saying is, whether the load on success is only Relaxed or not is irrelevant for deciding whether this should be a "legal" combination or not, it only matters what the relevant standards says, or, if you are more practically minded, how LLVM implements that standard.