Happens-Before Relationship by Relaxed memory ordering in single and multiple threads

My understanding of the happens-before relationship after reading Mara Bos's book is that Relaxed memory ordering only ensures this relationship within a single thread; it does not establish a happens-before relationship across multiple threads.

Here is a code snippet taken from that excellent book:

use std::sync::atomic::{AtomicI32, Ordering::Relaxed};

static X: AtomicI32 = AtomicI32::new(0);
static Y: AtomicI32 = AtomicI32::new(0);

fn a() {
    //operation1
    X.store(10, Relaxed);
    //operation2
    Y.store(20, Relaxed);
}

fn b() {
    //operation3
    let y = Y.load(Relaxed);
    //operation4
    let x = X.load(Relaxed);
    println!("{x} {y}");
}

The function a() is executed by thread1 and the function b() is executed by thread2.

How can the output 0 20 be possible from this code snippet?

20 is the value of y. 20 is stored to Y in thread1 by operation2.

Operation3 must execute after that store for thread2 to load 20 from Y into y.

0 is the value of x. X is set to 0 during static initialization, which happens before either thread runs.

Operation4 loads X into x, and X must still be 0 at that point to match the output.

So the execution order of the operations would have to be 2 -> 3 -> 4 -> 1. However, since Relaxed memory ordering preserves the happens-before relationship within thread1, operation1 happens before operation2, so this order does not seem accurate. As a result, the output 0 20 seems impossible in practice.

I think the output 0 20 is still possible in theory if thread2 observes the effect of operation2 but not the effect of operation1.

Theoretically, the output 0 20 is possible because there is no happens-before relationship between thread1 and thread2. This reasoning abstracts away the nitty-gritty details of what happens inside the threads: since the threads are not ordered with respect to each other, thread2 may observe operations 1 and 2 as if they happened in the opposite order.

What would be a more precise, sane, and logical explanation for how this code snippet can produce the output 0 20?

As an example of a real-world setup where you could see 0 20, imagine the two atomics living in two different pages of memory, on a system with out-of-order execution and weak memory-ordering semantics. Note that this is simplified, but possible on AArch64.

In the thread running a, the write to X finds that this CPU core does not have exclusive ownership of the memory location for X. It starts a cache coherency transaction to get ownership of that location, and tries to keep executing until it hits a dependency on the write to X completing. This allows it to write to Y before it has written to X, because there's no dependency involved that would block this.

In the thread running b, the CPU core finds it doesn't yet have shared ownership of the memory location for Y; it starts a cache coherency transaction to get shared ownership of Y, and moves on, looking for more work it can do. It finds it has shared ownership of the memory location for X, and satisfies that load. It then receives the cache coherency request asking it to hand over ownership of the location for X to the other CPU core, and does so. Finally, it waits to finish getting shared ownership of the location for Y; once it has that, it finishes the load of Y, and then can continue to call println!, which depends on both reads completing, and therefore can't start until they've both finished.

Edit:

The important thing to be aware of is that the CPU and compiler are free to execute your code out of program order as long as two things hold:

  1. Within a thread, you cannot observe the out-of-order execution; the system is set up so that whenever you do something that, on a single thread, would observe a change in ordering, that change can't happen.
  2. Between threads, there's no breach of the "happens-before" relationships, even though one thread may be able to observe that a different thread was executed out of order.
  • Relaxed ordering doesn't ensure any happens-before relationship.
  • Operations in a single thread are always subject to the happens-before relationship as defined by program source order.
  • The only thing that Relaxed ordering guarantees is that the modification of a variable indeed happens atomically, i.e. there is no tearing; and for more complex compare-and-swap or fetch-modify operations, the whole sequence really happens atomically, with nothing interfering in the middle (e.g. between the read and the modification).
  • On most (all? I'm not sure) processors, aligned loads and stores of primitive processor types are always atomic. In particular, AtomicU32::load(Relaxed) on x86/x86-64/aarch64 is just a simple load, and similarly AtomicU32::store(Relaxed) is just a simple store, like with any other variable in your code.
  • You misunderstand what happens-before means. It doesn't mean that the compiler will literally output operations in the same order. It just means that the compiler must act as if the operations really happened in order, thus putting bounds on possible observable behaviour. In particular, there is no way that loads of independent variables can observe each other (what would that even mean?), thus loads can happen in any order, regardless of the kind of operations you use.
  • Enforcing specific order for loads and stores requires volatile accesses. They are entirely unrelated to atomics and multithreading. Volatile atomic accesses are a separate concept which doesn't currently exist in Rust.
  • Basically, the happens-before relationship only matters when talking about loads and stores. If a load and a store to a variable do not have a happens-before relationship in either direction, then you have a data race. But it doesn't make much sense to talk about happens-before between only loads or only stores, because it doesn't affect observable behaviour. In fact, if you're only doing loads you don't even necessarily need to use atomic accesses, e.g. there should be no problem in mixing atomic and non-atomic loads from the same variable (but of course you need to make sure that neither of those race with any writes).

That's a ridiculously overcomplicated example. The 0 20 result can already happen on x86, regardless of the variables' placement in memory. The stores are independent (constants stored to independent memory locations), so they may be freely reordered by the CPU, regardless of whatever source-level tricks you do. Similarly, the loads can be freely reordered. And since the loads and stores happen in different threads, there is no happens-before relationship between them (except that a load of a variable must read some value previously stored to it), thus the operations may happen in any sequence, including 2-3-4-1.


The 0 20 result cannot happen on x86, because the x86 memory model does not have a pure Relaxed load or store; all loads on x86 are at least Acquire ordering, and all stores on x86 are at least Release ordering, and x86 systems must prevent reorderings that are invalid for the happens-before relationships that this implies. The compiler could reorder the stores, but is unlikely to.

Similarly, on practical AArch64 CPUs (I did check this example against ARM's models), you will never see reordering inside a single cache line, nor will you see reordering where two cache lines are already in L1 cache; you need the cache coherency transaction to cause the first access to be delayed but not the second access, otherwise the CPU core will never take advantage of its freedom to reorder, since it only practically reorders where there is an advantage to doing so.

And I felt that, rather than a blanket assertion of "reordering is allowed", which just leads to "but why would a CPU reorder?", it'd be clearer to use an example of reordering where the reason for doing so is that latency of an operation is hidden - for the AArch64 CPU I checked against, L1 cache latency is 3 cycles, while L2 cache latency (assuming the line is already in L2 cache) is 12 cycles, and thus the reordering allows the second atomic to complete while the first is still in-flight (since the first atomic starts at time 0 and takes 12 cycles, while the second atomic starts at time 1 and takes 3 cycles, thus finishing before the first one).


Just because operation 1 happens before operation 2, that does not mean that thread B will see the result of operation 1 before it sees the result of operation 2. The write to X could get "delayed" causing thread B to see the write to Y but not the write to X, even though the write to X happened first on thread A.

Yes, 1 happens before 2. But thread B can still see 2 before it sees 1. Forget execution orders completely. You cannot use them to reason about this kind of thing. Thinking about writes becoming visible to other threads in a delayed manner is a much less confusing way to think about it.


Thank you for your post.

The way I understand "delayed" is that it doesn't necessarily mean the write to X is actually slower than the write to Y, or that there is a time delay in the usual sense; rather, the effect of the write to X is not observed by thread B (or takes longer to become visible than the write to Y), because Relaxed ordering does not ensure a happens-before relationship between threads. Does this understanding align with your perspective?

The delay is in the sense that thread B reads an old value even though the write has already finished. It's not because the write is "slow" or "still ongoing" on thread A. It's just that thread B has some optimization to make reading the value cheaper that causes it to see an old value.

This wording of "seeing an old value" also makes sense in the context of release/acquire. What those orderings guarantee is this: If an acquire load sees the value written by a release store, then anything that happens after the acquire load is guaranteed to see anything that happened before the release store.

So by this example, if you change the operations on Y to be acquire/release instead of relaxed, then 0 20 cannot happen. This is because X.load(Relaxed) comes after the acquire load, so it must see the X.store(10, Relaxed) because the store to X comes before the release store. (Of course this assumes that Y.load(Relaxed) returned 20. Otherwise it did not see the value written by the store, so the guarantee does not apply.)

