Misunderstanding Acquire/Release memory ordering

Please explain Acquire/Release to me. I have read the definitions and examples many times, but one thing still escapes me: the guarantees.
What I understand 100% is that it has nothing to do with thread or instruction locks. Atomics block threads at the instruction level, and there are also system locks at the kernel level that block threads explicitly.
What it does concern is barriers and the ordering of memory accesses. By default these accesses may execute in a chaotic order, so we can order them with Acquire/Release.

For example, the following code:

pub fn change(&self, f: impl FnOnce(&mut T) -> T) {
    let load_data = self.ptr.load(Acquire);
    let changed_data = f(&mut unsafe { &*load_data }.clone());
    self.ptr.store(Box::into_raw(Box::new(changed_data)), Release);
}
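
(For context, self.ptr presumably lives on a struct roughly like the one below; Shared is a made-up name, but ptr must be an AtomicPtr<T> for these load/store calls to work, and T must be Clone.)

use std::sync::atomic::AtomicPtr;

// Hypothetical container for the change() method above.
pub struct Shared<T> {
    ptr: AtomicPtr<T>, // points at a leaked Box<T>
}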

The change function loads the old data, changes it, and then stores the changed data (the intent being that the whole update is atomic).

Based on this code, I will explain my problem.

So I have a perhaps false impression of how this works: when the first thread executes an acquire-load, I imagined that a parallel thread wanting to execute the same acquire-load would see that the data in memory is already locked (I assumed the barrier means exactly that) and would wait (that very waiting) until the data is unlocked by some Release.

But understandably this is not the case. If it were, and the Release never happened, the second thread would wait forever, which would look strange. What actually exists are only the so-called guarantees, and they do not block threads in any way, neither through memory, nor through instructions, nor by any other means.
The only guarantee given in this code is: if the second thread observes the data changed by another thread's store-Release in change, then the two lines before that store definitely happened, i.e. the other thread has definitely executed them.

But because of this misconception and uncertainty I can't fully accept it: does this mean a parallel thread can execute its self.ptr.store before the first thread does, causing an inter-thread data race that can leave us with stale data? And Acquire/Release has no effect on this; all it guarantees is "if another thread sees the second thread's self.ptr.store, the two lines before it definitely happened", which seems unimportant compared to atomic locks.

That is, I still have to check the data when I do the store, because between the load and the store in one thread (which happen in strict order, thanks to our unimportant guarantees) another thread could already have changed the atomic variable in parallel.
That is, it still needs to be:

pub fn change(&self, f: impl Fn(&mut T) -> T) {
    let mut load_data = self.ptr.load(Acquire);

    loop {
        let changed_data = f(&mut unsafe { &*load_data }.clone());
        match self.ptr.compare_exchange(
            load_data,
            Box::into_raw(Box::new(changed_data)),
            Release,
            Relaxed,
        ) {
            Ok(_) => break,
            Err(e) => load_data = e,
        }
    }
}

to ensure that the data is not stale.

There is no locking of threads, and atomics by themselves don't prevent race conditions. Atomic operations are very primitive. But they can be used to implement abstractions such as locking (mutexes, etc.), which can prevent data races. Atomic operations affect the behavior of both the compiler and the hardware, but they don't interact with the OS.

Very roughly: acquire/release (with load/store) prevent the compiler or the hardware from reordering memory reads and writes across those instructions, such that one processor (which you can think of as a thread, but it's really a CPU core) doing a load-acquire sees the changes to memory made on another processor up to the point of its store-release. This dependency only arises when the same atomic variable is used for the load and the store, and the load actually observes the value that was stored.

Because they cache memory values, CPUs sometimes have to communicate with each other to make this work, which is why there is a cost to synchronizing with atomics on some CPU architectures. Unfortunately the details of what they do are CPU-architecture specific, which is why we have to follow the more abstract rules in the C/C++ memory model and the Rust memory model, which happen to be the same.

I know you said you have read everything, but I still think it is worth reading the Acquire-Release section in the nomicon, if you haven't already; it explains this fairly well, I think.

How to use atomics to implement your own algorithms is another topic, and it really requires study. Looking at existing code, like a very simple spin lock implementation (see the sketch below), is one way to get a feeling for it.
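
For instance, a deliberately naive sketch (the SpinLock name and API are made up for illustration; a real one would add an RAII guard and hold actual data):

use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Acquire on success: everything the previous holder wrote before
        // its Release in unlock() is visible once we take the lock.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        // Release: our writes inside the critical section become visible
        // to the next thread whose lock() succeeds with Acquire.
        self.locked.store(false, Ordering::Release);
    }
}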


(Assuming ptr is in a sane initial state and only the shown function is used. (Using a single function/branch like this is uncommon.))
In the first code there is no data race, but there is a race condition, which is what you seem to be pointing at when you talk about stale data.

In the second code you create a data race (not allowed in correct code) by using Relaxed as the failure ordering; it needs to be fixed to Acquire.
Allowing the closure to be called multiple times also opens up the possibility of extra side effects. (Down to the implementation.)

What you do in both versions is leak memory. There is no easy fix (as far as I can see) without a spin lock or other tracking. Your unsafe code is only OK because the leaked value is only ever read (by clone).

A possible common alternative is swapping the pointer for null. When the swap returns non-null you have gained the value exclusively, so you can Box::from_raw it.
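
A rough sketch of that idea (the function shape is made up; a thread that sees null has to retry until the value is put back, so this is effectively a tiny spin lock):

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

pub fn change<T>(ptr: &AtomicPtr<T>, f: impl FnOnce(&mut T)) {
    loop {
        // Swapping in null claims exclusive ownership of the old box.
        let old = ptr.swap(ptr::null_mut(), Ordering::Acquire);
        if !old.is_null() {
            // We are the only holder, so reconstructing the Box is fine
            // and the old allocation is reused instead of leaked.
            let mut boxed = unsafe { Box::from_raw(old) };
            f(&mut *boxed);
            // Publish the updated value for the next claimant.
            ptr.store(Box::into_raw(boxed), Ordering::Release);
            return;
        }
        std::hint::spin_loop(); // someone else holds the value; retry
    }
}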


I don't know what you mean, but atomics only affect memory access instructions. the processor is free to do its out-of-order thingy, as long as no data dependency exists, just as usual.

what atomics really do at the instruction level is issue a memory barrier (the type of which is specified by the memory order).

if you find the rust documentation is not enough to understand the details, you can read the memory order topic for C++; they are basically the same (minus the consume ordering for atomic loads, which is not widely used by C++ developers anyway).

that's completely false. atomic memory access is not "locked" by any means, and the performance impact really comes from the memory barrier/fence. in the following description, I assume you already have a good idea of the processor's cache hierarchy.

NOTE: below is a conceptual model; the hardware can utilize clever cache coherence protocols to achieve the same semantics while trying very hard to reduce the performance cost whenever possible.

when a Release store (or explicit fence) is executed by the processor, it will flush all the dirty cache lines (not just the line containing the target "store" location) to ensure all memory writes (atomic or not) before this Release are "globally visible". this is potentially very expensive (because RAM has very high latency compared to the processor's clock frequency), and in practice it will stall the entire processor.

accordingly, when an Acquire load (or explicit fence) is executed, the processor will mark all cache lines as invalid. this operation in and of itself might not be as expensive as a Release, but later memory loads will miss the cache and have to reach main memory.

on the compiler side, the memory order (or explicit fences) affects the optimizer too: specifically, it prohibits the compiler from re-ordering certain memory accesses, hence the term "memory order".

in particular, when there's a Release store, the compiler is not allowed to move any earlier stores (as they appear in program order) to positions after the atomic store; similarly, when there's an Acquire load, the compiler is not allowed to move later loads to before the atomic load.

note, the "synchronize-with" relation only applies to atomic load/store pairs on the same location, and only when the atomic load sees the value "released" by the atomic store.

using an atomic value as a "ready flag" in a producer/consumer scenario (a runnable sketch follows the list):

  • in the producer thread, the store of the data "happens before" the "releasing" the atomic value,
  • and in the consumer thread, the "acquiring" of the same atomic value "happens before" the loading of the data,
  • transitively, this guarantees the order that the production of the data "happens before" the consumption of the data.

Reads (loads) are always "free". There is no synchronization for any memory reads [1]. It is only writes (stores) that cause synchronization by invalidating non-local caches. The acquire-release semantics are for memory ordering, which is a related concept but not synchronization itself.

Load-acquire means that memory accesses after the load will not be reordered to before the load. Similarly, store-release means that memory accesses before the store will not be reordered to after the store. Here's a helpful diagram from Acquire and Release Semantics:

[diagram: a load-acquire and a store-release act as one-way barriers that surrounding memory operations cannot cross, as described above]

TL;DR: The memory orderings prevent the compiler and CPU from reordering memory accesses.
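
As an annotated sketch of those one-way rules (illustration only; DATA uses Relaxed just to keep the snippet in safe Rust):

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn producer() {
    DATA.store(1, Ordering::Relaxed);     // (A)
    READY.store(true, Ordering::Release); // (B) one-way barrier: (A) may
    // not be reordered after (B), but anything written after (B) may
    // still be moved before it.
}

fn consumer() -> Option<u64> {
    let ready = READY.load(Ordering::Acquire); // (C) one-way barrier:
    let value = DATA.load(Ordering::Relaxed);  // (D) may not be moved
                                               // before (C).
    if ready { Some(value) } else { None }
}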


  1. With the obvious exception of cache misses.


Perhaps the explanation of acquire/release in chapter 3 of Rust Atomics And Locks by Mara Bos will be of use to you. That chapter on memory ordering is what finally made it click for me.

In fact that entire (free) ebook is well worth reading if you are interested in or working with low-level concurrency.


This book was my first experience with low-level parallelism; it is where I picked up the misunderstanding.
