Question on rust atomic and locks: why does this load need to be `Acquire`

In this lazy initialization example, the author uses Acquire ordering to load the value of PTR, on line A.

use std::sync::atomic::AtomicPtr;
use std::sync::atomic::Ordering::{Acquire, Release};

fn get_data() -> &'static Data {
    static PTR: AtomicPtr<Data> = AtomicPtr::new(std::ptr::null_mut());

    let mut p = PTR.load(Acquire);   // <---- line A

    if p.is_null() {
        p = Box::into_raw(Box::new(generate_data()));
        if let Err(e) = PTR.compare_exchange(
            std::ptr::null_mut(), p, Release, Acquire
        ) {
            // Safety: p comes from Box::into_raw right above,
            // and wasn't shared with any other thread.
            drop(unsafe { Box::from_raw(p) });
            p = e;
        }
    }

    // Safety: p is not null and points to a properly initialized value.
    unsafe { &*p }
}
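For reference, the snippet above can be made self-contained and runnable like this; the Data struct and the body of generate_data are placeholder stand-ins invented here, not from the book:

```rust
use std::sync::atomic::AtomicPtr;
use std::sync::atomic::Ordering::{Acquire, Release};
use std::thread;

struct Data {
    value: u64,
}

// Placeholder: stands in for whatever expensive initialization the book implies.
fn generate_data() -> Data {
    Data { value: 42 }
}

fn get_data() -> &'static Data {
    static PTR: AtomicPtr<Data> = AtomicPtr::new(std::ptr::null_mut());

    let mut p = PTR.load(Acquire);
    if p.is_null() {
        p = Box::into_raw(Box::new(generate_data()));
        if let Err(e) = PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire) {
            // Lost the race: free our allocation and use the winner's pointer.
            drop(unsafe { Box::from_raw(p) });
            p = e;
        }
    }
    // Safety: p is non-null and points to initialized data.
    unsafe { &*p }
}

fn main() {
    // Every thread must observe the same initialized Data at the same address.
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| get_data() as *const Data as usize))
        .collect();
    let addrs: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert!(addrs.windows(2).all(|w| w[0] == w[1]));
    assert_eq!(get_data().value, 42);
    println!("ok");
}
```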

She later explains that the use of Acquire is to prevent the compiler from reordering the initialization of the data with the store operation.

This is the original text:

If we leave the
strict memory model aside and think of it in more practical terms, we could say that
the release ordering prevents the initialization of the data from being reordered with
the store operation that shares the pointer with the other threads. This is important,
since otherwise other threads might be able to see the data before it’s fully initialized.

From my understanding, the author tries to avoid this reordering:

    // ...

    // Now p may be null
    let mut p = PTR.load(Acquire);   // <---- line A

    if p.is_null() {
        if let Err(e) = PTR.compare_exchange(
            std::ptr::null_mut(), p, Release, Acquire
        ) {
            // ...
        }
        // This line is executed after PTR.compare_exchange now
        p = Box::into_raw(Box::new(generate_data()));

        // ...
    }

But how is it possible? This line

p = Box::into_raw(Box::new(generate_data()));

does not access any variables shared between threads, and it has a happens-before relationship with the following call because they are in the same thread.

PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire) 

As the author says,

The basic happens-before rule is that everything that happens within the same thread
happens in order. If a thread is executing f(); g();, then f() happens-before g().

I can only persuade myself with these two reasonings:

  1. Although p = Box::into_raw(Box::new(generate_data())); has a happens-before relationship with PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire) in the same thread, other threads do not share this relationship and so may observe the reversed order. This is weird: from the perspective of the thread that calls PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire), PTR is set to point to the initialized data, and no matter how other threads see it, PTR is guaranteed to be non-null if PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire) succeeds.
  2. The acquire-load (let mut p = PTR.load(Acquire);) and release-store (PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire)) pairing just improves correctness by preventing the compiler from making incorrect assumptions and reordering the instructions somehow.

no, as you said, this reordering is not possible because the compare_exchange() has a Release success order.

this load-acquire actually prevents later dereferences of the loaded pointer from being reordered before the load.

i.e.:

let mut p = PTR.load(Relaxed);
// suppose the loaded pointer is NOT null
if p.is_null() {
    // this branch is irrelevant
}
let data = unsafe { &*p };
// without the acquire, this data access may see not-fully-initialized data
// even if the loaded pointer is non-null.
println!("{}", data);

one alternative, if you want to use a Relaxed load for the pointer itself, is to use an acquire fence before dereferencing the pointer, but I don't think this makes any difference practically:

let mut p = PTR.load(Relaxed);   // <---- line A
if p.is_null() {
    //...
    PTR.compare_exchange(null_mut(), p, Release, Relaxed)
    //...
}
fence(Acquire); //<--- synchronize with the success `Release` of the CAS
unsafe { &*p }
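fleshed out into a compilable sketch (Data and generate_data are placeholders invented here), that fence variant would look something like:

```rust
use std::ptr::null_mut;
use std::sync::atomic::{fence, AtomicPtr};
use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};

struct Data(u64);

// Placeholder initializer.
fn generate_data() -> Data {
    Data(7)
}

fn get_data() -> &'static Data {
    static PTR: AtomicPtr<Data> = AtomicPtr::new(null_mut());

    let mut p = PTR.load(Relaxed); // <---- line A, now Relaxed
    if p.is_null() {
        p = Box::into_raw(Box::new(generate_data()));
        if let Err(e) = PTR.compare_exchange(null_mut(), p, Release, Relaxed) {
            // Lost the race: free our allocation and use the winner's pointer.
            drop(unsafe { Box::from_raw(p) });
            p = e;
        }
    }
    // Synchronizes with the success `Release` of the CAS in whichever
    // thread won the race, so the data is visible before we dereference.
    fence(Acquire);
    unsafe { &*p }
}

fn main() {
    assert_eq!(get_data().0, 7);
    println!("ok");
}
```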

I think the ordering of the pointer load or fence can be weakened to the C++ memory_order_consume in this case, which, in theory, could gain you some performance, but rust doesn't support the consume order, and even C++ deprecated it, as it could only bring a marginal theoretical performance gain, if any, and all (known) compilers don't bother to "properly" implement it and just promote it to memory_order_acquire anyway.

Release ordering does prevent that reordering. My question is: if let mut p = PTR.load(Acquire); used Relaxed instead, would the reordering be possible? IMO the answer is no.

Sorry, I don't understand: how can this dereference see not-fully-initialized data? It appears to me that when p is not null, it should point to the data published by the compare_exchange call.

The author's original words seem to be about reordering the initialization of the data with the compare_exchange operation.

If we leave the
strict memory model aside and think of it in more practical terms, we could say that
the release ordering prevents the initialization of the data from being reordered with
the store operation that shares the pointer with the other threads. This is important,
since otherwise other threads might be able to see the data before it’s fully initialized.

short answer is cache coherency.

the whole point of memory order is to establish the "happens-before" relationship.

in the writer thread, the initialization of the data "happens-before" the store to the pointer because of the Release order of the success CAS.

in the reader thread, you must pair it with an Acquire (I'll ignore Consume altogether) load, to ensure the (atomic) load of the pointer "happens-before" any access of the data through that pointer.

the Acquire order prevents not only the compiler from reordering, but also affects hardware execution, usually in the form of memory barrier instructions.

suppose the pointer load is relaxed: although the compiler SHOULD not reorder the data load before the pointer load, due to the data-flow dependency, on very weakly ordered architectures the hardware can still speculatively issue the data access instructions out of order, before the load of the pointer has even finished.

but even on architectures where such aggressive OOO scheduling is not implemented, you can still see stale data if you don't use an Acquire load or fence. for example, suppose you use Relaxed to load the pointer and the loaded address is 0x12345678; you then load some data through this pointer. that load might hit the local cache of the core, but the contents may be inconsistent with what the other core pushed to DRAM.


I think this paragraph is talking about the Release in the success case of CAS, but you are asking about the Acquire load, or am I misunderstanding your question?

both the Release store and the Acquire load are essential for the correctness of this implementation. so the paragraph is correct, but it doesn't directly answer your question.

the pointer load "synchronizes-with" the pointer store, Release ensures data initialization "happens-before" the pointer store, Acquire ensures the pointer load "happens-before" data access, all together they ensure the data is initialized before being accessed.

This is not true.

At the moment this is executed it might not appear to be shared. However, this is writing to the memory location that will be pointed to by p (implicitly, through the Box::new(...)), and p will later be shared, so that memory location will also become shared between threads.

However, just observing the value of p is not enough to ensure you'll observe the value that was written to its memory location. In particular, in a thread that loads a non-null pointer, the two writes (of the data to the memory location, and of the pointer to the static variable) might appear reordered, which would cause that thread to read uninitialized data when reading through the pointer. Hence why it needs an Acquire ordering to properly synchronize with the code writing to the memory location.


I was just thinking about cache coherency. If the reader thread's let mut p = PTR.load(Acquire); used Relaxed ordering, it might just read from the thread's own cache. The PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire) and let mut p = PTR.load(Acquire); pair guarantees that all threads see PTR pointing to the initialized data.

Thank you for your detailed explanation!

You are correct. It is important for other threads to see that p points to fully initialized data.

I think my previous confusion is rooted in this:

From the point of view of the writer thread that wins the compare_exchange race and points p to the initialized data,

p = Box::into_raw(Box::new(generate_data()));   // Line A

happens before

PTR.compare_exchange(std::ptr::null_mut(), p, Release, Acquire)  // Line B

But line A compiles to more than one cpu instruction, and the compiler and/or cpu is allowed to reorder these instructions with line B if B does not use Release ordering.

For example, say line A consists of two instructions: one pointing p at the generated data, and one initializing the data. The compiler/cpu may reorder the instructions to

1. instruction_that_points_p_to_generated_data
2. PTR.compare_exchange(std::ptr::null_mut(), p, Relaxed, Acquire)
3. instruction_that_initializes_generated_data

After instruction 3 finishes, an observer on the same thread can always safely assume p now points to fully initialized data, even though instruction reordering took place.

Now if one thread has just finished instruction 2 and another thread comes in and finds that p is not null, that other thread will see uninitialized data.

Also, as @nerditation points out, even if the writer thread has finished initializing the data, other threads may just use the data in their core-local cache, e.g. a store buffer, leaving the generated data unsynchronized between the threads. To make other threads see the newest value the writer thread just wrote, we have to pair an Acquire load with the Release compare_exchange.

Am I understanding it right?

I think you get the general idea; just some details might not be accurate. For instance, it is irrelevant how many "instructions" are in your generate_data() function, and the possibility of "reordering" is a consequence of OOO execution and inconsistency between multiple cpu cores. You may think of it as the result of some "logical" order, but physically it doesn't necessarily mean the hardware "executes" the instructions in one order or another: modern processors do NOT "execute" instructions sequentially.

you can have a mental model about the order of instruction execution, if it helps you understand, just be aware the actual hardware is way more complicated than that.

general rule of thumb: Relaxed atomic operations will NOT create "happens-before" relations; Relaxed operations only keep the same atomic variable itself consistent, so Relaxed is suitable for use cases like counters or simple boolean flags. for data-dependent cases like locks or atomic pointers, use Release on the writer side and Acquire on the reader side.

note, the term "happens-before" is specific to concurrent scenarios. when I say Relaxed operations won't create "happens-before" relations, that doesn't mean the compiler will generate "wrong" code: the compiler must still respect the "program order"; it's just that the "program order" is only guaranteed within the same thread.

btw, x86 implements a very strong memory order (TSO), where plain loads and stores already behave like Acquire loads and Release stores, so if you want to see how the different memory orders affect the generated machine code, you should check architectures with weaker ordering, such as riscv, powerpc, etc. here's an example of the assembly code for atomic loads with different orderings on the riscv architecture; you can check what kind of memory barriers are generated for each case:


Thanks, I learned a lot from your reply. I think I have a better mental model of ordering now :slight_smile:

This discussion shows why "reordering" is a really really bad way to think about atomics. It is more confusing than it is useful.

It's much better to think about "seeing old values". That is, when you read a memory location, you are not guaranteed to see the latest value written to that location. The acquire/release orderings help guarantee you don't see values that are too old. (This could be because your cpu cache contains an old value. The cache line must be flushed to get the newest value.)

The rule for acquire/release is that when an acquire load reads the value written by a release store, then all writes to any location that happened before the release write are guaranteed to be visible to anything that happens after the acquire load.

And in this case, the &*p happens after the acquire load, so it must see the write of generate_data() to that location since that write happened before the release store. Without the acquire/release, the &*p operation might see whatever data was there before generate_data() was written to that memory.
