Can an unused atomic load with `Acquire` ordering ever be reordered or optimized out?

Heya! When I write atomic code I often use a pattern where I spin on a Relaxed load of some lock, and then later use a full fence(Acquire) to ensure the value the lock protects is visible to the current thread.

I've recently been making some updates to my collection of atomic data structures, and I've been wondering if I can switch to using a lock.load(Acquire) after the lock.load(Relaxed) loop to ensure Acquire ordering. In this case the Acquire load's result would be unused, and I've gotten mixed responses about whether or not such a load can be optimized out. My own view is that in theory the load itself should be able to be optimized out, but the Acquire semantics must still apply as if the load had occurred at that location in the code.

Here's an example of a very common code pattern I have for insert-only structures:

```rust
#![feature(maybe_uninit_uninit_array, sync_unsafe_cell)]

use std::sync::atomic::{AtomicUsize, Ordering};
use std::cell::SyncUnsafeCell;
use std::mem::MaybeUninit;

fn main() {
    const N: usize = 64;

    // Locks indicate if a value has been initialized
    // 0 - Free
    // 1 - Writing
    // 2 - Initialized
    let locks: [AtomicUsize; N] = std::array::from_fn(|_| AtomicUsize::new(0));

    // Cells are uninitialized
    let cells: [MaybeUninit<SyncUnsafeCell<u32>>; N] = MaybeUninit::uninit_array();

    std::thread::scope(|s| {
        for _ in 0..2 {
            s.spawn(|| {
                // Observer thread
                for ii in 0..N {
                    // Spin until value is ready
                    while locks[ii].load(Ordering::Relaxed) != 2 {
                        std::hint::spin_loop();
                    }

                    // Ensure value is visible on this thread
                    std::sync::atomic::fence(Ordering::Acquire);

                    // Desired >>> THIS IS THE QUESTION <<<
                    //locks[ii].load(Ordering::Acquire);

                    // Read value
                    unsafe {
                        core::ptr::read_volatile(SyncUnsafeCell::raw_get(cells[ii].as_ptr()));
                    }
                }
            });

            s.spawn(|| {
                for ii in 0..N {
                    // Fight to get first ownership of value
                    if locks[ii].compare_exchange(0, 1,
                            Ordering::Relaxed, Ordering::Relaxed).is_ok() {
                        unsafe {
                            SyncUnsafeCell::raw_get(cells[ii].as_ptr()).write(5);
                        }

                        locks[ii].store(2, Ordering::Release);
                    }
                }
            });
        }
    });
}
```

The code above uses a monotonically increasing AtomicUsize that progresses from 0 (free) to 1 (value is being initialized) to 2 (value is initialized). The 1 -> 2 transition requires Release ordering because we're using the lock to protect unrelated memory (the cell).

For readers, it is only safe to read the cell if the lock has been observed to be 2 (initialized) and an Acquire fence has occurred.

The question is: is it safe to use locks[ii].load(Acquire) rather than fence(Acquire) here? In theory this is a weaker barrier and could result in better code generation.

My view would be that yes, it should be fine, as the Acquire semantics should apply regardless of whether the load's result is actually used. However, I can't find any concrete wording that confirms those are the actual rules.

Note: I'm not really interested in what compilers currently do, but rather in what they are allowed to do. Right now it appears that LLVM never optimizes out an atomic load, even a duplicate or Relaxed one, although I'd say that merging Relaxed loads should be allowed in theory. The question really comes down to how this behaves when ordering is at play.

As to why I use a Relaxed loop followed by a fence instead of just using Acquire loads in the loop: a Relaxed load is almost always cheaper, so code generation is often better on weakly ordered systems.

Let me know if there's any more clarification I should add! I know it's kinda a weird question!

Example codegen for 3 ways of accomplishing this, each dramatically different: Compiler Explorer

-B

What's the difference between __sync_val_compare_and_swap_4 and __sync_synchronize? I'm guessing the ideal assembly would be like the fence but with bl __sync_val_compare_and_swap_4.

AArch64 is probably a better example (Compiler Explorer) since it doesn't use intrinsics. In that case there's a dedicated acquire load instruction which can be used, whereas using a fence emits a full-fledged dmb ishld (barrier instruction).


I think this is fine. Atomics guarantee read-read coherence, which effectively means that locks[ii].load(Ordering::Acquire) will see 2. The value doesn't need to be checked, as the program already knows it at that point, but the load still acts as an acquire load.

I think an optimizer can choose to eliminate the dead load by replacing the ordering in the spin loop with acquire ordering, per the as-if rule. See N4455 No Sane Compiler Would Optimize Atomics.