Can I rely on rustc to eliminate duplicated atomic loads?

This is a question with regard to rustc's internal optimization heuristics.

When using certain generics- or macro-based techniques, one frequently runs into certain code patterns. After inlining and expansion, those patterns produce duplicated atomic loads within a single expression, which, assuming no interfering atomic store from another thread, could safely be deduplicated.

One example would be:

use std::sync::atomic::*;

trait Visitor {
    type Node;
    fn should_visit(node: &Self::Node) -> bool;
    // .......
}

struct Node {
    available: AtomicBool,
    visited: AtomicUsize,
}

struct Visitor1;
impl Visitor for Visitor1 {
    type Node = Node;
    fn should_visit(node: &Node) -> bool {
        node.available.load(Ordering::Relaxed)
    }
}

struct Visitor2;
impl Visitor for Visitor2 {
    type Node = Node;
    fn should_visit(node: &Node) -> bool {
        node.available.load(Ordering::Relaxed) || node.visited.load(Ordering::Relaxed) < 10
    }
}

fn combined_visit<N, V1: Visitor<Node = N>, V2: Visitor<Node = N>>(node: &N) {
    if V1::should_visit(node) || V2::should_visit(node) { // When inlined and expanded, this expression performs two atomic loads of node.available, one of them redundant!
        //..........................
    }
}
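
After monomorphizing combined_visit with Visitor1 and Visitor2 and inlining, the condition expands to roughly the following (a hand-expanded sketch, reusing the definitions above):

fn combined_visit_expanded(node: &Node) {
    // Two atomic loads of node.available in one expression; absent an
    // interfering store from another thread, the second load returns
    // the same value as the first.
    if node.available.load(Ordering::Relaxed)
        || (node.available.load(Ordering::Relaxed)
            || node.visited.load(Ordering::Relaxed) < 10)
    {
        //..........................
    }
}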

If those were not atomic loads but plain loads of non-thread-safe data, we could reasonably expect rustc to deduplicate the expression to some extent. However, atomic operations complicate the question, and the memory ordering complicates it further.

Well, after adding std::hint::black_box to the above minimal example, the answer seems to be no: even this most simplistic case is not deduplicated.
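
A sketch of how such a check can be set up, assuming the branch is kept alive with black_box and the emitted assembly is then inspected by hand:

pub fn check(node: &Node) {
    // black_box keeps the branch observable so it is not optimized away;
    // the emitted assembly then shows how many times node.available is
    // actually loaded.
    if Visitor1::should_visit(node) || Visitor2::should_visit(node) {
        std::hint::black_box(());
    }
}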

Then my question would be: how does one work around this restriction to achieve a zero-cost abstraction? Unsafely dereferencing a raw pointer seems to do the trick, but I wonder if there is a cleaner way.

The compiler should not deduplicate the loads: the value can be mutated by some other thread between the two loads.

Zero-cost abstraction means the cost of the abstraction itself can be zero. When the implementation has some cost, you can't magically make it zero-cost by abstraction.

2 Likes

I just spent some time digging into the topic. A relevant talk seems to be CppCon 2016: JF Bastien, "No Sane Compiler Would Optimize Atomics".

It seems that under certain scenarios, C++ compilers are allowed to optimize atomic operations, assuming that certain types of data races do not occur and that the optimization does not introduce new races. I haven't dug deep enough into the exact definition, but that would most likely help in the scenario I posted above.

When working with atomics, there is no such thing as a "redundant load" that can be optimised out.

One of the core rules around optimisers is that you can only make a change as long as you don't change the observable behaviour of the program. However, it's always possible for another thread to change the atomic value between V1::should_visit() and V2::should_visit(). That means removing the second atomic load would take you from seeing a change in that atomic variable to no longer seeing it, which would modify your control flow and be an observable change in behaviour.

If it's important to you to avoid the second load at all costs, explicitly load that atomic into a local variable at the start and pass its value into your should_visit() functions. There's no other way to optimise the second load out.
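
For example, reusing the Node type from the original post, that workaround could look like the following sketch (VisitorByValue and combined_visit_hoisted are hypothetical names):

use std::sync::atomic::Ordering;

// Hypothetical variant of the trait that receives already-loaded values
// instead of performing the atomic loads itself.
trait VisitorByValue {
    fn should_visit(available: bool, visited: usize) -> bool;
}

fn combined_visit_hoisted<V1: VisitorByValue, V2: VisitorByValue>(node: &Node) {
    // Each atomic load is performed exactly once, up front.
    let available = node.available.load(Ordering::Relaxed);
    let visited = node.visited.load(Ordering::Relaxed);
    if V1::should_visit(available, visited) || V2::should_visit(available, visited) {
        // ...
    }
}

Note that this loads visited eagerly even when the first check would have short-circuited; whether that trade-off matters depends on the use case.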

5 Likes

I believe this is false. The compiler is allowed to optimize this. It's just that LLVM doesn't currently optimize it.

Yes, it is allowed that the value gets changed between the two loads by another thread. But it is also allowed that the change in the other thread happens before or after the two loads, as there is no synchronization with the other thread between the two loads. All three orderings are allowed, and the compiler is allowed to pick whichever one it wants, so it is free to prevent the change in the other thread from happening between the two loads. Thus, it is allowed to optimize this down to a single load.
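
In other words, a rewrite like the following sketch is permitted under the as-if rule, even though LLVM does not currently perform it:

use std::sync::atomic::{AtomicUsize, Ordering};

fn two_loads(x: &AtomicUsize) -> (usize, usize) {
    // As written: two relaxed loads. Another thread's store may become
    // visible before, between, or after them.
    (x.load(Ordering::Relaxed), x.load(Ordering::Relaxed))
}

fn merged(x: &AtomicUsize) -> (usize, usize) {
    // A legal as-if rewrite of two_loads: it behaves like an execution
    // in which no store landed between the two loads.
    let v = x.load(Ordering::Relaxed);
    (v, v)
}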

The compiler is not allowed to optimize away read_volatile operations, but it is allowed to optimize away atomic load operations.
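
For contrast, a volatile version must keep both reads (a sketch):

use std::ptr;

unsafe fn volatile_twice(p: *const usize) -> (usize, usize) {
    // Volatile reads count as side effects: the compiler must perform
    // both and may not merge them the way it may merge relaxed atomic loads.
    unsafe { (ptr::read_volatile(p), ptr::read_volatile(p)) }
}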

4 Likes

If this were true, it would be impossible to implement a loop that polls an atomic until it has a specific value. It would also break seqlocks. And atomics have no load_volatile operation, which would be needed in that case.
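
For concreteness, the kind of polling loop in question (a sketch):

use std::sync::atomic::{AtomicBool, Ordering};

// Spin until another thread sets the flag.
fn wait_for(flag: &AtomicBool) {
    while !flag.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
}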

The compiler is definitely allowed to merge two adjacent loads of the same value under the as-if rule. This is because having both loads see the same value is an allowed execution of a program with two adjacent loads.

Merging all the loads in such a polling loop isn't an allowed optimization, because it takes a piece of code without an infinite loop and introduces an infinite loop, so it changes the behavior of the program.

Whether such a loop is guaranteed to work or not depends on whether there is a forward progress guarantee for each thread or whether a thread is allowed to be starved forever by other threads.

In C++ this depends on the platform (it's implementation-defined in the standard). On some platforms there is such a guarantee, on others there isn't. I would guess it's the same for Rust, since Rust doesn't have its own official memory model and it defers its atomic semantics to C++. So on those platforms that have the guarantee, the busy wait is guaranteed to work correctly and the compiler isn't allowed to change it. On those platforms that don't have the guarantee you may indeed get an infinite loop if you do such a busy wait.

If the other thread has actually changed the value, the C++ model of atomics (which also applies to Rust) guarantees that it must be visible to atomic reads in other threads eventually, so the compiler can't turn the busy wait into an infinite loop in that scenario. Although technically this might also be implementation-defined because the word "should" (rather than "shall") is used:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

What about seqlocks, though? There you read a piece of data in between two reads of an atomic, and you know the read was consistent if the atomic read the same value twice (and the value was even, indicating that no write was in progress).

If the compiler is allowed to eliminate one of those loads, seqlocks break.
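
For reference, a minimal read-side sketch following the usual seqlock recipe (names are hypothetical, and the payload is a single atomic here so the example stays free of data races):

use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct SeqLock {
    seq: AtomicUsize,  // odd while a write is in progress
    data: AtomicUsize, // the protected payload
}

impl SeqLock {
    fn read(&self) -> usize {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            let value = self.data.load(Ordering::Relaxed);
            // The fence keeps the payload read from being reordered past
            // the second counter read.
            fence(Ordering::Acquire);
            let s2 = self.seq.load(Ordering::Relaxed);
            // Accept only if no write was in progress or intervened.
            // If the compiler merged the two counter loads, a torn read
            // could go unnoticed.
            if s1 == s2 && s1 % 2 == 0 {
                return value;
            }
        }
    }
}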

You would use non-relaxed Ordering for this. The Ordering is there to enforce a certain order of reads and writes, and in this scenario the compiler isn't allowed to reorder the operations in a way that breaks inter-thread synchronization, so it will have to read the atomic twice.

Right, it wasn't clear to me from your original post that you were only talking about relaxed ordering. Then it makes sense that it would be allowable to remove some reads.

I would have to think about what rules apply to what combinations of memory orderings otherwise. It isn't entirely clear to me how it would work in each situation.

It's true with any ordering. However, with the seqlock there are other operations between the two loads, and those extra operations change the situation.

1 Like
