Understanding modification order in atomics

use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering::Relaxed;
fn main() {
    let counter: &'static _ = Box::leak(Box::new(AtomicUsize::new(0)));

    std::thread::scope(|s| {
        s.spawn(move || {
            println!("lexically 1st thread: {}", counter.load(Relaxed));
            counter.fetch_add(1, Relaxed);
        });

        s.spawn(move || {
            println!("lexically 2nd thread: {}", counter.load(Relaxed));
            counter.fetch_add(1, Relaxed);
        });

        s.spawn(move || {
            println!("lexically 3rd thread: {}", counter.load(Relaxed));
            counter.fetch_add(1, Relaxed);
        });

        s.spawn(move || {
            println!("lexically 4th thread: {}", counter.load(Relaxed));
            counter.fetch_add(1, Relaxed);
        });
    });

    let val = counter.load(Relaxed);
    println!("{val}");
}

one output:

lexically 2nd thread: 0
lexically 1st thread: 0
lexically 3rd thread: 0
lexically 4th thread: 0
4

since no thread observed any intermediate state, how do we still end up with the correct output 4?
does fetch_add fetch the value from global memory, or could the value come from a cache too?

This scenario seems very simple: each thread can run counter.load, time passes and each of those calls finishes, and then each thread runs counter.fetch_add.

If you move the counter.fetch_add call into the println, you always see 0, 1, 2, and 3 (in some order).

why don't we see different outputs in this case? why always 0, 1, 2 and 3?

There are four atomic increments of the same location, so

  • one of them must have incremented from 0 to 1 and returned 0,
  • one of them must have incremented from 1 to 2 and returned 1,
  • one of them must have incremented from 2 to 3 and returned 2,
  • and one of them must have incremented from 3 to 4 and returned 3.

There are no guarantees about which one gives which outcome, but all four of those must happen before the thread scope ends.
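The guarantee above can be checked directly. This is a sketch (the helper name `observed_values` is made up) that collects the value each `fetch_add` returned; no matter how the threads interleave, the four returned values are always a permutation of 0, 1, 2, 3:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

// Runs four threads that each do one atomic increment, and
// returns the values those increments observed, sorted.
fn observed_values() -> Vec<usize> {
    let counter = AtomicUsize::new(0);
    let mut seen: Vec<usize> = std::thread::scope(|s| {
        let handles: Vec<_> = (0..4)
            .map(|_| s.spawn(|| counter.fetch_add(1, Relaxed)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    seen.sort();
    seen
}

fn main() {
    // Regardless of scheduling, the four RMWs each observe a
    // distinct value: 0, 1, 2 and 3.
    assert_eq!(observed_values(), vec![0, 1, 2, 3]);
    println!("ok");
}
```

Which thread gets which value still varies from run to run; only the set of values is fixed.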

On the other hand, when you write load() and fetch_add() as in your original code, those are separate operations, which do not together make a single atomic operation. Nothing requires any of the loads to be ordered after any of the fetch_adds; or in other words, there is nothing stopping the load()s from returning arbitrarily “stale” values, such as the initial value 0.
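To make the "separate operations" point concrete, here is a sketch (the name `racy_total` is made up) where each increment is done as a separate load followed by a store instead of one RMW. Two threads can both load the same value, and then one store overwrites the other, losing an increment:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

const THREADS: usize = 4;
const ITERS: usize = 10_000;

// Each thread increments with a separate load and store instead of
// one atomic RMW. Two threads can both load the same value, and the
// later of the two stores then undoes the other's increment.
fn racy_total() -> usize {
    let counter = AtomicUsize::new(0);
    std::thread::scope(|s| {
        for _ in 0..THREADS {
            s.spawn(|| {
                for _ in 0..ITERS {
                    let v = counter.load(Relaxed); // may be stale by the time we store
                    counter.store(v + 1, Relaxed); // can overwrite another thread's store
                }
            });
        }
    });
    counter.into_inner()
}

fn main() {
    let total = racy_total();
    // The only guarantee is an upper bound; on a multicore machine
    // the result is typically lower because increments get lost.
    assert!(total >= 1 && total <= THREADS * ITERS);
    println!("total = {total} (max possible: {})", THREADS * ITERS);
}
```

Each individual load and store here is still atomic; what's missing is the indivisibility of the pair, which is exactly what `fetch_add` provides.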


makes sense, but could you please explain why this returns the exact same set of values every single time:

use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering::Relaxed;
fn main() {
    let counter: &'static _ = Box::leak(Box::new(AtomicUsize::new(0)));

    std::thread::scope(|s| {
        s.spawn(move || {
            println!("lexically 1st thread: {}", counter.fetch_add(1, Relaxed));
        });

        s.spawn(move || {
            println!("lexically 2nd thread: {}", counter.fetch_add(1, Relaxed));
        });

        s.spawn(move || {
            println!("lexically 3rd thread: {}", counter.fetch_add(1, Relaxed));
        });

        s.spawn(move || {
            println!("lexically 4th thread: {}", counter.fetch_add(1, Relaxed));
        });
    });

    let val = counter.load(Relaxed);
    println!("{val}");
}

output:

lexically 4th thread: 0
lexically 1st thread: 1
lexically 3rd thread: 2
lexically 2nd thread: 3
4

why doesn't it print something like:

lexically 4th thread: 0
lexically 1st thread: 1
lexically 3rd thread: 1
lexically 2nd thread: 1
4

do threads agree beforehand that:
=> when the value is 0 thread 4 gets to increment
=> when the value is 1 thread 1 gets to increment
=> when the value is 2 thread 3 gets to increment
=> when the value is 3 thread 2 gets to increment
something like this?

The hardware doesn’t know anything about threads and doesn’t do any coordination “beforehand”. Each time a fetch_add() executes, it is required that the individual operation is atomic in the same sense that database transactions are atomic; that is, they are indivisible. No other loads or stores can observe or change the value in the middle of the fetch_add operation.

So if a fetch_add(1) loaded 1 and returned 1, it is guaranteed that[1] no other fetch_add(1) can have loaded 1 — just as if you used a Mutex<usize>, but more efficient. This guarantee holds between any two fetch_add(1)s, regardless of what threads or what cores execute them.
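The Mutex<usize> comparison can be spelled out. This sketch (the helper name `locked_fetch_add` is made up) gives the same distinct-values guarantee as `fetch_add`, because the lock makes the load and the store indivisible; it's just slower:

```rust
use std::sync::Mutex;

// A lock-based equivalent of AtomicUsize::fetch_add(1, _):
// holding the lock across the read and the write means no two
// calls can observe the same old value.
fn locked_fetch_add(counter: &Mutex<usize>) -> usize {
    let mut guard = counter.lock().unwrap();
    let old = *guard;
    *guard = old + 1;
    old
}

fn main() {
    let counter = Mutex::new(0usize);
    let mut seen: Vec<usize> = std::thread::scope(|s| {
        let handles: Vec<_> = (0..4)
            .map(|_| s.spawn(|| locked_fetch_add(&counter)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    seen.sort();
    // Same guarantee as four concurrent fetch_add(1)s.
    assert_eq!(seen, vec![0, 1, 2, 3]);
    println!("{seen:?}");
}
```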


  1. (in this program, which has no subtractions and no overflow)


oh, I see. Also, can we say that when using fetch_add (or any fetch_* method) or compare_exchange, one need not worry about the global shared memory and the cache being different... will they be consistent?

The caches in CPU cores are an implementation detail. Knowing about them can tell you things about why atomic operations are designed the way they are, and why you might get specific outcomes within the rules, but the CPU (and the compiler generating code for the CPU) is responsible for giving you outcomes that always obey the rules of atomic operations.


Can I say that the next thread which gets scheduled after a successful increment (using an RMW) will see the incremented value, provided that the scheduled thread also uses an RMW, i.e. everything happens as a single atomic instruction, under any memory ordering?

That's a model which works for single-core systems, but in multi-core, there is no single "the next thread which gets scheduled"; multiple threads are actually running simultaneously and the cores they run on have separate caches. Correct behavior of atomics is maintained through cache coherency protocols performed by the cores.
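One way to see what the coherency protocol must provide: `fetch_add` behaves as if it were the following compare-exchange loop, which retries whenever another core changed the value in between. This is a sketch for intuition (the name `my_fetch_add` is made up), not how any particular CPU implements it:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

// fetch_add(n) behaves as if implemented by this retry loop:
// read the current value, then atomically install current + n,
// but only if nobody else changed the value in the meantime.
fn my_fetch_add(a: &AtomicUsize, n: usize) -> usize {
    let mut current = a.load(Relaxed);
    loop {
        match a.compare_exchange_weak(current, current + n, Relaxed, Relaxed) {
            Ok(old) => return old,           // no interleaved change: done
            Err(actual) => current = actual, // someone got in between: retry
        }
    }
}

fn main() {
    let counter = AtomicUsize::new(0);
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..10_000 {
                    my_fetch_add(&counter, 1);
                }
            });
        }
    });
    // Unlike a separate load + store, no increments are lost.
    assert_eq!(counter.into_inner(), 40_000);
    println!("ok");
}
```

The retry in the `Err` arm is exactly where a stale cached value gets corrected: the failed compare-exchange hands back the value some other core installed.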


Like @kpreid said, the CPUs are running "concurrently", which is a precise term in this context.

For a bit of intuition: imagine you and I were coworkers at a software company, and on Monday we both wrote a bunch of code in our respective Git branches. On Tuesday we both make PRs and encounter a merge conflict.

This was "concurrent" work, and the important thing to note here is: it is completely irrelevant whether:

  1. We both did our work at the same time during the day
  2. I did my work in the morning, whereas you did yours in the afternoon
  3. I did half of my work in the morning, half in the evening, and you did all of yours in the afternoon
  4. [insert the infinitely many other possible orderings]

In distributed systems and concurrent programming this is an important concept to grasp - you don't actually gain any understanding by trying to think about "what's physically happening first"; you only think about the rules that must be followed, and the temporal relationships.

Similarly, "what time of day did you write your code?" would not factor into our decisions about merge-conflict handling. It was concurrent work.


I think to clarify this question in general also, your questions about these code examples actually aren't really about "memory ordering" in particular.

I'll ignore the code that has to run for println! for the following explanation just to keep it simple.

Your first code example

s.spawn(move || {
    println!("lexically Nth thread: {}", counter.load(Relaxed));
    counter.fetch_add(1, Relaxed);
});

This code has two different accesses to counter. A thread running this code will have to run (at least) 2 instructions - one to do the atomic load, and one to do the atomic fetch-add. This means that, concurrently, any arbitrary amount of "other stuff" could happen in the CPU between the first and second instruction, and if you have (say) four different threads running this code at once, you could experience any possible interleaving of the instructions that you could imagine. For example, let's imagine we have only two threads. Given the whims of your scheduler and CPU architecture, the following is entirely possible:

  1. thread-A: counter.load(Relaxed) // This will load 0
  2. thread-B: counter.load(Relaxed) // This will load 0
  3. thread-A: counter.fetch_add(1, Relaxed) // this will store 1
  4. thread-B: counter.fetch_add(1, Relaxed) // this will store 2
    Outcome: 0 0

Alternatively, and also completely possible:

  1. thread-A: counter.load(Relaxed) // This will load 0
  2. thread-A: counter.fetch_add(1, Relaxed) // this will store 1
  3. thread-B: counter.load(Relaxed) // This will load 1
  4. thread-B: counter.fetch_add(1, Relaxed) // this will store 2
    Outcome: 0 1

These are both entirely possible with the same code - it is just up to the scheduler, your cache coherency protocol, and a bunch of other complex and largely out of your control factors. There was no "agreement" between threads or coordination or anything like that, they all just looked at whatever counter happened to be when they happened to get to that respective line of code.
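The first interleaving above (both threads loading 0) can even be forced deterministically. This sketch (the helper name `forced_stale_loads` is made up) uses a Barrier to make both threads finish their load before either runs its fetch_add:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::sync::Barrier;

// Force the "both threads load before either adds" interleaving:
// the barrier makes each thread wait after its load until the
// other thread's load has also completed.
fn forced_stale_loads() -> (usize, usize, usize) {
    let counter = AtomicUsize::new(0);
    let barrier = Barrier::new(2);
    let (a, b) = std::thread::scope(|s| {
        let t1 = s.spawn(|| {
            let seen = counter.load(Relaxed);
            barrier.wait(); // both loads are done before any add runs
            counter.fetch_add(1, Relaxed);
            seen
        });
        let t2 = s.spawn(|| {
            let seen = counter.load(Relaxed);
            barrier.wait();
            counter.fetch_add(1, Relaxed);
            seen
        });
        (t1.join().unwrap(), t2.join().unwrap())
    });
    (a, b, counter.into_inner())
}

fn main() {
    let (a, b, total) = forced_stale_loads();
    // Both threads observed 0, yet the final value is still 2.
    assert_eq!((a, b, total), (0, 0, 2));
    println!("{a} {b} {total}");
}
```

In the original code nothing plays the role of this barrier, so this interleaving is merely one possibility among many rather than a certainty.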

Now what about your other code example?

s.spawn(move || {
    println!("lexically Nth thread: {}", counter.fetch_add(1, Relaxed));
});

This code (again ignoring the println!ing) now only does one "access" to the counter variable, which it can do in a single instruction. The threads can still run this code in any arbitrary relative order to each other, but because they are each only running a single instruction, reorderings don't have nearly as much of an effect. Consider a couple of different possibilities the scheduler could spit out at you, pretending we have three threads:

  1. thread-A: counter.fetch_add(1, Relaxed) // loads 0, stores 1
  2. thread-B: counter.fetch_add(1, Relaxed) // loads 1, stores 2
  3. thread-C: counter.fetch_add(1, Relaxed) // loads 2, stores 3
    Outcome: thread-A: 0; thread-B: 1; thread-C: 2

Or:

  1. thread-C: counter.fetch_add(1, Relaxed) // loads 0, stores 1
  2. thread-A: counter.fetch_add(1, Relaxed) // loads 1, stores 2
  3. thread-B: counter.fetch_add(1, Relaxed) // loads 2, stores 3
    Outcome: thread-C: 0; thread-A: 1; thread-B: 2

To make it more complicated: in reality the println! itself is of course not atomic, so you could have the threads also load the values in a different order than they end up printing, giving you something like this:

  1. thread-C: counter.fetch_add(1, Relaxed) // loads 0, stores 1
  2. thread-A: counter.fetch_add(1, Relaxed) // loads 1, stores 2
  3. thread-A: println!(...)
  4. thread-B: counter.fetch_add(1, Relaxed) // loads 2, stores 3
  5. thread-C: println!(...)
  6. thread-B: println!(...)
    Outcome: thread-A: 1; thread-C: 0; thread-B: 2

Welcome to the joys of concurrent programming. Keep in mind too, as per my comment immediately above this one, these "orderings" are somewhat "made up" for the sake of example - in reality a lot of these things are happening genuinely concurrently (assuming a multicore system), even if in the end you can decide some temporal ordering of them that "fits" the observed result.

(PS: If my examples have any mistakes let me know)


hey, I cannot find any good resources for this...could you please suggest any : )
(apart from the wikipedia link)