Guarantees regarding atomicity and ordering with async/.await

Hi,

Let's assume I want to simulate some real-world phenomenon like planets orbiting a star. Perfect use case for concurrency. However, since I do not want to deal with nondeterminism and heisenbugs, I decide not to use OS threads. Fortunately, Rust's async/.await is here to save the day!

Now, I would still like to produce fast code and therefore do not want to use synchronization unless it is strictly necessary for correctness.

Let's consider the following code (please note the comments that identify (critical) sections):

use std::cell::RefCell;
use std::rc::Rc;
// spawn_local and block_on come from the executor, e.g. async_std::task

fn main() {
    let x: Rc<RefCell<usize>> = Rc::new(RefCell::new(0));
    let clone = Rc::clone(&x);

    let h1 = spawn_local(async move {
        // Beginning of section A
        // some computation
        // End of section A
        Fut::new().await;
        // Beginning of section B
        *x.borrow_mut() = 1;
        *x.borrow_mut() = 2;
        // End of section B
    });

    let h2 = spawn_local(async move {
        // Beginning of section C
        println!("{:?}", clone);
        // End of section C
    });

    block_on(h1);
    block_on(h2);
}

Assuming Fut (some dummy Future) is implemented, this code works with async-std 1.8.0. However, for simplicity, it can be reduced to something like this (which, of course, is not valid Rust):

fn main() {
    let mut x = 0;

    spawn(async {
        // A
        // some computation
        Fut::new().await;
        // B
        x = 1;
        x = 2;
    });

    spawn(async {
        // C
        println!("{}", x);
    });
}

Intuitively, I expect this code to be executed as though each section were protected by a mutex, in which case the following execution orders are possible: ABC, ACB, CAB. (ABC, for instance, means that section A runs first, then B, and lastly C.)

For each interleaving, the expected output is:
ABC: 2
ACB: 0
CAB: 0

But is that behaviour guaranteed? Or can the compiler/hardware reorder instructions, for example, like so?

fn main() {
    let mut x = 0;

    spawn(async {
        // A
        // some computation
        x = 1;
        Fut::new().await;
        // B
        x = 2;
    });

    spawn(async {
        // C
        println!("{}", x);
    });
}

Importantly, if this were possible, the program could now print 1 (when it takes the path ACB), which was not possible before.

Edit: Added some extra context in my follow-up post:

That depends on the executor you are using. A will always come before B, but C could come in between.

In your code, you are using spawn_local. Hence, everything runs on the same thread, and a context switch can only happen at an .await point. Of course, everything running on a single thread might give you less speedup than you were hoping for (although I'm not 100% sure whether multi-threading is what you're after). Importantly, in Rust it isn't even possible for a variable protected by a RefCell to be accessed (by reference) from multiple threads.

Of course, when you switch to something like async_std::task::spawn, which does allow execution on multiple threads in parallel, you need to choose synchronization primitives that still allow mutation of x from multiple places. If you go with atomics (e.g. AtomicUsize), then bad interleavings are possible. If you use some kind of Mutex or something similar, then the critical sections are explicit anyway.

The assignments will not move across the await. If it runs in one thread, printing 1 is not possible.

This is too abstract. It should be: "Perfect use case for parallelism". The kind of concurrency you get from asynchronicity is suited to dealing with hardware interrupts, e.g. receiving signals over the network. What you're doing is all done directly on the CPU, except for printing to standard output via println!, and I don't know whether that's worth making asynchronous[1]. My guess is: no, it isn't.

You won't get around using multiple processor cores via multi-threading if you want to improve the performance of your simulation. The trick is to keep two universes in memory: you read the current universe immutably and use it to overwrite the other one. You can split the universe into chunks, one per thread, and let each thread compute a single chunk. How you define a chunk is up to you. If it's hard to divide the chunks evenly, you may be better off using a crate like rayon: you are responsible for dividing the universe into many smaller chunks, while rayon is responsible for distributing the chunks to the threads in a way that keeps each processor core busy.

Once the next universe has been calculated, it is marked as the current universe, and the next step reuses the previous universe's memory. I see a lot of parallels to Conway's Game of Life (CGoL), and you may want to search for some multi-threaded implementations to get a rough idea of how to do it. The major difference between your simulation and CGoL is that the latter is usually based on mapping coordinates to objects, while yours would be based on mapping objects to coordinates, which will complicate collision detection, if you care about that.

[1] println! does not magically become asynchronous just because it is used in an asynchronous context. You would have to use a separate println! that was made to work asynchronously; the standard library does not provide one.


Thank you very much for your answers! Let me rephrase my question slightly, because having thought about the issue a bit, I feel I can now better explain what confuses me.

First of all, concurrency and parallelism are often conflated and I would like to stress that what I'm after here is concurrency. In particular, I would like to write code in which there are multiple threads of control.

Let's call threads of control agents, in order to distinguish the general idea of a thread of control from OS threads. As an example, imagine the task of counting and printing the length of each word in a text. While this can be achieved in multiple ways, I would like to describe a solution that uses two independent agents:

Agent 1: Extract the next word from the text and store it in buffer x. Repeat until the entire text has been processed.
Agent 2: Read a word from buffer x, count its length, print it and repeat.

Importantly, Agent 1 and Agent 2 do not have to work simultaneously. For instance, their boss could schedule Agent 1 to work on every odd date and Agent 2 to work on every even date.

Now, putting the two agents to work simultaneously could make things go faster, but it would also introduce problems. What if Agent 1 wants to write the word "strawberry" into the common buffer, but Agent 2 reads the buffer's contents before Agent 1 could write the whole word, and thinks that the next word it needs to process is "straw"?

What I would like to emphasize here is that parallelism can complicate things, and that an application which wants to employ multiple threads of control but no parallelism is perfectly conceivable.

Something similar to what I've just outlined could be implemented like so using async-std:

use async_std::channel::{bounded, Receiver, Sender};
use async_std::task::{block_on, spawn_local}; // spawn_local may require async-std's "unstable" feature

async fn producer(text: &str, buffer: Sender<String>) {
    let text: Vec<char> = text.chars().collect();
    let mut start = 0;

    loop {
        while start < text.len() && text[start].is_whitespace() {
            start += 1;
        }
        if start == text.len() {
            return;
        }
        let mut end = start + 1;
        while end < text.len() && !text[end].is_whitespace() {
            end += 1;
        }

        let word: String = text[start..end].iter().collect();
        buffer.send(word.clone()).await.unwrap();
        println!("sent: {}", word);

        if end == text.len() {
            return;
        }
        start = end + 1;
    }
}

async fn consumer(buffer: Receiver<String>) {
    loop {
        match buffer.recv().await {
            Ok(word) => {
                let cnt = word.len();
                println!("{}", cnt);
            }
            Err(_) => return,
        }
    }
}

fn main() {
    let text = "strawberry is yummy";
    let (s, r) = bounded(2);

    let p = spawn_local(producer(text, s));
    let c = spawn_local(consumer(r));
    block_on(c);
    block_on(p);
}

Running the above code prints:

sent: strawberry
sent: is
10
2
sent: yummy
5

Now, when learning about preemptively scheduled OS threads, one is told that simply looking at the source code is often not enough to determine all the possible interleavings of threads (even assuming sequential consistency). In particular, some instructions like x += 1 are not atomic even if they seem to be. Also, compilers and hardware reorder instructions, which can likewise lead to unexpected interleavings (more on this can be found, for instance, in the Nomicon: https://doc.rust-lang.org/nomicon/atomics.html).

For these reasons, I started to wonder whether one can get into similar troubles when using async/.await (in a single threaded environment), which brings me back to the code from my original post:

As already said, I would expect the sections to be executed atomically and reorderings to be impossible across await points.

Having given a second look to how asynchronous Rust gets desugared (https://rust-lang.github.io/async-book/04_pinning/01_chapter.html#why-pinning), I feel like @alice was right and assignments indeed cannot move across await points.

I actually came to the conclusion that it is not only assignments but, in fact, nothing should be able to move across await points. Is that correct?

As to the question of atomicity, I suspect that without consulting the documentation of the executor, one cannot say whether the sections can be considered atomic. What I have in mind here is a preemptive user-space scheduler implemented, for instance, using timer interrupts and some funky register manipulation. Such an executor could prevent long-running tasks from monopolizing the CPU by triggering an unexpected "context switch".

Since there is another, very closely related consideration, I would like to use this opportunity to address it. Let's consider the following code:

use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    let x = Rc::new(RefCell::new(0));
    let clone1 = Rc::clone(&x);
    let clone2 = Rc::clone(&x);

    let h1 = spawn_local(async move {
        *x.borrow_mut() = 1;
    });
    let h2 = spawn_local(async move {
        *clone1.borrow_mut() = 2;
    });
    
    block_on(h1);
    block_on(h2);

    println!("{:?}", clone2);
}

To put it simply, my question is: Is this a data race?

Using the Nomicon's definition of a data race (https://doc.rust-lang.org/nomicon/races.html), one could conclude that there indeed is a data race in the above code. However, just below this definition, the Nomicon also says that Safe Rust guarantees an absence of data races.

Now, the above code compiles and seems to work fine, which makes me think that data races occur when different processing units try to simultaneously access the same memory location. In other words, I could equally have asked: Are data races possible on a single single-core processor? What is the kind of trouble (under the hood) that data races are really about?

Going one step further: std::thread::spawn is intended to spawn an OS thread, but if I had an OS that never utilizes multiple processing units, wouldn't requiring spawn's argument to implement Send be unnecessary?

Yes, that is correct.

No, this is not possible. An executor has absolutely no control over when a task gives back control of the thread. The most it can do is refuse to poll the task again.

No. It violates the first requirement, namely that the accesses must happen from two different threads.

Fundamentally, the problem is that the compiler assumes data races cannot happen when optimizing your code, and if they do happen, it will optimize the code incorrectly. You can use types such as atomic integers to tell the compiler that some integer may indeed be modified by another thread, and since you've now told it, it can optimize correctly while still allowing this to happen. In fact, with the relaxed memory ordering on x86 CPUs, atomic integers use the exact same assembly instructions as ordinary assignments. Note that on other CPUs this may not be the case, and there too the compiler needs to know about the atomicity so it can use the right instructions.

To give an example of such an optimization, consider some code that reads the same variable in five places without modifying it. The compiler would really like to optimize this into a single read whose value is reused, but if the value can be modified by other threads, this optimization would be invalid. And indeed, if you do this with an atomic integer, calling load five times, the compiler will not make that optimization.
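That difference can be sketched like so (a toy example; whether the compiler actually caches the plain read depends on the optimization level, but it is *allowed* to, whereas the five atomic loads must each remain observable):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn sum_plain(x: &usize) -> usize {
    // The compiler may load `x` once and reuse the value five times,
    // because no other thread is allowed to modify it concurrently.
    *x + *x + *x + *x + *x
}

fn sum_atomic(x: &AtomicUsize) -> usize {
    // Each `load` is a distinct observable read: another thread could
    // store a new value between any two of them, so the compiler must
    // not collapse them into a single read.
    (0..5).map(|_| x.load(Ordering::Relaxed)).sum()
}

fn main() {
    let plain = 7usize;
    let atomic = AtomicUsize::new(7);
    assert_eq!(sum_plain(&plain), 35);
    assert_eq!(sum_atomic(&atomic), 35);
    println!("{} {}", sum_plain(&plain), sum_atomic(&atomic));
}
```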

I recommend checking out my blog post on a related topic:


Related to your blog post: earlier this year I made a lightweight cell projection crate that doesn't require proc macros to use, and supports projections to unsized fields on nightly.

https://crates.io/crates/cell-project

Nice! I'll have a look.