How to share a mutable reference between threads unsafely

Ignoring potential data races related to the vector, I agree that this code should have defined behavior, so we can say a few things about it:

  • that all threads see flags as all true initially (happens-before induced by spawn)
  • that the parent thread will see flags as all false after all threads are joined (hb induced by join)
  • that no thread will see a true value after it has seen a false value for flags[i] (transitivity)
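Concretely, the three guarantees above can be sketched with scoped threads (a sketch of my own; I'm assuming the flags live in a `Vec<AtomicBool>`, which matches the rest of the thread):

```rust
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::thread;

/// Spawns one thread per flag; each flips its flag from true to false.
/// Returns the flag values observed by the parent after joining.
fn run(n: usize) -> Vec<bool> {
    let flags: Vec<AtomicBool> = (0..n).map(|_| AtomicBool::new(true)).collect();
    thread::scope(|s| {
        for flag in &flags {
            s.spawn(move || {
                // (1) spawn happens-before: this thread must see `true`.
                assert!(flag.load(Relaxed));
                flag.store(false, Relaxed);
            });
        }
    }); // all threads are joined when the scope ends
    // (2) join happens-before: the parent must see every flag as `false`.
    flags.iter().map(|f| f.load(Relaxed)).collect()
}

fn main() {
    assert!(run(4).iter().all(|&f| !f));
}
```

Note that Relaxed suffices here: the ordering comes from spawn and join themselves, not from the atomic operations.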

I think that's about all we can say. Most notably, we cannot say:

  • when a thread sees flags[i] become false, relative to other threads' progress
  • anything about other atomic variables you may share between threads. In other words, loads and stores from/to those variables will have no relationship to the loads and stores to flags[i] across threads. In particular, when considering all updates by all threads to all atomic shared variables, there may not exist a single sequential order that is consistent with the update order observed by every thread.

So the correctness of whatever algorithm these flags implement could be judged only once the rest of the code that accesses shared state is revealed (if there is such code).

It helps to be somewhat formal (it's not overdoing it, IMO) and stick with the definition of a data race as:

A program that has two conflicting evaluations has a data race unless either both conflicting evaluations are atomic operations or one of the conflicting evaluations happens-before the other.

where "conflicting evaluations" are defined as:

When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict

When adopting this view, we wouldn't call it a "data race." (A memory location, by the way, is an object of scalar type, or a maximal sequence of adjacent bit-fields, not an entire object.)
So data races are undefined behavior. Evaluations involving two atomic operations on the same memory location don't conflict; rather, they are governed by the memory model's guarantees. These guarantees can range from almost nothing (relaxed) to very little (sequential consistency).
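To make the distinction concrete, here is a minimal sketch of my own: the two accesses below conflict in the sense of the definition above (a write and a read of the same memory location), but because both are atomic there is no data race, only an unspecified result.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

/// Two conflicting accesses to the same memory location, both atomic:
/// no data race, and therefore no undefined behavior.
fn race_free() -> u32 {
    let x = AtomicU32::new(0);
    thread::scope(|s| {
        s.spawn(|| x.store(42, Relaxed));
        // May observe either 0 or 42: unspecified, but well-defined.
        s.spawn(|| x.load(Relaxed)).join().unwrap()
    })
}

fn main() {
    let v = race_free();
    assert!(v == 0 || v == 42);
}
```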

Although I'm not 100% sure, using relaxed assignments to atomics probably has no purpose beyond removing the formal undefinedness of a program that would otherwise exist. Most algorithms one might come up with require at least sequential consistency (IIRC classic algorithms like Dekker's and Peterson's do).
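For illustration, here is a sketch of Peterson's lock for two threads (my own rendering, not code from this thread): with SeqCst on every access to `flag` and `turn` it provides mutual exclusion, whereas weakening these orderings to Relaxed would break it.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicUsize, Ordering::{Relaxed, SeqCst}};
use std::thread;

pub struct Peterson {
    flag: [AtomicBool; 2],
    turn: AtomicUsize,
}

impl Peterson {
    pub fn new() -> Self {
        Peterson {
            flag: [AtomicBool::new(false), AtomicBool::new(false)],
            turn: AtomicUsize::new(0),
        }
    }

    /// `me` must be 0 or 1, and each side must be used by one thread only.
    pub fn lock(&self, me: usize) {
        let other = 1 - me;
        self.flag[me].store(true, SeqCst); // announce interest
        self.turn.store(other, SeqCst);    // yield priority
        // Spin while the other side is interested and holds priority.
        while self.flag[other].load(SeqCst) && self.turn.load(SeqCst) == other {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self, me: usize) {
        self.flag[me].store(false, SeqCst);
    }
}

/// Two threads increment a counter with a deliberately split
/// load-then-store; this is only safe because the lock serializes access.
fn stress() -> u32 {
    let lock = Peterson::new();
    let counter = AtomicU32::new(0);
    thread::scope(|s| {
        let (lock, counter) = (&lock, &counter);
        for me in 0..2 {
            s.spawn(move || {
                for _ in 0..10_000 {
                    lock.lock(me);
                    let v = counter.load(Relaxed);
                    counter.store(v + 1, Relaxed);
                    lock.unlock(me);
                }
            });
        }
    });
    counter.load(Relaxed)
}

fn main() {
    assert_eq!(stress(), 20_000); // no lost updates under the lock
}
```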

An approach that's sometimes taken is to start with an implementation that uses default atomics with sequential consistency, and then reason about how to relax it (for related operations, this would typically be acquire/release sections). See this paper for an example of this technique in the context of a lock-free deque.


It is probably important to point out that relaxed atomics are very hard to reason about. For example, consider having two booleans that both start out false and these 3 threads:

  1. Thread A sets bool_a to true.
  2. Thread B loops until bool_a becomes true, then sets bool_b to true.
  3. Thread C loops until bool_b becomes true, then reads the value of bool_a.

Then in the above program with relaxed atomics, it is possible for thread C to see false when it reads bool_a.

Similarly if two booleans are both set to true concurrently, then other threads may disagree which boolean became true first.
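The three-thread scenario above can be written down directly (a sketch of my own, assuming all accesses are Relaxed). On most hardware thread C will in practice read true, but the model does not require it; Release stores paired with Acquire loads would.

```rust
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::thread;

/// Returns (what thread C saw in bool_a, final bool_a, final bool_b).
fn chain() -> (bool, bool, bool) {
    let bool_a = AtomicBool::new(false);
    let bool_b = AtomicBool::new(false);
    let c_saw = thread::scope(|s| {
        // Thread A: set bool_a.
        s.spawn(|| bool_a.store(true, Relaxed));
        // Thread B: wait for bool_a, then set bool_b.
        s.spawn(|| {
            while !bool_a.load(Relaxed) {
                std::hint::spin_loop();
            }
            bool_b.store(true, Relaxed);
        });
        // Thread C: wait for bool_b, then read bool_a.
        s.spawn(|| {
            while !bool_b.load(Relaxed) {
                std::hint::spin_loop();
            }
            // With Relaxed, this load is allowed to return `false`!
            // (Release stores + Acquire loads would guarantee `true`.)
            bool_a.load(Relaxed)
        })
        .join()
        .unwrap()
    });
    (c_saw, bool_a.load(Relaxed), bool_b.load(Relaxed))
}

fn main() {
    // We can't assert what C observed, only that everything terminated
    // and that the parent (after joining) sees both flags as true.
    let (_c_saw, a, b) = chain();
    assert!(a && b);
}
```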


Even if (relaxed) atomic loads/stores compile down to a single mov, it's important to recognize that this doesn't mean they have no performance impact. Non-atomic reads/writes have the potential to compile down to fewer than one mov: the compiler may keep the value in a register, or even trace the data flow through the CFG and eliminate the variable entirely.

There's another thing about data races I've heard someone say: you should assume a non-atomic write fills the value with a random bit pattern while the write is in progress. That is, given a variable that starts at 1 and a thread that writes 0 to it, a second thread may observe the variable in a state other than 0 or 1 (e.g. 7). This can cause bugs, and much more serious problems in a language like Rust or C++ where a bool has a limited number of valid bit patterns: you could end up with a bool that is neither true nor false.

With atomics, this is no longer a data race, but possibly a race condition (see Race Condition vs. Data Race – Embedded in Academia (regehr.org)).

x86 is strongly ordered, so atomic loads and stores compile to the same instructions as non-atomic ones. ARM, for example, is not, so atomic operations are a bit more expensive there, though Relaxed is almost always (always?) the same. So yes, I think we can characterize Relaxed as "preventing some compiler optimizations" (whereas the stronger orderings also prevent CPU optimizations).


These are the kinds of rumors that the C11 memory model put to rest by making it UB.

I wouldn't say that. With atomics, you don't have a data race (by definition). Whether you have a race condition (which is another term for a concurrency-related bug; that's also how Regehr defines it: "a flaw that...") depends on your code and your requirements. In the example given (a monotonic transition from true to false, and presumably no error if two threads see a flag as true), there may not be a race condition.

Yes, that's what I meant. I should have been more accurate.

I think there's another important use for atomics beyond preventing confusing inter-thread behavior and preventing data races: atomics can sometimes be necessary to avoid unsound optimizations that assume unique access. Basically, relaxed atomic operations allow you to mutate the contents of shared references, exactly like Cell but multi-threaded/Sync.

For example:

use std::cell::Cell;
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

pub fn cell(target: &Cell<u32>) {
    target.set(123);
    // do some slow calculation (assume no side effects)...
    target.set(321);
}

pub fn ref_mut(target: &mut u32) {
    *target = 456;
    // do some slow calculation (assume no side effects)...
    *target = 654;
}

pub fn atomic(target: &AtomicU32) {
    target.store(789, Relaxed);
    // do some slow calculation (assume no side effects)...
    target.store(987, Relaxed);
}

("some slow calculation" could just be e.g. std::thread::sleep, but I had trouble finding a good, clean example that optimized properly)

Note the assembly generated in Godbolt. For the first example, since cells are single-threaded only, while our function is running, no other code can read from/write to target. And for the second example, we have fully mut/unique access to target (in fact, no other code can even hold a reference to target). In each case, since no other code can access target until the function returns, the first write is essentially dead code and can be optimized out.

However, in the atomic example, we don't necessarily have unique access while the function runs, so other threads can see the effect of the first write (e.g. while the current thread is doing the "slow calculation"), so we must actually perform both writes.


Come to think of it, I wonder if it makes more sense conceptually to say that Cell acts just like atomics, just with a few benefits enabled by single-threading (generic contents, better optimizations, etc.)

Sure, I sometimes describe it as "atomics are just a multi-threaded Cell<int>". Similarly, a RefCell is just a single-threaded RwLock.
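The analogy maps nearly method-for-method; a small sketch of my own:

```rust
use std::cell::RefCell;
use std::sync::RwLock;

/// The same sequence of operations with both types.
fn demo() -> (usize, usize) {
    // Single-threaded: dynamically checked borrows; panics on a violation.
    let cell = RefCell::new(vec![1, 2, 3]);
    cell.borrow_mut().push(4);          // analogous to RwLock::write
    let cell_len = cell.borrow().len(); // analogous to RwLock::read

    // Multi-threaded: same shape, but conflicting access blocks instead
    // of panicking, and the guards come wrapped in Result (poisoning).
    let lock = RwLock::new(vec![1, 2, 3]);
    lock.write().unwrap().push(4);
    let lock_len = lock.read().unwrap().len();

    (cell_len, lock_len)
}

fn main() {
    assert_eq!(demo(), (4, 4));
}
```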


x86 uses TSO (total store order) as its memory model. This is much stronger than the ARM memory model, but it does sometimes require fences. In addition, atomic read-modify-write operations are truly atomic, while their non-atomic counterparts behave as if they were multiple instructions. For example, a parallel inc may result in a final value less than expected if two threads both load and increment before either stores the computed value; lock inc will always produce the expected value because it is an actually atomic operation.
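The inc vs. lock inc distinction can be mirrored at the Rust level (my own sketch): fetch_add is a single atomic read-modify-write like lock inc, while a separate load followed by a store behaves like the split load/add/store sequence and can lose concurrent increments.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

const THREADS: u32 = 4;
const PER_THREAD: u32 = 100_000;

/// Returns (fetch_add total, load-then-store total).
fn run() -> (u32, u32) {
    let exact = AtomicU32::new(0);
    let lossy = AtomicU32::new(0);
    thread::scope(|s| {
        for _ in 0..THREADS {
            s.spawn(|| {
                for _ in 0..PER_THREAD {
                    // Single atomic read-modify-write: like `lock inc`.
                    exact.fetch_add(1, Relaxed);
                    // Separate load and store: like `inc` split into
                    // load/add/store; concurrent increments can be lost.
                    let v = lossy.load(Relaxed);
                    lossy.store(v + 1, Relaxed);
                }
            });
        }
    });
    (exact.load(Relaxed), lossy.load(Relaxed))
}

fn main() {
    let (exact, lossy) = run();
    assert_eq!(exact, THREADS * PER_THREAD); // never loses an update
    assert!(lossy <= THREADS * PER_THREAD); // may have lost some
}
```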


Thanks everyone for the explanations and taking the time to convince me! I think I am now more versed in the world of lock-free concurrency.

I simply changed my code now to use a Vec<AtomicBool>, since it seems to be the easiest solution without having to worry about performance.

The term conflicting evaluation makes sense, godmar, thank you for that! And thank you for the paper about the lock-free deque, I will take a look.

Maybe you want TearCell from https://crates.io/crates/tearor ?
