How to share a mutable reference between threads unsafely

Ignoring potential data races related to the vector, I agree that this code should have defined behavior, so we can say a few things about it:

  • that all threads see flags as all true initially (happens-before induced by spawn)
  • that the parent thread will see flags as all false after all threads are joined (hb induced by join)
  • that no thread will see a true value after it has seen a false value for flags[i] (transitivity)
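Concretely, the three guarantees above can be sketched with scoped threads (a sketch of my own; I'm assuming the flags live in a `Vec<AtomicBool>`, which matches the rest of the thread):

```rust
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::thread;

/// Spawns one thread per flag; each flips its flag from true to false.
/// Returns the flag values observed by the parent after joining.
fn run(n: usize) -> Vec<bool> {
    let flags: Vec<AtomicBool> = (0..n).map(|_| AtomicBool::new(true)).collect();
    thread::scope(|s| {
        for flag in &flags {
            s.spawn(move || {
                // (1) spawn happens-before: this thread must see `true`.
                assert!(flag.load(Relaxed));
                flag.store(false, Relaxed);
            });
        }
    }); // all threads are joined when the scope ends
    // (2) join happens-before: the parent must see every flag as `false`.
    flags.iter().map(|f| f.load(Relaxed)).collect()
}

fn main() {
    assert!(run(4).iter().all(|&f| !f));
}
```

Note that Relaxed suffices here: the ordering comes from spawn and join themselves, not from the atomic operations.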

I think that's about all we can say. Most notably, we cannot say:

  • when a thread sees flags[i] become false, relative to other threads' progress
  • anything about other atomic variables you may share between threads. In other words, loads and stores from/to those variables will have no relationship to the loads and stores to flags[i] across threads. In particular, when considering all updates by all threads to all atomic shared variables, there may not exist a single sequential order that is consistent with the update order observed by every thread.

So the correctness of whatever algorithm these flags implement could be judged only once the rest of the code that accesses shared state is revealed (if there is such code).

It helps to be somewhat formal (it's not overdoing it, IMO) and stick with the definition of a data race as:

A program that has two conflicting evaluations has a data race unless either both conflicting evaluations are atomic operations or one of the conflicting evaluations happens-before the other.

where "conflicting evaluations" are defined as:

When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict

When adopting this view, we wouldn't call it a "data race." (A memory location, by the way, is an object of scalar type, or a maximal sequence of adjacent bit-fields, not an entire object.)
So data races are undefined behavior. Evaluations involving two atomic operations on the same memory location don't conflict; rather, they are governed by the memory model's guarantees. These guarantees can range from almost nothing (relaxed) to very little (sequential consistency).
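To make the distinction concrete, here is a minimal sketch of my own: the two accesses below conflict in the sense of the definition above (a write and a read of the same memory location), but because both are atomic there is no data race, only an unspecified result.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

/// Two conflicting accesses to the same memory location, both atomic:
/// no data race, and therefore no undefined behavior.
fn race_free() -> u32 {
    let x = AtomicU32::new(0);
    thread::scope(|s| {
        s.spawn(|| x.store(42, Relaxed));
        // May observe either 0 or 42: unspecified, but well-defined.
        s.spawn(|| x.load(Relaxed)).join().unwrap()
    })
}

fn main() {
    let v = race_free();
    assert!(v == 0 || v == 42);
}
```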

Although I'm not 100% sure, using relaxed assignments to atomics probably has no purpose beyond removing the formal undefinedness of a program that would otherwise exist. Most algorithms one might come up with require at least sequential consistency (IIRC classic algorithms like Dekker's and Peterson's do).
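For illustration, here is a sketch of Peterson's lock for two threads (my own rendering, not code from this thread): with SeqCst on every access to `flag` and `turn` it provides mutual exclusion, whereas weakening these orderings to Relaxed would break it.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicUsize, Ordering::{Relaxed, SeqCst}};
use std::thread;

pub struct Peterson {
    flag: [AtomicBool; 2],
    turn: AtomicUsize,
}

impl Peterson {
    pub fn new() -> Self {
        Peterson {
            flag: [AtomicBool::new(false), AtomicBool::new(false)],
            turn: AtomicUsize::new(0),
        }
    }

    /// `me` must be 0 or 1, and each side must be used by one thread only.
    pub fn lock(&self, me: usize) {
        let other = 1 - me;
        self.flag[me].store(true, SeqCst); // announce interest
        self.turn.store(other, SeqCst);    // yield priority
        // Spin while the other side is interested and holds priority.
        while self.flag[other].load(SeqCst) && self.turn.load(SeqCst) == other {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self, me: usize) {
        self.flag[me].store(false, SeqCst);
    }
}

/// Two threads increment a counter with a deliberately split
/// load-then-store; this is only safe because the lock serializes access.
fn stress() -> u32 {
    let lock = Peterson::new();
    let counter = AtomicU32::new(0);
    thread::scope(|s| {
        let (lock, counter) = (&lock, &counter);
        for me in 0..2 {
            s.spawn(move || {
                for _ in 0..10_000 {
                    lock.lock(me);
                    let v = counter.load(Relaxed);
                    counter.store(v + 1, Relaxed);
                    lock.unlock(me);
                }
            });
        }
    });
    counter.load(Relaxed)
}

fn main() {
    assert_eq!(stress(), 20_000); // no lost updates under the lock
}
```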

An approach that's sometimes taken is to start with an implementation that uses default atomics with sequential consistency, and then reason about how to relax it (for related operations, this would typically be acquire/release sections). See this paper for an example of this technique in the context of a lock-free deque.


It is probably important to point out that relaxed atomics are very hard to reason about. For example, consider having two booleans that both start out false and these 3 threads:

  1. Thread A sets bool_a to true.
  2. Thread B loops until bool_a becomes true, then sets bool_b to true.
  3. Thread C loops until bool_b becomes true, then reads the value of bool_a.

Then in the above program with relaxed atomics, it is possible for thread C to see false when it reads bool_a.

Similarly if two booleans are both set to true concurrently, then other threads may disagree which boolean became true first.
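The three-thread scenario above can be written down directly (a sketch of my own, assuming all accesses are Relaxed). On most hardware thread C will in practice read true, but the model does not require it; Release stores paired with Acquire loads would.

```rust
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::thread;

/// Returns (what thread C saw in bool_a, final bool_a, final bool_b).
fn chain() -> (bool, bool, bool) {
    let bool_a = AtomicBool::new(false);
    let bool_b = AtomicBool::new(false);
    let c_saw = thread::scope(|s| {
        // Thread A: set bool_a.
        s.spawn(|| bool_a.store(true, Relaxed));
        // Thread B: wait for bool_a, then set bool_b.
        s.spawn(|| {
            while !bool_a.load(Relaxed) {
                std::hint::spin_loop();
            }
            bool_b.store(true, Relaxed);
        });
        // Thread C: wait for bool_b, then read bool_a.
        s.spawn(|| {
            while !bool_b.load(Relaxed) {
                std::hint::spin_loop();
            }
            // With Relaxed, this load is allowed to return `false`!
            // (Release stores + Acquire loads would guarantee `true`.)
            bool_a.load(Relaxed)
        })
        .join()
        .unwrap()
    });
    (c_saw, bool_a.load(Relaxed), bool_b.load(Relaxed))
}

fn main() {
    // We can't assert what C observed, only that everything terminated
    // and that the parent (after joining) sees both flags as true.
    let (_c_saw, a, b) = chain();
    assert!(a && b);
}
```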


Even if (relaxed) atomic loads/stores compile down to a single mov, it's important to recognize that this doesn't mean they have no performance impact. Non-atomic reads/writes have the potential to compile down to fewer than one mov: the compiler may keep the value in a register, or even trace the data flow through the CFG and eliminate the variable entirely.

There's another thing about data races I've heard someone say: you should assume a non-atomic write fills the value with a random bit pattern while the write is in progress. That is, given a variable that starts at 1 and a thread that writes 0 to it, a second thread may observe the variable in a state other than 0 or 1 (e.g. 7). This can cause bugs, and much more serious problems in a language like Rust or C++ where a bool has a limited number of valid bit patterns: you could end up with a bool that is neither true nor false.

With atomics, this is no longer a data race, but possibly a race condition (see Race Condition vs. Data Race – Embedded in Academia (regehr.org)).

x86 is strongly ordered, so atomic loads and stores compile to the same instructions as non-atomic ones. ARM, for example, is not, so atomic operations are a bit more expensive there, though Relaxed is almost always (always?) the same. So yes, I think we can characterize Relaxed as "preventing some compiler optimizations" (whereas the stronger orderings also prevent CPU optimizations).


These are the kinds of rumors that the C11 memory model put to rest by making it UB.

I wouldn't say that. With atomics, you don't have a data race (by definition). Whether you have a race condition (which is another term for a concurrency-related bug; that's also how Regehr defines it: "a flaw that...") depends on your code and your requirements. In the example given (a monotonic transition from true to false, and presumably no error if two threads see a flag as true), there may not be a race condition.

Yes, that's what I meant. I should have been more accurate.

I think there's another important use for atomics beyond preventing confusing inter-thread behavior and preventing data races: atomics can sometimes be necessary to avoid unsound optimizations that assume unique access. Basically, relaxed atomic operations allow you to mutate the contents of shared references, exactly like Cell but multi-threaded/Sync.

For example:

use std::cell::Cell;
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

pub fn cell(target: &Cell<u32>) {
    target.set(123);
    // do some slow calculation (assume no side effects)...
    target.set(321);
}

pub fn ref_mut(target: &mut u32) {
    *target = 456;
    // do some slow calculation (assume no side effects)...
    *target = 654;
}

pub fn atomic(target: &AtomicU32) {
    target.store(789, Relaxed);
    // do some slow calculation (assume no side effects)...
    target.store(987, Relaxed);
}

("some slow calculation" could just be e.g. std::thread::sleep, but I had trouble finding a good, clean example that optimized properly)

Note the assembly generated in Godbolt. For the first example, since cells are single-threaded only, while our function is running, no other code can read from/write to target. And for the second example, we have fully mut/unique access to target (in fact, no other code can even hold a reference to target). In each case, since no other code can access target until the function returns, the first write is essentially dead code and can be optimized out.

However, in the atomic example, we don't necessarily have unique access while the function runs, so other threads can see the effect of the first write (e.g. while the current thread is doing the "slow calculation"), so we must actually perform both writes.


Come to think of it, I wonder if it makes more sense conceptually to say that Cell acts just like atomics, just with a few benefits enabled by single-threading (generic contents, better optimizations, etc.)

Sure, I sometimes describe it as "atomics are just a multi-threaded Cell<int>". Similarly, a RefCell is just a single-threaded RwLock.
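The analogy maps nearly method-for-method; a small sketch of my own:

```rust
use std::cell::RefCell;
use std::sync::RwLock;

/// The same sequence of operations with both types.
fn demo() -> (usize, usize) {
    // Single-threaded: dynamically checked borrows; panics on a violation.
    let cell = RefCell::new(vec![1, 2, 3]);
    cell.borrow_mut().push(4);          // analogous to RwLock::write
    let cell_len = cell.borrow().len(); // analogous to RwLock::read

    // Multi-threaded: same shape, but conflicting access blocks instead
    // of panicking, and the guards come wrapped in Result (poisoning).
    let lock = RwLock::new(vec![1, 2, 3]);
    lock.write().unwrap().push(4);
    let lock_len = lock.read().unwrap().len();

    (cell_len, lock_len)
}

fn main() {
    assert_eq!(demo(), (4, 4));
}
```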


x86 uses TSO (total store order) as its memory model. This is much stronger than the ARM memory model, but it does sometimes require fences. In addition, atomic read-modify-write operations are truly atomic, while their non-atomic counterparts behave as if they were multiple instructions. For example, a parallel inc may result in a final value less than expected if two threads both load and increment before either stores the computed value; lock inc will always produce the expected value because it is an actually atomic operation.
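The inc vs. lock inc distinction can be mirrored at the Rust level (my own sketch): fetch_add is a single atomic read-modify-write like lock inc, while a separate load followed by a store behaves like the split load/add/store sequence and can lose concurrent increments.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

const THREADS: u32 = 4;
const PER_THREAD: u32 = 100_000;

/// Returns (fetch_add total, load-then-store total).
fn run() -> (u32, u32) {
    let exact = AtomicU32::new(0);
    let lossy = AtomicU32::new(0);
    thread::scope(|s| {
        for _ in 0..THREADS {
            s.spawn(|| {
                for _ in 0..PER_THREAD {
                    // Single atomic read-modify-write: like `lock inc`.
                    exact.fetch_add(1, Relaxed);
                    // Separate load and store: like `inc` split into
                    // load/add/store; concurrent increments can be lost.
                    let v = lossy.load(Relaxed);
                    lossy.store(v + 1, Relaxed);
                }
            });
        }
    });
    (exact.load(Relaxed), lossy.load(Relaxed))
}

fn main() {
    let (exact, lossy) = run();
    assert_eq!(exact, THREADS * PER_THREAD); // never loses an update
    assert!(lossy <= THREADS * PER_THREAD); // may have lost some
}
```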


Thanks everyone for the explanations and taking the time to convince me! I think I am now more versed in the world of lock-free concurrency.

I simply changed my code now to use a Vec<AtomicBool>, since it seems to be the easiest solution without having to worry about performance.

The term conflicting evaluation makes sense, godmar, thank you for that! And thank you for the paper about the lock-free deque, I will take a look.

Maybe you want TearCell from https://crates.io/crates/tearor ?
