How to share a mutable reference between threads unsafely

Hi,

I have an algorithm that achieves great speedups by unsafely reading from and writing to a bitvector in parallel. At the moment I am sharing the bitvector by converting its reference into a pointer, and the pointer into a usize, which I then convert back inside each thread.
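Roughly, the pattern looks like this (a stripped-down sketch with made-up names, not my actual code):

use std::thread;

fn main() {
    let mut bits = vec![0u64; 1024]; // stand-in for the real bitvector
    let addr = bits.as_mut_ptr() as usize; // reference -> pointer -> usize

    let handles: Vec<_> = (0..4)
        .map(|t| {
            thread::spawn(move || {
                // usize -> pointer -> reference, inside each thread
                let slice = unsafe { std::slice::from_raw_parts_mut(addr as *mut u64, 1024) };
                slice[t] |= 1; // unsynchronized access to the shared memory
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}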

But I am wondering if the standard library contains a readily available primitive for this use case. When googling I somehow only ended up at RefCell, which does dynamic borrow checking. But that would not work for me, as I want to read from and write to the bitvector simultaneously.

This is (almost) the exact definition of a data race, which is undefined behavior.

If something needs to be mutated from multiple threads, you must use some kind of synchronization between readers and writers; otherwise it's possible to read something you never wrote, simply because the compiler made a wrong assumption based on your promises. You can use something like Arc<[AtomicU8]> if that suits your needs, though.
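For example, a minimal sketch of sharing an Arc<[AtomicU8]> between threads (details made up):

use std::sync::Arc;
use std::sync::atomic::{AtomicU8, Ordering};
use std::thread;

fn main() {
    // Lock-free shared byte array; every access goes through an atomic.
    let bytes: Arc<[AtomicU8]> = (0..64).map(|_| AtomicU8::new(0)).collect();

    let handles: Vec<_> = (0..4)
        .map(|t| {
            let bytes = Arc::clone(&bytes);
            thread::spawn(move || {
                bytes[t].store(1, Ordering::Relaxed); // atomic write
                bytes[0].load(Ordering::Relaxed)      // atomic read
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}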

5 Likes

Thank you for your explanation, I appreciate it. I am aware of data races, and I am sure that my algorithm is robust against them.

I would really only be interested in some type in the standard library (or some crate I have not found) that allows sharing data between threads such that it can be mutated simultaneously by all threads, without any locking or safety guarantees.

A data race is UB, and nothing is robust against it, since it is a violation of the contract between you and the compiler. If you haven't dealt with UB before, take a look at the Wikipedia article on undefined behavior.

1 Like

Even if this is true (e.g. if you only ever read and write one byte at a time, not even one word at a time), the "wrong assumption based on your promises" point still stands. The simplest miscompilation that comes to mind is: "well, there's a mutable reference, it's exclusive by definition, so two consecutive reads through it must give the same value, so the second read is unnecessary and can be removed".
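As a sketch of what that assumption could look like (hypothetical code, not from the original post):

// The compiler may assume the value behind an exclusive &mut cannot
// change between the two reads, so the second load can be replaced by
// reusing the first.
fn reads_twice(x: &mut u32) -> (u32, u32) {
    let a = *x;
    let b = *x; // may be compiled as `let b = a;`
    (a, b)
}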

1 Like

Thank you for your comment. I know about undefined behavior. Unfortunately I cannot explain why I do not have any problems with undefined behavior in concrete terms, as the code I am working on is currently confidential. We will publish it within the next two months though, so if you are interested I can update this thread then.

The trick to why it works is simpler than one might think: bits are only ever written from true to false, never the other way around, and a separate data structure provides locks that ensure the bits get updated in the correct way. It is not a problem if one thread writes while another reads, because the error caused by this is mitigated as soon as the reading thread itself tries to write: it simply retries the reading part of the computation until it arrives at a sane state in the write phase. Since reads occur much more often than writes, and the number of threads is very low compared to the number of bits, the probability that an error actually occurs is also very low, resulting in a very efficient algorithm with very low locking overhead.

But yeah, let's please keep this on topic. If someone knows a nicer way to write my "reference to usize and back" conversion using proper pointer types, without any locking or safety guarantees, I would be very happy to hear it.

Multiply that probability by how much the code will be used. At some point, even events with tiny probabilities occur frequently once compounded by a lot of usage.

This is similar reasoning as in astrophysics:
The probability of e.g. a supernova occurring at any given moment in any given galaxy is very low, at least from our perspective (roughly once every 50 years). But since there are hundreds of billions of galaxies, there are still many supernova explosions every second.

If the probability of errors is not exactly 0, at some point such errors will happen.

1 Like

It sounds like you're generally doing things that Rust's references don't allow by definition. Use raw pointers instead, so that the compiler doesn't make any incorrect assumptions.

If your code is preventing data races some other way, you can unsafe impl Send for your structure that contains the pointers to avoid the usize conversions.
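A minimal sketch of that idea, with invented names (soundness is entirely the caller's responsibility):

// A wrapper around the raw pointer that can be moved to other threads,
// so no usize round-trip is needed.
#[derive(Clone, Copy)]
struct SharedBits {
    ptr: *mut u64,
    len: usize,
}

// SAFETY: this is the caller's promise that the pointed-to memory
// outlives the threads and the concurrent accesses are sound.
unsafe impl Send for SharedBits {}

impl SharedBits {
    /// SAFETY: the caller must uphold whatever discipline makes the
    /// concurrent writes sound.
    unsafe fn set_word(self, i: usize, value: u64) {
        assert!(i < self.len);
        *self.ptr.add(i) = value;
    }
}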

7 Likes

You could check out this SO answer, which provides a way to write to a slice from several threads without synchronization. As long as any two threads write to different bytes in memory, you're fine; but if two threads may modify the same byte (even if they touch different bits of it), then it's UB unless you synchronize the access.
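The pattern there is roughly the following (a condensed sketch, not the answer verbatim):

use std::cell::UnsafeCell;

// Shares a slice between threads without locks; all soundness
// obligations are pushed onto the caller.
#[derive(Copy, Clone)]
struct UnsafeSlice<'a> {
    slice: &'a [UnsafeCell<u8>],
}

unsafe impl<'a> Send for UnsafeSlice<'a> {}
unsafe impl<'a> Sync for UnsafeSlice<'a> {}

impl<'a> UnsafeSlice<'a> {
    fn new(slice: &'a mut [u8]) -> Self {
        // SAFETY: UnsafeCell<u8> has the same memory layout as u8.
        let ptr = slice as *mut [u8] as *const [UnsafeCell<u8>];
        Self { slice: unsafe { &*ptr } }
    }

    /// SAFETY: no other thread may access index `i` at the same time.
    unsafe fn write(&self, i: usize, value: u8) {
        *self.slice[i].get() = value;
    }
}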

6 Likes

Thank you both @2e71828 and @alice for your answers.

As I understand it, both the UnsafeCell from that StackOverflow post and a *mut T where T: Send would be a "correct" way of sharing my bitvector mutably between threads. The UnsafeCell would allow me to get a *mut T, so let me treat both as a pointer. In contrast to my current approach of transferring the pointer as a usize, I would be transferring the pointer directly. Sounds simple.

However, you uncovered a problem in my code that goes a little beyond this: I am simply casting that mutable pointer to a mutable reference and expecting things to work. And things do work, because if the compiler had optimised out any of the accesses to the bitvector, my algorithm would not terminate. So I must assume that even though there are multiple mutable references to the same bitvector, the compiler either did not notice, or decided to do nothing about it. Still, it is a bit unnerving.

I would happily implement @2e71828's idea and simply stick with the pointer without ever converting it to a reference. However, the code that uses the bitvector expects a reference, and I would like it to stay that way. The code is a simple Dijkstra and has no reason to use pointers over references, or anything else unsafe. It would also be odd to copy and paste the code just to have one variant with references and one with pointers. So architecture-wise, I still strongly believe that the variant violating Rust's borrowing rules is the best.

But then, as you said, this is undefined behaviour, so potentially problematic. Would the compiler actually care, though? I have an intended race condition, which might cause errors, but I am aware of them and able to catch and fix them all. But if the compiler compiled the race condition in an unexpected way, that would break the program.

I feel like the open question is now whether the compiler would actually perform different optimisations if it knew that the data behind a mutable or immutable reference can be mutated concurrently, or whether it would ignore that.
If someone knows, please go ahead. But I will ask about this on the internals forum anyway, since there are likely more people there who know the details.

Why not use Atomic* types in your bitvec? That way you do the bare minimum to synchronize, so you avoid UB. And it won't have significant overhead for reads.

6 Likes

This is an algorithm that makes the assumption of sequential consistency. Unless you use atomics as mentioned by RustyYato, it won't work. "Won't work" means it will result in code that may appear to work on one platform and one compiler, but with high likelihood not on another. My recommendation would be to read up on memory models in modern languages.

Start with:

And continue with the section on Atomics in the Rustonomicon.

For my class, I recorded a video on how to use C11 atomics, available on YouTube. It highlights a number of cases where atomics can help you bring back defined behavior.

8 Likes

The only way to correctly write an intended race condition is atomics.

3 Likes

An Ordering::Relaxed atomic load/store is likely to be optimized to a single mov instruction. If you truly only care about setting some bools to true, and don't have any other data you are modifying that is dependent on the bools, this is likely to be both correct and sufficiently performant.
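For instance, something like this (a sketch) should compile down to plain loads and stores on x86:

use std::sync::atomic::{AtomicBool, Ordering};

// Both of these typically become a single mov on x86.
fn mark_done(flag: &AtomicBool) {
    flag.store(true, Ordering::Relaxed);
}

fn is_done(flag: &AtomicBool) -> bool {
    flag.load(Ordering::Relaxed)
}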

2 Likes

Let me give you an example of a "data race" that is not harmful, even when no assumptions about memory ordering or atomics are involved.

The example is stripped down a lot. It spawns eight threads that share a vector of booleans, and each thread tries to print the indices in the vector whose value is true. Naturally, since there is no synchronisation involved, each number may be printed more than once. However, for the purpose of this example, let us assume that there is some magic that prevents this.

Let us look at the following C++ code.

#include <thread>
#include <vector>
#include <iostream>

int main() {
    std::vector<bool> flags(64, true); // Solutions not found yet.
    std::vector<std::thread> threads;

    for (int t = 0; t < 8; t++) {
        threads.emplace_back([&] {
            // Here we simply iterate over the flags,
            // but in reality finding the "right" flag is far more complex.
            for (size_t i = 0; i < flags.size(); i++) {
                if (flags[i]) {
                    std::cout << i << ";"; // Process the solution we found.
                    flags[i] = false;
                }
            }
        });
    }

    for (auto& thread : threads) {
        thread.join();
    }
    std::cout << std::endl;
}

In C++, there is no such thing as borrowing rules for mutable variables, so we are free to create as many mutable references to flags as we want and share them between different threads. We are also free to read and write them in parallel as much as we want. Of course, we cannot be sure what the final value written will be if we do not synchronise, or what values other threads read in the meantime.

However, in our case the only value written is false, and the only effect of reading a "wrong" value is arriving at the line processing our solution twice with the same i. But this is not a problem, as it is explicitly allowed. So even though there is a data race without assumptions about memory ordering or atomics, everything should go as expected. Or am I missing something?

Yes, you're missing the sledgehammer power of "undefined behavior." A compliant C++ compiler can think about your program for a second and then emit the following code:

#include <cstdio>

int main() {
    printf("pineapple\n");
}

and it would be compliant with the C++ specification.

The progress that was made over the last 20 years (with Boehm's and Adve's work in the 2000s, followed by the development of the C/C++ and Java memory models, and the introduction of atomics into the C/C++ languages, which is what Rust piggybacks on) allows us to reason about what the possible behavior of code such as yours might be.

For instance, if you declared your vector std::vector<std::atomic<bool>> (and made no other changes (*)) we would now be able to reason about your code, about what accesses could happen in what order, about the set of possible values threads would see in the shared array, about when other values would be seen by threads, etc. etc. We could then apply the reasoning you're aiming for, which is that it may be an algorithm that operates on shared data without locks in a (hopefully) correct fashion.

(*) I'm not quite sure about that, let's pretend for a second you were using an array instead of a vector.

7 Likes

Okay, let us consider this code. It is the same as before, except with those changes made. And let us set aside the vector/array problem. (I would guess that we only do read accesses to the vector struct itself, and only write to the slice of memory pointed to by vector::data.)

#include <thread>
#include <vector>
#include <iostream>
#include <atomic>

int main() {
    std::vector<std::atomic<bool>> flags(64, true); // Solutions not found yet.
    std::vector<std::thread> threads;

    for (int t = 0; t < 8; t++) {
        threads.emplace_back([&] {
            // Here we simply iterate over the flags,
            // but in reality finding the "right" flag is far more complex.
            for (size_t i = 0; i < flags.size(); i++) {
                if (flags[i].load(std::memory_order_relaxed)) {
                    std::cout << i << ";"; // Process the solution we found.
                    flags[i].store(false, std::memory_order_relaxed);
                }
            }
        });
    }

    for (auto& thread : threads) {
        thread.join();
    }
    std::cout << std::endl;
}

Now there should not be any undefined behaviour anymore, or is there?

And speaking about performance, loading and storing a boolean should be atomic on e.g. x86 architectures, so the calls to load and store should compile to a single mov (like geeklink and RustyYato hinted)? So if the compiler did not treat data races as undefined behaviour, this code would compile to the same thing as the code in my last post?

The version with atomics is fine (though it is not a bitvector; each boolean takes up a full byte). It would also be possible to write the version with atomics in Rust.

In Rust, you would be able to write a true atomic bitvector by setting specific bits with fetch_or, although this would likely be less performant than using a byte per boolean.
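A rough sketch of what that could look like (type and method names invented here):

use std::sync::atomic::{AtomicU64, Ordering};

// An atomic bitvector: one AtomicU64 stores 64 bits.
struct AtomicBits {
    words: Vec<AtomicU64>,
}

impl AtomicBits {
    fn new(bits: usize) -> Self {
        let words = (0..(bits + 63) / 64).map(|_| AtomicU64::new(0)).collect();
        Self { words }
    }

    fn set(&self, i: usize) {
        self.words[i / 64].fetch_or(1 << (i % 64), Ordering::Relaxed);
    }

    fn get(&self, i: usize) -> bool {
        (self.words[i / 64].load(Ordering::Relaxed) & (1 << (i % 64))) != 0
    }
}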

3 Likes

Performance-wise, I don't care about writes, since in the actual algorithm, reads are much more frequent. But if I update one bit with fetch_*, and read another bit from an unrelated memory location at the same time, using relaxed memory order, then the read should still be as fast as a read without using atomics?


After reading the article linked by godmar, I think I understand the motivation for forbidding all data races a bit better. However, it sounds like race conditions are still possible by using load and store with relaxed memory ordering on an atomic. In fact, it sounds like doing this allows all the freedoms a compiler unaware of race conditions would give, except that the loads and stores are actually atomic. And except that such a "legal" race condition is defined as not being a data race, because an atomic is a "synchronisation variable".

So am I understanding correctly that excluding data races in Rust and C++ does not mean actually forbidding them, but rather hiding them behind specific syntax that makes it impossible to stumble upon them by accident?

Yeah, this is pretty much correct. Atomics are a way to tell the compiler "data races could happen here", and the ordering determines how well behaved those data races should be.

1 Like