Data races = undefined behavior?


#1

Is there a way to mutate data that is shared between threads, without synchronization and without invoking undefined behavior? Miscalculations due to data races are tolerable in my code; synchronization would hurt speed and/or memory usage, both of which are important to me.

Would this invoke undefined behavior?

use std::sync::Arc;
use std::cell::UnsafeCell;
use std::thread;

struct Data {
    val: u32,
}

struct Test {
    data: UnsafeCell<Data>
}

// Unsafely assert that Test can be shared between threads.
unsafe impl Sync for Test {}

fn main() {
    let data = Arc::new(Test {
        data: UnsafeCell::new(Data {
            val: 5,
        })
    });
    
    let data1 = data.clone();
    let thread1 = thread::spawn(move || {
        unsafe {
            (*data1.data.get()).val += 15;
        }
    });
    
    let data2 = data.clone();
    let thread2 = thread::spawn(move || {
        unsafe {
            (*data2.data.get()).val += 5;
        }
    });
    
    thread1.join().unwrap();
    thread2.join().unwrap();

    unsafe {
        println!("Val1: {}", (*data.data.get()).val);
    }        
}

(code adapted from this post)

By undefined behavior I mean code that the compiler is free to mess with in arbitrary ways. I know it’s possible that 25 won’t be printed (10 or 20 might be printed instead), since there is a data race.


#2

You can use an atomic integer if your data is that simple. The way I understand it, it still involves some mild synchronization, but not as heavy as a Mutex or an RwLock. The selection is a bit limited at the moment, but you can take a look here if you are interested: http://doc.rust-lang.org/stable/std/sync/atomic/
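
For what it’s worth, here is a rough sketch of how the example above could look with an atomic counter instead of UnsafeCell (an illustration only, not a drop-in for the real code):

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Same shape as the original example, but the counter is an atomic,
    // so no unsafe code and no undefined behavior.
    let val = Arc::new(AtomicUsize::new(5));

    let val1 = val.clone();
    let thread1 = thread::spawn(move || {
        val1.fetch_add(15, Ordering::Relaxed);
    });

    let val2 = val.clone();
    let thread2 = thread::spawn(move || {
        val2.fetch_add(5, Ordering::Relaxed);
    });

    thread1.join().unwrap();
    thread2.join().unwrap();

    // Always prints 25: the read-modify-write is atomic, so no update is lost.
    println!("Val1: {}", val.load(Ordering::Relaxed));
}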


#3

Rust doesn’t yet define undefined behavior very concretely, or what it really means for your program, but it is clear that your program must not contain any if you want to call it safe Rust; see http://doc.rust-lang.org/nightly/reference.html#behavior-considered-undefined


#4

That code contains a data race and hence is undefined behaviour. As @ogeon says, you can use atomics, and, in particular, the Relaxed ordering. You should avoid the fetch_ methods for mutation, since those have stricter guarantees than you seem to want; instead you can use load/store manually, like:

let data: AtomicUsize = ...;

data.store(data.load(Ordering::Relaxed) + 15, Ordering::Relaxed); // racing increments may be lost, but there is no UB

#5

Check atomic-option if you need to handle more complex data.


#6

I’m worried atomics will be many times slower than regular operations.

Here is an analog to the kind of operation I want to perform: there is a massive array of integers, and each thread continuously picks a random location in the array and increments the integer at that location by one. (Note: my real code does not pick truly random locations, but they are random enough that contention would be low.)

So, the only operation I would be performing with atomics is an atomic increment, and the odds of two threads accessing the same memory location are very small. I wonder what kind of overhead atomics would entail in this situation; perhaps I’ll do some benchmarks.
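
(For concreteness, a minimal sketch of that access pattern; the counter array, thread count and index formula are made up. A relaxed load + store can still lose the odd racing increment, but there is no undefined behavior:)

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Hypothetical histogram: one counter per "pixel".
    let counts: Arc<Vec<AtomicUsize>> =
        Arc::new((0..1_000_000).map(|_| AtomicUsize::new(0)).collect());

    let handles: Vec<_> = (0..4usize).map(|t| {
        let counts = counts.clone();
        thread::spawn(move || {
            for i in 0..1_000_000usize {
                // Stand-in for "pick a pseudo-random location".
                let idx = (i.wrapping_mul(31).wrapping_add(t * 7919)) % counts.len();
                // Relaxed load + store: a race here only loses an increment.
                let old = counts[idx].load(Ordering::Relaxed);
                counts[idx].store(old + 1, Ordering::Relaxed);
            }
        })
    }).collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total: usize = counts.iter().map(|c| c.load(Ordering::Relaxed)).sum();
    println!("recorded {} of {} increments", total, 4 * 1_000_000);
}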

In case someone is curious, my program is a Buddhabrot renderer.


#7

Is this for the pixel samples? I guess it depends on the rendering/sampling technique, but a race condition may introduce unwanted errors that force you to render for longer. You could divide the image into tiles or sections and let only one thread work on each section. That’s super easy with most of the scoped thread pools out there. I used a similar method in a toy path tracer and it worked quite well.
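
A rough sketch of the tiling idea (using std::thread::scope for brevity; the scoped thread pool crates have the same shape, and the per-pixel work here is obviously a stand-in):

use std::thread;

// Each thread gets exclusive &mut access to its own slice of the image,
// so no synchronization is needed at all.
fn render_tiled(image: &mut [u32], threads: usize) {
    let tile = image.len() / threads + 1;
    thread::scope(|s| {
        for chunk in image.chunks_mut(tile) {
            s.spawn(move || {
                for px in chunk.iter_mut() {
                    *px += 1; // stand-in for the real per-pixel work
                }
            });
        }
    });
}

fn main() {
    let mut image = vec![0u32; 1024 * 1024];
    render_tiled(&mut image, 4);
    println!("sum: {}", image.iter().map(|&px| px as u64).sum::<u64>());
}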


#8

The errors should be negligible: they are rare, and each one is only a single missed increment.
Due to the rendering technique of the Buddhabrot, splitting the image into tiles is not trivial and would require a different algorithm.


#9

Hmm… yeah, I didn’t really read up on the rendering algorithm before… I thought it was similar to the Mandelbrot and Julia sets, but this one contributes samples all over the image. I don’t know what the best solution is.


#10

The Relaxed ordering is what you are looking for. I believe it compiles down to an ordinary (non-atomic-instruction) load/store on x86, but I could be wrong.

The reason that data races are undefined behavior is that they break assumptions made by the compiler’s optimizer. The compiler is allowed to assume that the data is never modified by another thread, and so the loads may be constant-folded away. Using an atomic load prevents this.
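
A made-up illustration of that assumption (not from this thread): in the first function below nothing visibly writes *flag, so the optimizer is free to read it once and spin forever; with the relaxed atomic load the read stays inside the loop, at least with current compilers.

use std::sync::atomic::{AtomicBool, Ordering};

// Plain load: a racing write would be UB, so the compiler may assume the value
// never changes and hoist the read out of the loop.
pub fn wait_plain(flag: &bool) {
    while !*flag {}
}

// Relaxed atomic load: a store from another thread is well defined, and the
// load is not optimized away.
pub fn wait_atomic(flag: &AtomicBool) {
    while !flag.load(Ordering::Relaxed) {}
}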


#11

I think this is the key thing: without actually measuring, worrying about how expensive things are, particularly relaxed atomic operations, is probably not the best use of your time.


#12

I’ve written some benchmarks. These are the results I’m getting:

test bench::bench_atomic     ... bench:          44 ns/iter (+/- 2)
test bench::bench_no_sharing ... bench:          73 ns/iter (+/- 4)

where bench_atomic is the code with the threads modifying a shared array, and bench_no_sharing is the code where each thread modifies its own array. I’m surely doing something wrong, as I don’t see how atomics could be faster than no synchronization at all. I didn’t find any information on how to benchmark multithreaded code, and the way I’m doing it feels hacky, so any advice/help would be appreciated.


#13

I wonder if the best way to benchmark code like this is to create executables that do a whole task and measure how long it takes for that to finish.

I suspect the results you’re seeing are because the no-sharing version is manipulating num_threads times as much memory as the atomics version, i.e. there’s a whole pile more memory traffic and pressure on the CPU cache(s) (and hundreds of megabytes won’t fit entirely into any level of cache, let alone gigabytes as in the no sharing test). As evidence for this: on my 8 core (4 physical, with 2 hyperthreads each) machine, the default version gives 37 ns/iter vs. 138 ns/iter and manually forcing it to use 4 threads brings the numbers to be similar to yours.

A quick look with perf (perf stat ./target/release/... --bench on Linux) indicated that the no-sharing version is doing an order of magnitude more context switches, has 4 times as many page faults, and executes a quarter as many instructions per second. Looking at the L1 cache specifically (as before, but with the -e LLC-load-misses,LLC-loads,L1-dcache-load-misses,L1-dcache-loads arguments) shows no-sharing also having almost 4 times as many cache misses as atomic for the cache closest to the CPUs (L1), but about the same for the one furthest away (LLC == last-level cache).


#14

I’ve added a set of benchmarks that simply time the span from when the threads are launched until they have all joined. Here are my results:

  • atomic 5.6553235 seconds
  • no_sharing 6.136351 seconds

I’m still amazed the atomic version is faster. I had just assumed that atomics would be slower, without considering the intricacies of CPU caches. It appears that using atomics in this situation is the best option, as it uses way less memory and is even a bit faster. Benchmarking is a good idea.
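
(The timing pattern is essentially just the following — a hypothetical sketch, with the per-thread work elided:)

use std::thread;
use std::time::Instant;

fn time_threads(threads: usize) -> f64 {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(|| { /* per-thread work goes here */ }))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Seconds from launch to the last join.
    start.elapsed().as_secs_f64()
}

fn main() {
    println!("atomic {} seconds", time_threads(4));
}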


#15

I think you still have a small misunderstanding: atomic loads and stores with the Relaxed ordering are typically compiled to the exact same assembly as regular loads and stores, because most architectures already guarantee that individual loads and stores to aligned addresses are atomic (won’t mix bits from different values or something). The only difference is what the compiler is allowed to assume when making optimizations. Of course this does not apply to atomic operations other than load and store, like fetch_add (a true “atomic add”), because then the combination of load and store must be atomic, rather than each individually. And loads and stores with stronger orderings often require memory barrier instructions.

Example: http://is.gd/tIcNaZ (compile in release mode and look at the generated assembly for one and two)
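
For readers who can’t follow the shortened link, the comparison was presumably something along these lines (a guess at the shape, not the original code):

use std::sync::atomic::{AtomicUsize, Ordering};

// In release mode both of these typically compile to the same single `mov`
// on x86_64; only what the optimizer may assume about them differs.
pub fn one(x: &usize) -> usize {
    *x
}

pub fn two(x: &AtomicUsize) -> usize {
    x.load(Ordering::Relaxed)
}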


#16

Here’s a tip for getting an easier-to-read asm file: compile with crate_type="lib". I use it all the time for things like this. http://is.gd/XT8fWB
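
In practice that just means something like the file below (a made-up example), compiled with rustc -O --emit asm, so that only the functions you wrote show up in the output:

#![crate_type = "lib"]

use std::sync::atomic::{AtomicUsize, Ordering};

// With crate_type = "lib" there is no main or startup glue in the generated
// assembly, only this function (a true atomic add, i.e. `lock xadd` on x86_64).
pub fn bump(counter: &AtomicUsize) -> usize {
    counter.fetch_add(1, Ordering::Relaxed)
}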


#17

You can even more easily see what’s going on at http://rust.godbolt.org. It has options to strip out all of the extraneous annotation in the assembler, and highlights lines in the source corresponding to lines in the result. For instance, this example shows up quite nicely: https://goo.gl/zGx1y6


#18

godbolt shows it using the stack, but the playpen version doesn’t. I don’t understand what’s causing the difference.


#19

I believe it uses debug info to highlight source lines, and that seems to force extra spills.


#20

It makes me sad whenever debuginfo affects code generation.
Is that inherent to LLVM, or could it be tuned somehow?