Replicating false sharing in Rust

Hello folks,

I was trying to compare the performance of threads incrementing atomics and to demonstrate false sharing, following the video by CoffeeBeforeArch; his code was in C++. I tried the same code in Rust, but I was getting the same performance for false_sharing and no_sharing. I checked my cache-line size and number of logical cores, which were 64 bytes and 4 respectively. I also ran perf stat, but got the same number of cache misses on both benches.

Please check the Rust code: Code shared from the Rust Playground · GitHub

> cargo --version
cargo 1.77.0-nightly (7bb7b5395 2024-01-20)

> cargo bench -Zunstable-options --jobs 1  --profile=release -- --nocapture --exact

   Compiling sample v0.1.0 (/home/hdggxin/workspace/rust/sample)
    Finished release [optimized] target(s) in 1.32s
     Running unittests src/main.rs (target/release/deps/sample-86ce1eac8207e3c2)

running 4 tests
test direct_sharing ... bench:  12,681,022 ns/iter (+/- 476,356)
false_sharing: 0x7ffc00d03c10
false_sharing: 0x7ffc00d03c14
false_sharing: 0x7ffc00d03c18
false_sharing: 0x7ffc00d03c1c
test false_sharing  ... bench:   1,727,786 ns/iter (+/- 21,703)
no_sharing: 0x7ffc00d03ac0
no_sharing: 0x7ffc00d03b00
no_sharing: 0x7ffc00d03b40
no_sharing: 0x7ffc00d03b80
test no_sharing     ... bench:   1,727,036 ns/iter (+/- 12,869)
test single_thread  ... bench:   3,186,928 ns/iter (+/- 97,735)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 7.71s

So, how can we replicate false sharing in Rust? Or am I doing something wrong in the code?

Thanks in advance, :slight_smile:

The problem here is that you're doing for r in result, which moves all the atomics out of the array (this alone might do nothing), and then you're moving them again into the closures, which definitely separates them out. Give me a moment and I'll try to make a version that works.


Here, this is more like it

test direct_sharing ... bench:   3,966,700 ns/iter (+/- 177,116)
test false_sharing  ... bench:   3,852,850 ns/iter (+/- 175,762)
test no_sharing     ... bench:     300,655 ns/iter (+/- 38,133)
test single_thread  ... bench:     631,411 ns/iter (+/- 20,233)
#[bench]
fn false_sharing(b: &mut Bencher) {
    let result = [0, 0, 0, 0].map(|x| Small {
        number: AtomicU32::new(x),
    });
    for i in &result {
        println!("false_sharing: {:p}", i);
    }
    b.iter(|| {
        thread::scope(|s| {
            for r in &result {
                s.spawn(|| {
                    work(&r.number);
                });
            }
        });
    });
}

Edit: also note your align(1) isn't doing anything because the struct is already align(4) and you can't make it lower.

The rest of the code cleaned up: Rust Playground


An alignment of 64 is potentially too small to prevent false sharing. Crossbeam uses 128 for most 64-bit targets.


Thanks @drewtato,

This seems to be working, but what I don't understand is why iterating over the array and moving into the closure makes it run fast. Atomics are not Copy, so they should still be at the same addresses, i.e. on the same cache line, contending with the other cores? But it seems some optimisation makes Rust place them at different addresses on move...

#[bench]
fn false_false_sharing(b: &mut Bencher) -> impl Termination {
    let result = [0, 0, 0, 0].map(|x| Small {
        number: AtomicU32::new(x),
    });
    b.iter(|| {
        thread::scope(|s| {
            let result = [0, 0, 0, 0].map(|x| Small {
                number: AtomicU32::new(x),
            });
            for r in result {
                println!("false_false_sharing_before_move: {:p}", &r.number);
                s.spawn(move || {
                    println!("false_false_sharing_after_move: {:p}", &r.number);
                    work(&r.number);
                });
            }
        });
    });
}

And the result is that after the move, the addresses are not near each other but have moved to a different part of memory:

false_false_sharing_before_move: 0x7ffd2dbacee4
false_false_sharing_before_move: 0x7ffd2dbacee4
false_false_sharing_before_move: 0x7ffd2dbacee4
false_false_sharing_before_move: 0x7ffd2dbacee4
false_false_sharing_after_move: 0x7f5e6ce50d6c
false_false_sharing_after_move: 0x7f5e6d255d6c
false_false_sharing_after_move: 0x7f5e67ffed6c
false_false_sharing_after_move: 0x7f5e6d054d6c

Do you know why all the addresses are the same before the move? What is this optimisation called, and can it be disabled?

It's not really an optimization, but more a reflection of how moves in Rust work. In the absence of optimizations, every move will copy the value to a new memory location and invalidate its previous location:

  • The for r in result line moves each value out of result and into a local variable on the stack; the compiler reuses the same stack slot for each iteration of the loop, which is why your befores all show the same address.
  • The move keyword then moves the value out of r and into a member variable of the anonymous struct that represents the closure (still on the stack, though).
  • The spawn call then moves the entire closure struct (including all of its members) onto the heap somewhere, so that the new thread can continue to access it even after the original scope of r goes away in the next loop iteration. It's this heap location that your afters are printing.

Oh, got it @2e71828; thanks a lot for the explanation.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.