I was confronted with a mystery. Adding an Arc<T> to one of my structures cost me quite a lot of performance, even though that structure should only have been used in a debug code path and was never accessed.
So I made a benchmark to drill in on exactly that situation in isolation.
use divan::{Divan, Bencher, black_box};

struct WithArc {
    _shared: std::sync::Arc<u64>,
    _private: Vec<u8>,
}

struct WithoutArc {
    _shared: u64,
    _private: Vec<u8>,
}

fn main() {
    println!("sizeof WithArc={} bytes", core::mem::size_of::<Option<WithArc>>());
    println!("sizeof WithoutArc={} bytes", core::mem::size_of::<Option<WithoutArc>>());
    let divan = Divan::from_args()
        .sample_count(5000);
    divan.main();
}

#[divan::bench(args = [10000, 20000, 40000, 80000, 160000, 320000])]
fn with_arc(bencher: Bencher, n: u64) {
    let mut t: Option<WithArc> = None;
    bencher.bench_local(|| {
        for _ in 0..n {
            *black_box(&mut t) = None;
        }
    });
}

#[divan::bench(args = [10000, 20000, 40000, 80000, 160000, 320000])]
fn without_arc(bencher: Bencher, n: u64) {
    let mut t: Option<WithoutArc> = None;
    bencher.bench_local(|| {
        for _ in 0..n {
            *black_box(&mut t) = None;
        }
    });
}
And here are the results:
sizeof WithArc=32 bytes
sizeof WithoutArc=32 bytes
Timer precision: 36 ns
arc               fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ with_arc                     │               │               │               │         │
│  ├─ 10000       9.756 µs      │ 61.51 µs      │ 10.01 µs      │ 10.37 µs      │ 5000    │ 5000
│  ├─ 20000       19.49 µs      │ 414.5 µs      │ 19.98 µs      │ 20.34 µs      │ 5000    │ 5000
│  ├─ 40000       38.99 µs      │ 197.9 µs      │ 39.93 µs      │ 40.57 µs      │ 5000    │ 5000
│  ├─ 80000       69.2 µs       │ 224.3 µs      │ 79.82 µs      │ 78.63 µs      │ 5000    │ 5000
│  ├─ 160000      135.5 µs      │ 494 µs        │ 146.1 µs      │ 152.1 µs      │ 5000    │ 5000
│  ╰─ 320000      271 µs        │ 703 µs        │ 318.8 µs      │ 305.5 µs      │ 5000    │ 5000
╰─ without_arc                  │               │               │               │         │
   ├─ 10000       7.052 µs      │ 99.48 µs      │ 7.068 µs      │ 7.384 µs      │ 5000    │ 5000
   ├─ 20000       14.08 µs      │ 102.8 µs      │ 16.22 µs      │ 16.19 µs      │ 5000    │ 5000
   ├─ 40000       28.15 µs      │ 161.5 µs      │ 32.43 µs      │ 32.07 µs      │ 5000    │ 5000
   ├─ 80000       56.28 µs      │ 186.3 µs      │ 59.42 µs      │ 61.87 µs      │ 5000    │ 5000
   ├─ 160000      112.5 µs      │ 445.9 µs      │ 129.6 µs      │ 124.4 µs      │ 5000    │ 5000
   ╰─ 320000      225 µs        │ 603.9 µs      │ 237.7 µs      │ 245.8 µs      │ 5000    │ 5000
Sure enough. Assigning None when the Option's payload contains an Arc is roughly 30% more expensive than when it doesn't.
But if I cut the structures down to only the Arc vs the u64, the difference disappears. So something about the Arc being part of another structure causes the issue.
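For reference, here is a minimal std-only sketch of how the two reduced types compare in layout and drop behaviour. It does not use divan and measures nothing; it only prints what the compiler has to work with. The comments assume a 64-bit target, and the Arc niche is what current rustc does in practice rather than something documented as guaranteed.

```rust
use std::mem::{needs_drop, size_of};
use std::sync::Arc;

fn main() {
    // Arc wraps a non-null pointer, so current rustc can encode None as the
    // null pointer and needs no separate tag (observed, not guaranteed).
    println!("Option<Arc<u64>>: {} bytes", size_of::<Option<Arc<u64>>>());
    // u64 has no spare bit patterns, so Option<u64> carries an explicit tag.
    println!("Option<u64>:      {} bytes", size_of::<Option<u64>>());
    // Only the Arc version has drop glue to run when it is overwritten.
    println!("needs_drop Option<Arc<u64>>: {}", needs_drop::<Option<Arc<u64>>>());
    println!("needs_drop Option<u64>:      {}", needs_drop::<Option<u64>>());
}
```

So the reduced pair is not a like-for-like shrink of the full structs: one side loses its drop glue entirely and the two Options no longer have the same size.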
Looking over the generated asm, it looks like the Arc case has a couple of extra vmovups instructions.
I guess I'm confused as to why it's not just a matter of setting the enum tag / discriminant, regardless of whatever else was inside the enum.
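On the tag question, one thing that can be stated with certainty is that assigning to an existing place drops the old value before the new one is written. So whenever the payload has drop glue, the compiler has to at least look at the old contents. The sketch below only demonstrates that rule; Noisy and drops_on_assign are made-up names for illustration. Both WithArc and WithoutArc have drop glue because of the Vec, so this alone does not explain the gap or the extra vmovups.

```rust
use std::cell::Cell;

// Hypothetical type for illustration: bumps a counter each time it is dropped.
struct Noisy<'a>(&'a Cell<u32>);

impl Drop for Noisy<'_> {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}

// Overwrites an Option slot with None and returns how many drops ran.
fn drops_on_assign(start_some: bool) -> u32 {
    let counter = Cell::new(0);
    let mut slot = if start_some { Some(Noisy(&counter)) } else { None };
    // The previous contents of `slot` are dropped as part of this assignment.
    slot = None;
    // Touch the slot so the assignment is not flagged as unused.
    let _ = &slot;
    counter.get()
}

fn main() {
    println!("Some -> None ran {} drop(s)", drops_on_assign(true));
    println!("None -> None ran {} drop(s)", drops_on_assign(false));
}
```

In the benchmark the slot is always None, but it sits behind black_box, so the compiler presumably cannot assume that and has to keep the drop path around.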
Sorry for the rambling.
Thanks for any thoughts.