How to think about assigning None (for performance)

I was confronted with a mystery. Adding an Arc<T> to one of my structures cost me quite a lot of performance, even though that structure should only have been used in a debug code path and never accessed.

So I wrote a benchmark to drill into exactly that situation in isolation.

use divan::{Divan, Bencher, black_box};

struct WithArc {
    _shared: std::sync::Arc<u64>,
    _private: Vec<u8>,
}

struct WithoutArc {
    _shared: u64,
    _private: Vec<u8>,
}

fn main() {
    println!("sizeof WithArc={} bytes", core::mem::size_of::<Option<WithArc>>());
    println!("sizeof WithoutArc={} bytes", core::mem::size_of::<Option<WithoutArc>>());

    let divan = Divan::from_args()
        .sample_count(5000);
    divan.main();
}

#[divan::bench(args = [10000, 20000, 40000, 80000, 160000, 320000])]
fn with_arc(bencher: Bencher, n: u64) {

    let mut t: Option<WithArc> = None;
    bencher.bench_local(|| {
        for _ in 0..n {
            *black_box(&mut t) = None;
        }
    });
}

#[divan::bench(args = [10000, 20000, 40000, 80000, 160000, 320000])]
fn without_arc(bencher: Bencher, n: u64) {

    let mut t: Option<WithoutArc> = None;
    bencher.bench_local(|| {
        for _ in 0..n {
            *black_box(&mut t) = None;
        }
    });
}

And here are the results:

sizeof WithArc=32 bytes
sizeof WithoutArc=32 bytes
Timer precision: 36 ns
arc             fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ with_arc                   │               │               │               │         │
│  ├─ 10000     9.756 µs      │ 61.51 µs      │ 10.01 µs      │ 10.37 µs      │ 5000    │ 5000
│  ├─ 20000     19.49 µs      │ 414.5 µs      │ 19.98 µs      │ 20.34 µs      │ 5000    │ 5000
│  ├─ 40000     38.99 µs      │ 197.9 µs      │ 39.93 µs      │ 40.57 µs      │ 5000    │ 5000
│  ├─ 80000     69.2 µs       │ 224.3 µs      │ 79.82 µs      │ 78.63 µs      │ 5000    │ 5000
│  ├─ 160000    135.5 µs      │ 494 µs        │ 146.1 µs      │ 152.1 µs      │ 5000    │ 5000
│  ╰─ 320000    271 µs        │ 703 µs        │ 318.8 µs      │ 305.5 µs      │ 5000    │ 5000
╰─ without_arc                │               │               │               │         │
   ├─ 10000     7.052 µs      │ 99.48 µs      │ 7.068 µs      │ 7.384 µs      │ 5000    │ 5000
   ├─ 20000     14.08 µs      │ 102.8 µs      │ 16.22 µs      │ 16.19 µs      │ 5000    │ 5000
   ├─ 40000     28.15 µs      │ 161.5 µs      │ 32.43 µs      │ 32.07 µs      │ 5000    │ 5000
   ├─ 80000     56.28 µs      │ 186.3 µs      │ 59.42 µs      │ 61.87 µs      │ 5000    │ 5000
   ├─ 160000    112.5 µs      │ 445.9 µs      │ 129.6 µs      │ 124.4 µs      │ 5000    │ 5000
   ╰─ 320000    225 µs        │ 603.9 µs      │ 237.7 µs      │ 245.8 µs      │ 5000    │ 5000

Sure enough: assigning None is roughly 30% more expensive when the payload type contains an Arc than when it doesn't.

But if I cut the structures down to just the Arc versus just the u64, the difference disappears. So something about the Arc being part of a larger structure causes the issue.

Looking over the generated assembly, the Arc case has a couple of extra vmovups instructions.

I guess I'm confused as to why it isn't just a matter of setting the enum tag / discriminant, regardless of whatever else is inside the enum.

Sorry for the rambling.
Thanks for any thoughts.

Also worth mentioning: both structures have exactly the same align_of (8 bytes).
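
For what it's worth, a standalone size check along these lines (re-declaring the two structs from the benchmark; the 32-byte figures assume a 64-bit target, matching the output below) is consistent with the discriminant living in a pointer niche rather than in a separate tag byte:

use std::sync::Arc;

struct WithArc {
    _shared: Arc<u64>,
    _private: Vec<u8>,
}

struct WithoutArc {
    _shared: u64,
    _private: Vec<u8>,
}

fn main() {
    // On a 64-bit target both payload structs are 32 bytes (8 bytes + a 24-byte Vec),
    // and wrapping them in Option adds nothing: the discriminant fits in a pointer
    // niche (the Arc / Vec pointers are never null), so there is no separate tag byte.
    assert_eq!(core::mem::size_of::<WithArc>(), 32);
    assert_eq!(core::mem::size_of::<Option<WithArc>>(), 32);
    assert_eq!(core::mem::size_of::<WithoutArc>(), 32);
    assert_eq!(core::mem::size_of::<Option<WithoutArc>>(), 32);
}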

If you assign None, the old value has to be dropped, which means atomically decrementing the strong count to see whether the T inside the Arc should be dropped as well. If you are dealing with integers, try using an AtomicU64, for example.
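
A minimal sketch of that suggestion, assuming the shared value really is just a counter (the WithAtomic struct and SHARED static are invented for illustration): a borrowed AtomicU64 has no destructor, so overwriting the Option never has to touch a reference count.

use std::sync::atomic::{AtomicU64, Ordering};

static SHARED: AtomicU64 = AtomicU64::new(0);

struct WithAtomic {
    // Borrowed shared counter: no destructor, so no atomic decrement when dropped.
    _shared: &'static AtomicU64,
    _private: Vec<u8>,
}

fn main() {
    let mut t = Some(WithAtomic {
        _shared: &SHARED,
        _private: Vec::new(),
    });
    SHARED.fetch_add(1, Ordering::Relaxed);
    assert!(t.is_some());

    // Dropping the old value still frees the Vec if it allocated anything,
    // but the shared counter itself is untouched.
    t = None;
    assert!(t.is_none());
}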

Edit: godbolt link for reference: Compiler Explorer


Ah, that makes sense. In my case it's overwriting a None, but I see it would still need to check what's there before writing. I guess that's true of the Vec too; it would need to check whether the Vec's storage has been allocated and needs freeing before overwriting it.

If you are dealing with integers, try using an AtomicU64, for example.

Unfortunately I really do need the Arc. There is a lot behind it. The tricky bit is that it's only Some in one case out of a million, so I'm trying to figure out how I can avoid paying for it the other 999,999 times.

In my simplified benchmark, putting the whole thing inside a Box equalizes things. Unfortunately that doesn't help at all in my actual code, so it seems there is another confounder in the picture.
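
For concreteness, "putting the whole thing inside a Box" would look roughly like the sketch below; the with_boxed_arc name is invented, and WithArc, Bencher, and black_box are as in the original benchmark file.

#[divan::bench(args = [10000, 20000, 40000, 80000, 160000, 320000])]
fn with_boxed_arc(bencher: Bencher, n: u64) {
    // Same inner loop as before, but the payload now sits behind a single Box pointer.
    let mut t: Option<Box<WithArc>> = None;
    bencher.bench_local(|| {
        for _ in 0..n {
            *black_box(&mut t) = None;
        }
    });
}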

Thanks again for the insight.


It only needs to check once, though, whether the Option is None. There isn’t a need to check for each field, so it is surprising that the presence of the Arc matters.

Looking at the assembly from @jendrikw’s link, I notice that

  • the compiler has decided to put the Some case first when there is an Arc and the None case first when there isn’t, and
  • there is more register saving going on in the version with Arc.

This suggests that there’s room for improvement. What happens to your benchmark if you move the deallocation explicitly into a #[cold] function? The #[cold] attribute hints to the compiler that “this is unlikely to be called”, which encourages code generation more favorable to the case where it is not called.

pub fn with_arc(t: &mut Option<WithArc>) {
    if let Some(wa) = t.take() {
        drop_cold(wa);
    }
}

#[cold]
fn drop_cold(_: WithArc) {}
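
For reference, wiring that helper into the earlier benchmark might look roughly like this (the with_arc_cold bench name is invented; WithArc, Bencher, and black_box come from the original post, and the with_arc being called is the helper above, not the original bench function of the same name):

#[divan::bench(args = [10000, 20000, 40000, 80000, 160000, 320000])]
fn with_arc_cold(bencher: Bencher, n: u64) {
    let mut t: Option<WithArc> = None;
    bencher.bench_local(|| {
        for _ in 0..n {
            // Clear through the helper so the (rare) drop path stays out of line.
            with_arc(black_box(&mut t));
        }
    });
}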

All that said, this seems likely to be one of those cases where it’s not so much intrinsically slow, as perturbed to be slow. Such cases can be very subtle (e.g. the alignment in memory of the machine code can affect its fetching and caching) and mean that the change in performance may be largely unrelated to your high-level choices.


Your solution is the fastest of the three (faster than the version with no Arc but still with a Vec, and similar in performance to boxing the struct).

So I guess it really was the drop branch being mispredicted.
