How expensive is Arc vs Rc vs & reference for accessing data

I understand that Arc is more expensive than Rc because it uses atomics, and that Rc is more expensive than a plain reference because it increments/decrements the reference count at runtime during clone/drop operations.

Another difference is that Arc and Rc allocate on the heap (and allocations are expensive), while a `&` reference can point to stack data. However, this allocation happens only in `::new`, not in clone/drop.

Another difference is that Arc and Rc both add extra memory on the heap (usually 16 bytes for the strong/weak counters). This may matter when you care a lot about cache performance or target embedded systems. However, this extra memory is paid once per allocation, not for every clone of the Arc/Rc.

My question: is there any additional overhead in accessing (not storing or dropping) the data behind an Arc/Rc versus a reference? All three need dereferencing at runtime, but is Arc/Rc dereferencing more expensive?


No, accessing an Rc/Arc is not in itself more expensive. Neither the atomics nor the reference counter is needed to access the inner value, and a reference into the heap isn't inherently slower to follow than a reference to any other memory area.


They (Arc/Rc) compile to identical code over identical* structures. A deref is just a pointer access at a fixed offset; in theory the offset adds overhead versus a plain reference, but CPUs have pipelines and dedicated address-generation hardware that eliminate any slowdown.

* `atomic::AtomicUsize` and `Cell<usize>` compile to identical code as long as you don't access them.

There can also be compiler optimizations that eliminate runtime code. In theory these might apply equally to all three types; in reality we don't know for sure, and nobody is inclined to put in the effort to find out.

I was also wondering about the offset. So it seems to be a theoretical overhead which (maybe?) doesn't occur in practice on common architectures.

The inner definition is here:

#[repr(C)]
struct RcBox<T: ?Sized> {
    strong: Cell<usize>,
    weak: Cell<usize>,
    value: T,
}

Which is used there:

#[cfg_attr(not(test), rustc_diagnostic_item = "Rc")]
#[stable(feature = "rust1", since = "1.0.0")]
#[rustc_insignificant_dtor]
pub struct Rc<T: ?Sized> {
    ptr: NonNull<RcBox<T>>,
    phantom: PhantomData<RcBox<T>>,
}

Thanks, I was actually worried about this offset: whether it adds one more instruction.
For the sake of experiment I went to https://rust.godbolt.org/ and compiled the following code with `-C opt-level=3` (it appears to target x86):

use std::hint::black_box;
pub fn test1() {
    let i = &0;
    black_box(i);
    let b = *i;
    println!("{}", b);
}
use std::rc::Rc;

pub fn test2() {
    let i = Rc::new(0);
    black_box(i.as_ref());
    let b = *i;
    println!("{}", b);
}

So the result for the `let b = *i;` operation in the first case is 2 instructions:

        mov     dword ptr [rsp + 4], 0
        lea     rax, [rsp + 4]

and in second case 3 instructions:

        mov     eax, dword ptr [rbx + 16]
        mov     dword ptr [rsp + 4], eax
        lea     rax, [rsp + 4]

Do you think pipelines eliminate this overhead?


You could write a benchmark to be sure, e.g. using criterion.

Good point. I have a Mac, though, so I can try it on the M1. It would be great if someone tried it on x86.

Thanks for the Rc code. I'll also put the Arc code here for completeness:

#[repr(C)]
struct ArcInner<T: ?Sized> {
    strong: atomic::AtomicUsize,
    weak: atomic::AtomicUsize,
    data: T,
}

I think this depends on the Ordering used for the clone/drop operations (as I read the code, clone uses a Relaxed increment, and drop uses a Release decrement plus an Acquire fence) and on the CPU architecture. It might compile to the same instructions as the Cell version on x86 but not on arm64.

I'm not sure that's the correct code to test with. The &0 makes the literal subject to static promotion. When testing with the following pieces of code, I got three instructions for both:

use std::hint::black_box;
use std::rc::Rc;

pub fn test1() {
    let i = 0;
    let ref_i = &i;
    black_box(ref_i);
    let b = *ref_i;
    println!("{}", b);
}

pub fn test2() {
    let i = Rc::new(0);
    black_box(i.as_ref());
    let b = *i;
    println!("{}", b);
}

the dereference compiling to

        mov     eax, dword ptr [rsp]
        mov     dword ptr [rsp + 4], eax
        lea     rax, [rsp + 4]

in the first case, and to

        mov     eax, dword ptr [rbx + 16]
        mov     dword ptr [rsp + 4], eax
        lea     rax, [rsp + 4]

in the second case.

Since the mov instruction supports all sorts of offsets and multipliers (oh, did I mention it is compute-universal?), I highly doubt there is any kind of statistically significant observable effect distinguishing the two snippets, and I'm practically certain that dereferencing Rc is never going to be a meaningful bottleneck in any non-trivial application you write.

Although it's not done currently, in theory Rc/Arc could store the pointer to T directly and use a negative offset when accessing the ref counts. That would eliminate the offset in the mov instruction.

Great! Thanks for the code correction. I vote this is the solution

The offset is removed when using Rc::into_raw, for example.
