RC memory layout flipping

Given the code snippet from the following question:

use std::rc::Rc;

let vec_var = vec![1.0, 2.0, 3.0];
let foo = Rc::new(vec_var);
let a = Rc::clone(&foo);
let b = Rc::clone(&foo);

@thv has made a drawing about the actual memory layout.

struct RcBox<T: ?Sized> {
    strong: Cell<usize>,
    weak: Cell<usize>,
    value: T,
}

I just wonder why we have the fields in this order.
This memory arrangement favors the cloning aspect of the RC instead of the access, so the ref. counting block is under the pointer.

If I wanted to clone the RC, then in x86 it would be something like that:

mov eax, [esp+0x10] // Loading the address inside the pointer into the eax register
inc [eax]

If I wanted to access the underlying vector

mov eax, [esp+0x10] // Loading the address inside the pointer into the eax register
mov eax, [eax+0x4] // eax contains the address if the first number (1.0)

As you can see there is an address arithmetic step (+0x4) involved in the data access.

I guess for a smart pointer the dereferencing is much more common than the cloning, so I wonder if RC shall optimize for that. Swapping the members around:

struct RcBox<T: ?Sized> {
    value: T,
    strong: Cell<usize>,
    weak: Cell<usize>,
}

Cloning

mov eax, [esp+0x10] // Loading the address inside the pointer into the eax register
inc [eax+sizeof(T)]

Access the underlying vector

mov eax, [esp+0x10] // Loading the address inside the pointer into the eax register
mov eax, [eax] // eax contains the address if the first number (1.0)

Maybe for x86 it makes no difference (or very small), but for other architectures, where there is no such a sophisticated addressing logic, it might make a difference (if an explicit "add" operation is needed)

This may be more of an internals discussion, as the layout of RcBox is an implementation detail not exposed to consumers of the standard library.

But anyway, the data field has to be last for at least the case of dynamically sized types (e.g. Rc<str>).

4 Likes

Note the T: ?Sized in the declaration of RcBox. This means that a dynamically-sized RcBox can be constructed, where the size and alignment of its value is only known at runtime. In safe code, this can be done with an unsized coercion: an Rc<[i32; 1]> can be converted to an Rc<[i32]>, and an Rc<[i32; 2]> can also be converted to an Rc<[i32]>. Each Rc<[i32]> is responsible for tracking the length of the slice in its underlying RcBox<[i32]>.

Since value is placed at the end of the struct, an Rc<[i32]> can use constant offsets to access the strong, weak, and value members (+0, +8, and +16 respectively, on 64-bit systems). But if value were instead placed at the start of the struct, and could be of any length, then it would take several more operations to access the strong and weak members:

# rdi = pointer to RcBox
# rsi = slice length 
lea rax, [4*rsi + 4]
and rax, -8
mov rax, qword ptr [rdi + rax]
# rax = strong count

# rdi = pointer to RcBox
# rsi = slice length 
lea rax, [4*rsi + 4]
and rax, -8
mov rax, qword ptr [rdi + rax + 8]
# rax = weak count

Thus, the compiler forces value to be placed at the end of the RcBox, so that it can be efficiently accessed by an Rc even after an unsized coercion.

3 Likes

The main problem with this is that it doesn't compile.

Only the last member of a struct may be ?Sized. The problem is that strong and weak wouldn't have constant offsets from the beginning of the struct for dynamically sized types.

1 Like

Which architectures are we talking about? How popular are these? How often are they used?

I don't see much point discussing something which may affect god-knows-what god-knows-when without any numbers.

It makes no difference on x86, it makes no difference on Arm, it makes no difference on MIPS, it makes no difference on Sparc and on Power, it makes no difference on RISC-V… so what precisely are we talking about?

In fact the only popular architecture where the difference can be observed is x86 and then, only in code size.

Which is probably not a big enough issue to make it harder for the ICF to do it's thing.

Not saying this is really beneficial, but you can avoid this problem by keeping the current RcBox layout and storing a pointer to the start of the T instead of the RcBox<T> in Rc. Something like this Compiler Explorer

2 Likes

I guess you are right, it is more just a theoretical question. I was thinking about the AVR 8bit instruction set. There, when you make an indirect addressing with the X register, it is not possible to add an offset to it. Probably it is a negligible use-case.

You wouldn't even think about using reference counted heap-allocating pointers on a microcontroller, though, so it's probably a non-use-case.

I can see a use-case for ref counting in such restricted environments, too: In a C++ project I was using ref counted objects from a statically allocated pool. Once the object was destroyed, then its cell in the pool was marked as free. So it was a kind of half-way between static allocation and full heap implementation.