Why does Arc use one contiguous allocation for data and counters?

As far as I understand, the internals of Arc look roughly like this (ignoring some details that won't play a role in this question):

pub struct Arc<T: ?Sized> {
    ptr: NonNull<ArcInner<T>>,
}

struct ArcInner<T: ?Sized> {
    strong: atomic::AtomicUsize,
    weak: atomic::AtomicUsize,
    data: T,
}

Note that the data and the counts are stored in one contiguous allocation. This has some annoying consequences. In particular, converting an existing array or string to an Arc often requires copying the contained data. For example, if you have a String or a Box<str>, then converting that to Arc<str> requires copying the contents of the string. It seems that this issue could be largely avoided by using an Arc implementation that looks like this:

pub struct Arc<T: ?Sized> {
    ptr_counts: NonNull<ArcInner>,
    ptr_data: NonNull<T>,
}

struct ArcInner {
    strong: atomic::AtomicUsize,
    weak: atomic::AtomicUsize,
}

This two pointer construction would allow for constructing an Arc<T> from any valid pointer to T without copying the contents, so String to Arc<str> for example would be a cheap copy of some pointers and a tiny memory allocation. As an added bonus, this construction should also allow you to construct an Arc<str> that points to a substring of an existing string while sharing reference counts with the full string (correct me if I am wrong about this).
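For reference, the copy that today's single-allocation `Arc` performs on conversion can be observed directly by comparing data addresses (a small illustration, not part of the original question):

```rust
use std::sync::Arc;

fn main() {
    let s = String::from("hello world");
    let before = s.as_ptr();
    // Arc::from(String) allocates a fresh ArcInner and copies the bytes into it.
    let a: Arc<str> = Arc::from(s);
    // The data now lives at a different address, after the two counters.
    assert_ne!(before, a.as_ptr());
    assert_eq!(&*a, "hello world");
}
```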

Overall it seems that the two pointer construction is more flexible and allows avoiding some expensive copies while having no major downsides, but Rust has opted not to use it.

Why did Rust choose to use the one pointer construction for Arc? Is there some major benefit to the one pointer construction or a major downside to the two pointer construction that I have missed?


I suspect you're still thinking of the 20th century, when a tiny memory allocation was no major downside.

In today's world, where a single unpredicted memory access takes as much time as moving about 1KiB of data from one place in memory to another, two pointers versus one pointer plus a tiny extra allocation is a big deal.

What you are describing is, essentially, how std::shared_ptr works, and it is a significant slowdown in most cases.

Heck, once upon a time, when CPUs were slow and RAM was fast, C++ even used copy-on-write strings, and that was a good optimization. But today, times have changed: reducing pointer chasing is important, and Rust's Arc does precisely that.

It's not possible to make Arc faster than std::shared_ptr all the time (and std::make_shared exists to reduce the impact of pointer chasing), but if your code is dominated by conversions from String to Arc<str>, you are doing something really strange: usually you either pass the String around without any need for Arc<str>, or you create an Arc<str> and then use it for a significant time, which means the one-time cost of copying the data is amortized over many accesses that are faster.

I don't think the substring trick would work because on drop, the Arc has to know what pointer to free.

For other uses, it seems like you can get almost the layout you describe, with cheap (non-cloning) construction, by using Arc<Box<T>> or Arc<Vec<T>> or Arc<String>.

(Of course that costs an extra dependent read to get at the data, so it's not necessarily a good idea.)
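For instance, the Arc<String> route really does keep the original buffer in place (at the cost of that extra dependent read), which can be checked with a small example:

```rust
use std::sync::Arc;

fn main() {
    let s = String::from("a long string we don't want to copy");
    let buf = s.as_ptr();
    // Arc::new moves only the three-word String header into the Arc
    // allocation; the heap buffer holding the bytes stays where it was.
    let a: Arc<String> = Arc::new(s);
    // Same buffer, now reached through one extra level of indirection.
    assert_eq!(buf, a.as_ptr());
}
```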


I don't think the substring trick would work because on drop, the Arc has to know what pointer to free.

That's a fair point.

What you are describing is, essentially, how std::shared_ptr works and it is significant slowdown, in most cases.

Do you know of any benchmarks that demonstrate this? You might be right, but caching in modern CPUs is so convoluted that I'm hesitant to believe it without concrete evidence. It would also be interesting to know just how significant the slowdown is in practice.

For code that really needs different trade-offs from Arc<str> or Arc<String>, you can write your own shared pointer type, or use one like hipstr.

hipstr essentially uses (*const [u8], Arc<Vec<u8>>). You can create one from Vec or String without copying. It does not require a dependent pointer read. It can point to a subslice, but still deallocate the whole buffer on drop. (The downside is that it is a "fatter" pointer than Arc<[u8]>. But it can also use this extra space to store small strings inline, with no allocations.)
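A minimal sketch of that idea using only std types (the name `SharedStr` is made up here, and this is not hipstr's actual API or layout): the view holds an Arc to the whole buffer plus a byte range, so every slice shares one refcount and whichever clone drops last frees the whole String, which also answers the "what pointer to free" concern:

```rust
use std::sync::Arc;

// Hypothetical shared-substring type: an Arc to the full buffer plus a range.
struct SharedStr {
    buf: Arc<String>,
    start: usize,
    end: usize,
}

impl SharedStr {
    fn new(s: String) -> Self {
        let end = s.len();
        SharedStr { buf: Arc::new(s), start: 0, end }
    }

    // A subslice shares the refcount; no bytes are copied.
    fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && self.start + end <= self.end);
        SharedStr {
            buf: Arc::clone(&self.buf),
            start: self.start + start,
            end: self.start + end,
        }
    }

    fn as_str(&self) -> &str {
        &self.buf[self.start..self.end]
    }
}

fn main() {
    let s = SharedStr::new(String::from("hello world"));
    let w = s.slice(6, 11);
    drop(s); // the buffer stays alive because `w` still holds an Arc to it
    assert_eq!(w.as_str(), "world");
}
```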


hipstr looks really useful, thanks for pointing it out!

You can find lots of articles that discuss this. Articles like "std::shared_ptr is an anti-pattern" or "Memory and Performance Overhead of Smart Pointers", which conclude that shared_ptr is quite slow, are not uncommon.

And nobody stops you from writing your own implementation and benchmarking it.

The mere fact that C++ includes std::make_shared, along with the advice to never construct a std::shared_ptr from a raw pointer, is very telling: that advice essentially makes you use std::make_shared the way Rust uses Arc, while still carrying extra overhead on top of Arc.

Thus the proper benchmark is even harder to perform: you would need to measure the win this scheme brings and compare it to the overhead of something that is almost never used (but needs to be kept in the code anyway).

In the case of mutexes, poisoning is widely considered something that is almost never needed yet is still paid for; with Arc the decision went the other way.

P.S. But yeah, it would actually be interesting to have numbers: Arc was designed more than 10 years ago and std::shared_ptr is even older, which means the situation could have changed since then. It also depends on usage patterns and on the properties of the memory allocator, which makes the situation even more interesting… and harder to benchmark.


At best it's "it depends". It's a different tradeoff, which has its own downsides.

If the counters get allocated from a pool of small objects (a common design in modern allocators), then they may suffer from false sharing and tank the performance of refcount updates, due to sharing cache lines with other Arc instances or other mutable objects. The counters would have to be padded to cache-line size, which increases memory usage.
In the current setup the payload happens to work as padding between counters.
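The memory cost of such padding is easy to quantify. A hypothetical sketch (the name `PaddedCounts` is made up, and 64-byte cache lines are assumed): two usize counters occupy 16 bytes on 64-bit targets, but padding them out to a cache line quadruples that.

```rust
use std::sync::atomic::AtomicUsize;

// Hypothetical stand-alone counter block, padded to a 64-byte cache line
// to avoid false sharing with neighboring small allocations.
#[repr(align(64))]
struct PaddedCounts {
    strong: AtomicUsize,
    weak: AtomicUsize,
}

fn main() {
    // The alignment forces the size up from 16 bytes to a full cache line.
    assert_eq!(std::mem::size_of::<PaddedCounts>(), 64);
}
```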

Reuse of an existing allocation only helps if you already have an allocation to reuse. It's often possible to create the Arc directly from some temporary/computed/on-stack data, and in that case you're avoiding calling the allocator twice.
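That single allocator call can be observed with a counting allocator (a sketch: the `Counting` wrapper around the system allocator is made up for illustration):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical allocator wrapper that counts calls to alloc.
struct Counting;
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, l: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(l)
    }
    unsafe fn dealloc(&self, p: *mut u8, l: Layout) {
        System.dealloc(p, l)
    }
}

#[global_allocator]
static A: Counting = Counting;

fn main() {
    let before = ALLOCS.load(Ordering::Relaxed);
    // The payload is built on the stack and moved into the Arc allocation:
    // one allocator call covers counters and data together.
    let a = Arc::new([1u64, 2, 3, 4]);
    let after = ALLOCS.load(Ordering::Relaxed);
    assert_eq!(after - before, 1);
    assert_eq!(a[2], 3);
}
```

With the two-pointer layout, the same construction would need two allocator calls: one for the counters and one for the data.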

And finally, it's possible to make your own Arc-like type if you have a case where this matters. Libstd types are slightly magical (when working with unsized types), but the basic building blocks (UnsafeCell, atomics, raw pointers) are available in stable user land.
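For example, the two-pointer layout from the original question can be sketched entirely in user code (the name `TwoPtrArc` is made up; this is a simplified sketch with no Weak support and no Send/Sync audit, using the same orderings std's Arc uses):

```rust
use std::ptr::NonNull;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Hypothetical two-pointer shared pointer: counters and data live in
// separate allocations, as proposed in the question.
struct TwoPtrArc<T> {
    counts: NonNull<AtomicUsize>,
    data: NonNull<T>,
}

impl<T> TwoPtrArc<T> {
    fn new(value: T) -> Self {
        TwoPtrArc {
            // Two separate allocations: one for the count, one for the data.
            counts: NonNull::from(Box::leak(Box::new(AtomicUsize::new(1)))),
            data: NonNull::from(Box::leak(Box::new(value))),
        }
    }
}

impl<T> Clone for TwoPtrArc<T> {
    fn clone(&self) -> Self {
        unsafe { self.counts.as_ref() }.fetch_add(1, Ordering::Relaxed);
        TwoPtrArc { counts: self.counts, data: self.data }
    }
}

impl<T> std::ops::Deref for TwoPtrArc<T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { self.data.as_ref() }
    }
}

impl<T> Drop for TwoPtrArc<T> {
    fn drop(&mut self) {
        if unsafe { self.counts.as_ref() }.fetch_sub(1, Ordering::Release) == 1 {
            fence(Ordering::Acquire);
            unsafe {
                drop(Box::from_raw(self.data.as_ptr()));
                drop(Box::from_raw(self.counts.as_ptr()));
            }
        }
    }
}

fn main() {
    let a = TwoPtrArc::new(String::from("shared"));
    let b = a.clone();
    drop(a);
    assert_eq!(&*b, "shared");
}
```

Benchmarking a type like this against std's Arc would be one way to settle the question raised above.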


Just to note, this isn't actually true given Rust's (de)allocation interface: since dealloc requires providing the allocation size, going from String to Box<str> requires a realloc in most cases, to eliminate the excess unused capacity. The allocator might be able to do this shrink as an in-place no-op, or it might be a full reallocation.
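The shrink requirement is visible in the types: a String carries a capacity that may exceed its length, while Box<str> has none, so its Drop reports exactly len() bytes back to the allocator. A small illustration:

```rust
fn main() {
    let mut s = String::with_capacity(100);
    s.push_str("hi");
    assert!(s.capacity() >= 100);
    // Box<str> has no capacity field; its Drop will hand the allocator a
    // Layout of exactly len() bytes, so the allocation must be shrunk here.
    let b: Box<str> = s.into_boxed_str();
    assert_eq!(&*b, "hi");
}
```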

