Box<> memory overhead

Hi there, I am experimenting with some toy programs, and wanted to understand how much memory overhead does Box<> type incur? I am not able to decide if it takes 8 or 16 bytes?
Here are a few examples I tried and am reporting the memory I measures by looking at heap allocations via Instruments on my mac. I am using the latest stable rust.

For each of examples, I am using a Vec<> with_capacity(16777216) to rule out any extra slots allocated inside the vector. (this ensures that vec.len == vec.capacity):

- struct S1 {
    a: i64,
}
- struct S2 {
    a: i64,
    b: i64,
}
- Box<S1>
- Box<S2>
*** Memory Measurements ***
Vec<> of size 16777216
Mode	 MB	   Bytes/Obj
S1	     128	8
S2	     256	16
Box<S1>	 384	24
Box<S2>	 384	24

Why is it that Box<> incurs an extra of about 16Bytes between S1 and Box<S1>, but only 8 Bytes overhead between S2 and Box<S2>?

It appears that in both the cases when allocating Box and Box on Vec, there is an initial single malloc of 128MB (for the Vec::with_capacity()), and then 16777234 allocations of 16 Bytes each, totaling 256MB in both the cases.

Does this appear to be some fault in the measurement? Or something to do with allocator? Or is this totally expected?

Allocators commonly round allocation sizes up to enable more efficient tracking. It may be that your system allocator’s minimum allocation size is 16 bytes.

3 Likes

Yes, this looks like the case: where each allocation is of min 16 Bytes. But in this case, would the extra 8 bytes allocated be just wasted?

In other words, if I allocate a ton of 8-byte objects on my setup (OS X in this case), then half the allocated memory will be always wasted?
(I have edited the original post with output from Instruments and screenshots)

1 Like

In general, lots of small allocations is detrimental from both a CPU performance and memory efficiency point of view. If you have to do it, it is better to allocate and deallocate lots of objects at once, look up “arena allocators” for more on this strategy.

4 Likes

Allocators often allocate a bit more memory than you’ve requested for bookkeeping purposes. For example, most allocators are designed around the C language malloc and free operations. In C, you pass the size of the desired allocation to malloc, but not to free. This means that free has to somehow decide how much memory you are returning to the allocator.

The most common way of doing that is to reserve a bit of additional space, often just before the start of the allocated memory, to store the size. The amount of space allocated is typically one usize (in Rust, size_t in C), which on your platform is 8 bytes.

Some allocators will store additional information, like a magic constant so they can try to detect you free-ing something they did not allocate. This is often packed into the same 8-byte bookkeeping word, but sometimes costs more words.

Finally, as @sfackler noted, allocators tend to keep track of memory by lists of common sizes, often powers of two. So small or odd allocations will tend to get rounded up to a more convenient size.

So, the behavior you’re observing is specific to (1) your platform (Mac) and version, and (2) the allocator. You can change the memory allocator in Rust, which will result in different behavior.

As @HadrienG said, if you want to allocate a lot of small objects without overhead, you probably don’t want to put each one in a separate heap allocation (Box etc.). The easiest way to manage a bunch of small objects in a single heap allocation is by putting them in a Vec.

5 Likes

One way to test the “allocator theory” is to also define S3, S4, S5, S6, S7, S8

and see if the same pattern emerges.

1 Like

Thanks, everyone. It appears that the default configured allocator (I still need to verify which one is it), is allocating memory only in 16B multiples.

Because the System allocator uses standard malloc(3) family, and x86_64 abi requires that allocated memory from them must have at least 16 bytes alignment. But it’s not the Rust allocator’s requirement, so it’s totally legal to write global allocator to produce (literally)odd address for Box<u8>.

2 Likes

You could use the jemallocator crate it’s smallest size class is 8 bytes. But Boxing tons of small objects implies some design issue, are you sure you can’t put multiple in other collections, Vec, HashMap, Etc.

I’m not convinced this statement is true. A typical way of storing a tree/forest involves a “small object” (Rc/Box) for every interior node.

1 Like

Just because something is typical doesn’t make it efficient. In fact typical patterns tend to be less efficient in order to be more understandable and simpler to implement. (this is usually a good thing, as simpler constructs are less error prone and easier to maintain)

1 Like

I should have more specific I meant if the object was 8 bytes. It takes at least 24 bytes for a node in a standard pointer based tree on 64bit(2 child pointers and data that’s at least 8 bytes aligned). But there are gains from using u32 (or smaller) indexes into a Vec for trees and graphs. Smaller, better locality, less allocator overhead and passes the borrow checker easily. Though the borrow checker can’t help find bad “pointers”.

@KrishnaSannasi : I am not claiming that “one RC/Box per node” is the most efficient way of representing a tree.

I am arguing against “tons of small objects implies some design issue” by pointing out that there exists perfectly reasonable situations for having tons of small objects.

I also agree that there are situations where using tons of Rc/Boxes is too slow and using indexes is better.

1 Like