Understanding Box

I'm wondering whether I understand Box.

A Box points to data on the heap, so its size on the stack (pointing to the heap) is known at compile time. Not only does a Box take a fixed amount of space on the stack, it takes a small amount of space on the stack (handful of bytes, if you have big hands on current machines) to point to potentially a lot of data (many megabytes to gigabytes to how-big-is-your-RAM). And apparently a Box can point to recursive data. (Hand-waving here, that's what they say, I haven't played with recursive data structures in Rust, yet.)

But now I am running low on the virtues of Box, right?

Stepping back…

Computers very much live in the call stack. Rust is about lifetimes and the clean case is analyzing ownership of data on the stack. Rust has ways of escaping this stack-centric world (Arc<Mutex<>s and friends, or giving data to other threads when it is Send or…), but Box is not one of the ways of escaping stack-world, Box is thoroughly fixed in a stack-centric world…right?

Thanks,

-kb, the Kent who sometimes asks legalistic questions

Indeed, those virtues are limited. The ones you’ve listed include

  • small size of the Box<T> when T can be large
    • which can be a virtue because it makes moving the Box a lot quicker
    • or can be useful for data that’s unlikely to be accessed to be kept away from your important data,
      so that for example cache-locality benefits on that important data can be better
    • edit: additionally, of course, at some point, preventing stack overflows by just reducing stack usage can be a use-case for Box<T> with a large T, too, though if it’s a very large T, the construction of a single Box<T> in the first place can be hard without stack overflow issues
  • fixed size, no matter what you're pointing to, which enables recursively defined data structures that would otherwise error with some note of the type having “infinite size”

another important virtue that you haven’t clearly addressed, but also in the “fixed size, no matter what you're pointing to” category is

  • fixed size, no matter what you're pointing to, which can be dynamically sized data, so that
    • you can abstract over array lengths of owned arrays, and treat [T; N] and [T; M] as the same type by putting them into a box, as Box<[T; N]> Box<[T; M]>, then erasing the length by coercing to Box<[T]>
    • you can abstract over owned values of types implementing a particular trait with boxed trait objects

but maybe you meant to not address the somewhat special case of “Box of !Sized data” and look just at normal types for targets, in which case, besides the support for recursive structures, it does indeed boil down to Box being just a tool to add a heap allocation and a level of indirection for essentially zero benefit beyond the fact that you now have a heap allocation and a level of indirection, which indeed is often not a benefit at all (but I’ve outlined the corner cases where the indirection can be beneficial, in case of data where you access the owner often, but the data only very very rarely).

Actually, now I’m noticing I’ve missed a subtly different use-case. Whereas the think I explained above was the case where you want e.g. the cache-locality of large, unlikely to be used data behind a Box, a perhaps even more compelling case is when you are unlikely to have a Box at all. Namely, e.g. Result<T, Error> with a large but unlikely to occur Error type can inflate the size of the whole Result a lot, unnecessarily, so Result<T, Box<Error>> can be beneficial to keep the Result small (for small T). (Presumably, as Result<T, _> is commonly a return type, and you often use the same error type in many places, this thus saves stack space and/or register usage for many move operations for those returns.)

10 Likes

Thank you for your careful reply, I'm going to have to read that a couple more times, thank about it, and then again tomorrow.

But my idea that Box is stuck-in-stack-land is still true?

-kb

(I really like the cache aspects of your answer.)

Yes, Box offers no different ownership model than directly owned data on the stack. As such, it is sometimes taught that Rust learners don’t really need to learn about stack and heap too deeply at all (at least initially) because Rust’s ownership model works independently from stack and heap; while it models the stack, the data, e.g. through Box can truly live on the heap, too.

3 Likes

Yes, the benefit of a detailed error when there is an error, but not paying that price in the "happy case" of no error is a real virtue: The size can be small in must cases yet detailed in some.

-kb

For cache locality, there can be other solutions by the way. Some array of structs (or let's use tupless), like Vec<(SmallCommonlyAccessed, LargeSeldomlyAccessed)> can have the operations that only use the commonly accessed part benefiting from cache locality by usage of Vec<(SmallCommonlyAccessed, Box<LargeSeldomlyAccessed>)>, indeed, however, arguably even better could be usage of an “structs of arrays” approach, using a (Vec<SmallCommonlyAccessed>, Vec<LargeSeldomlyAccessed>)-like structure instead (though ideally, you would have both these Vecs also have a common len and capacity - I’m not sure how easy struct-of-array is to pull off with existing crates, as I’m not familiar with the relevant crates.

The data layed out this way has many extra benefits; if you only access the SmallCommonlyAccessed values, they are even more compactly packed with even the Boxes out of the way, you also save all possibly existing wasted padding space, and in the occasion that you do use some LargeSeldomlyAccessed values, you have none of the downside of having used Box, especially a linear scan is still nice to caches.

2 Likes

I'm finally getting good at (ab)using Arc<Mutex>>, and I am guessing I am tossing them into my code too easily, which has me asking fundamental questions about data ownership, and concluding realizing that there is a big split between the land of the Stack and the land of the Heap, a split that Rust does not magically smooth over.

So if I have data that cannot live in a single-threaded call hierarchy, then I have to use some container such as Arc<Mutex<> in which to put it, and Box ain't it.

But efficiently so, and do I really need to.

-kb

Note that Mutex can be useful without Arc, too, e.g. they can live in globals; they can live up the stack, even from other threads, due to thread::scope, similarly they can live in an owning future in the async world when you use join and similar ways (instead of spawning your data-sharing futures as separate tasks) [of course, make sure not to hold over await and dead-lock], etc…

(Similarly, of course, Arc has many use-cases without Mutex, too, especially whenever you want to cheaply share immutable data between many places without clear ownership. Or also for copy-on-write use cases.)

Also, indexing can be a powerful alternative to usage of pointers, and can simplify ownership, especially for graph-like structures that are otherwise prone to making you use Arc (and Mutex) very quickly.

1 Like

Don't put too much significance on the stack. A Box does not have to be on the stack, and in fact frequently isn't. A common beginner misunderstanding is that Rust types can be divided into “keeps their data on the stack” and “keeps their data on the heap”, but that is not at all true: rather, some types like Box create heap allocations, and that is intrinsic to them, but the owner of the Box value can put that data (the pointer) wherever it likes, including in another Box, a recursive data structure built out of Boxes, a Vec, etc.

The Rust lifetime system is very much designed to fit the needs of stack variables, but it can be used other places, and ownership is not really about stack vs. heap at all.

The more meaningful division of kinds of ownership-of-data, instead of “stack vs. heap”, is “inline vs. out-of-line/indirect”. Data can be stored directly in the type that owns it, or it can be stored in some memory pointed to by some pointer that is stored in the type. (Or other things can happen, but these are the bread-and-butter common cases that applications' data structures are built out of.)

6 Likes

If I understand the axis correctly, what you're calling "stack land" and "heap land", I would call "[exclusive] ownership" and "shared ownership".

1 Like

Hmmm.

So though Box doesn't solve any shared ownership problems but that doesn't mean it isn't useful in purely heap-based circumstances. And there its feature of a little-ish fixed sized pointer to arbitrary sized data is useful. And, being a smart pointer that might point to different kinds of things, makes very interesting data structures possible.

My imagined interesting data structure might be owned by a single thread, or it might have a single big mutex policing multiple access.

I think I need to look more at what I've dismissed as "recursive data structures".

Very interesting posts here, thanks to all. I need to sleep on it, and reread this thread tomorrow.

Thanks again,

-kb

Not just shared vs exclusive, but I think I was being distracted by the hierarchical aspect of the stack. It is important to how computers run and what Rust does, but Box<> is not just useful in Stack Land, it is very useful in Heap Land, too.

I need to:

  • Look more at how one builds nasty, complicated data structures in Rust.
  • Think about cache issues and indirection into cache-savvy single-user containers.

Cool stuff.

-kb

Another aspect of Box that hasn’t been mentioned yet is that its contents have a stable memory address, even when the Box is moved, which means that you can safely create and pass around a Pin<Box<T>> or otherwise pull interesting pointer tricks in unsafe code.

1 Like

Pin<Ptr<T>> is always moveable (all owned[1] values are), and every T value in a Pin<Ptr<T>> has a stable address unless T: Unpin (that's its purpose). And if not... you can move it.

The box's allocation may have a stable address when moved... but every other pointer still points to the same place when moved, too.

I think it's only Box's ownership properties making it stand out in the Pin world.

(IIRC its ownership semantics are too strong for some stable address tricks, but I can't back that up immediately.)


  1. like, in a variable ↩︎

2 Likes

That’s probably true, but there are at least some that Miri accepts. That’s no guarantee that it’s actually legit, of course, but it’s at least strong evidence.

(Example adapted from here)

1 Like

I was thinking of things like this. Looks like it's somewhat of an open question where the guarantees lay.

(Not diving the details myself, at least not now.)

1 Like

Box is important for FFI. If you want to give an owned value to a C program it should be #[repr(C)]. At the cost of some indirection you can hide (Sized) Rust types behind a pointer with Box.

2 Likes

Now that y'all bring it up, let me add another question to my queries about Box: If I call another function and simply pass in a Box (that is, a smart pointer to the Box on the heap), the value will "move" (a) to that function and once that function returns I can no longer access it. But the Box's contents in the heap didn't have any reason to move.

When (and why) might a Box's location in the heap "move" (b)?

And, above my "(a)" and "(b)" are different senses of the word "move", right?

-kb

The Box's allocation address doesn't change,[1] but you can move the value stored there out. (Pin restricts the moving of values.)

fn example<T>(mut bx: Box<T>, mut other: T) {
    // Box is special in allowing you to move out of it with `*`
    let this = *bx;
    // And to re-initialize it too
    *bx = this;
    
    // You can also move values this way
    std::mem::swap(&mut *bx, &mut other);
}

When you move the Box[2] you're just copying a pointer (and making the place you're copying from uninitialized, because Box<_> is not Copy). You're not moving the value it points to and you're not changing the allocation address.


  1. (b) ↩︎

  2. (a) ↩︎

3 Likes