I have a few related questions about big structs. Assume a big struct is in the neighborhood of 1KB.
Say I call Box::new(some big struct value). Naively I expect the big struct to be first memcpy'd to Box::new's stack frame, then memcpy'd again into the heap. Is that really how it works in a non-optimized build?
We do tell people to go ahead and return big structs by value, not boxed, right? And we tell them not to worry about the copy, right?
What about passing big structs by value? (It seems like the copies are not reliably optimized away in practice, so I hope we tell people to pass them by reference.)
Isn't this advice questionable if we are in some cases relying on the stack being big enough to hold several copies of whatever the program throws at it, and in other cases relying on the Rust optimizer, and we don't even know which case is which because the language makes no promises at all about stack usage?
Am I crazy to feel that either Rust is a little too high-level here, or else the advice we give people is kind of bad? I want to be able to reason about stack usage in order to be confident my program won't overflow the stack and crash. I want to be able to reason about the baseline performance of the program because I do actually somewhat care about things like CI finishing in 5 minutes rather than 30.
Box::new() is specially marked in a way that strongly encourages it to be compiled into an allocation followed by a copy, without any intermediate stack slot. I’m not sure whether this applies to all non-optimized builds, but it applies to some; for example, I compiled this function on the playground, with the debug profile selected,
pub fn example() -> Box<[u8; 400000]> {
let x = [0; 400000];
Box::new(x)
}
and got code which contained no call to Box::new().
Well, there’s big and there’s big. 1 KB deserves some caution, but when things get to 1 MB you want to take special care to make sure that it never hits the stack.
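One way to "take special care" (a sketch of a common pattern, not something prescribed in this thread) is to route truly large allocations through Vec, which guarantees the bytes start life on the heap, whereas Box::new([0u8; N]) may build the array in a stack slot first in unoptimized builds:

```rust
// Build a 1 MiB boxed slice without ever materializing 1 MiB on the stack.
// vec! allocates and zero-fills directly on the heap.
fn big_boxed() -> Box<[u8]> {
    vec![0u8; 1 << 20].into_boxed_slice()
}

fn main() {
    println!("{}", big_boxed().len()); // prints 1048576
}
```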
Most calling conventions effectively pass large values by reference (by a pointer, often stored in a register).
All this is true even when you have small values. Are you thinking about how many local variables your function has? Are you adding up their sizes to see whether they total several KB even though no single one is over 1 KB?
If this is the position you’re in, you should consider using opt-level = 1. Just a bit of optimization will greatly improve performance without greatly increasing compile time, and the main thing it can hurt is a debugger’s view of the program, which you usually won’t be using in CI.
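For CI specifically, that's a one-line change to Cargo.toml (standard Cargo profile syntax):

```toml
[profile.dev]
opt-level = 1
```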
Yes, there are two different copies (Box::new itself is inlined).
In lines 323 and 331, there are two different memcpy calls that are clearly redundant (it copies 1024 zero bytes from rsp - 1032 to rsp - 8, and then to the return value of exchange_malloc).
If you optimize the code, all the copies get optimized away and it calls memset directly in the allocated memory. (Compiler Explorer)
If you want to make sure your program doesn’t overflow the stack, you can test with a lower stack size than the default and/or run it with a bigger stack size (either through linker configuration or stacker).
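Besides linker flags, you can do this per-thread from within the program itself, since std::thread::Builder lets you pick the stack size. A minimal sketch (the 64 KiB limit, the recursion depth, and the 1 KiB buffer are arbitrary numbers for illustration):

```rust
use std::thread;

// Each frame holds about 1 KiB, so recursion depth bounds stack use directly.
fn recurse(n: u64) -> u64 {
    let buf = [0u8; 1024];
    if n == 0 { buf[0] as u64 } else { recurse(n - 1) + 1 }
}

fn main() {
    // Run the workload on a thread with a deliberately small 64 KiB stack;
    // if it finishes, the code fits in far less than the platform default.
    let handle = thread::Builder::new()
        .stack_size(64 * 1024)
        .spawn(|| recurse(10))
        .expect("failed to spawn thread");
    println!("{}", handle.join().unwrap()); // prints 10
}
```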
Such a calling convention does not prevent copies though. The calling function may make a copy and then pass a pointer to that copy because it can't be sure the called function won't modify its copy.
Thus it's better to pass large structs by reference.
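To make the two signatures concrete (a hypothetical 1 KB struct; `Big`, `consume`, and `inspect` are made-up names for illustration):

```rust
// A hypothetical ~1 KB struct.
struct Big {
    data: [u8; 1024],
}

// By value: the caller may have to copy `Big` into a temporary and pass a
// pointer to that, because `consume` owns its argument and may mutate it.
fn consume(b: Big) -> u8 {
    b.data[0]
}

// By reference: no copy; the caller passes a pointer to the original.
fn inspect(b: &Big) -> u8 {
    b.data[0]
}

fn main() {
    let b = Big { data: [7; 1024] };
    println!("{} {}", inspect(&b), consume(b)); // prints "7 7"
}
```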
This is a good point. I only worry about this when putting a largeish value on the stack, or doing deep recursion.
But maybe that is reasonable? A stack is typically 2 MiB and a local variable is typically 8 bytes, sometimes 16 or 24. You have to work pretty hard to get enough small locals and temporaries to add up to a stack frame more than 2 KiB, 0.1% of the limit. I've only observed frames over 4 KiB with REALLY unusual code (a bytecode interpreter written as a single function tens of thousands of lines long).
Maybe I'm more likely to overflow the stack by using too many tiny functions and forgetting they aren't inlined away in debug builds. And it's true, I don't worry about that either. It's just really hard in practice for tiny numbers to add up to a large one, crossing 5 orders of magnitude.