Hello guys. I am trying to understand memory usage and stack vs. heap when it comes to performance a bit better. I think I'm aware of the general rule: stack is always faster than heap.
For the particular task I am trying to solve, I have a function that returns a 6-byte buffer, and I need to gather 50 such elements and return them. There are four functions I came up with:
// assume f() is some DB call or whatever
fn example0(f: &dyn Fn() -> [u8; 6]) -> Vec<Vec<u8>> {
    let mut result: Vec<Vec<u8>> = vec![];
    for _ in 0..50 {
        let data: [u8; 6] = f();
        result.push(data.to_vec());
    }
    result
}

fn example1(f: &dyn Fn() -> [u8; 6]) -> Vec<Vec<u8>> {
    let mut result: Vec<Vec<u8>> = Vec::with_capacity(50);
    for _ in 0..50 {
        let data: [u8; 6] = f();
        result.push(data.to_vec());
    }
    result
}

fn example2(f: &dyn Fn() -> [u8; 6]) -> Vec<[u8; 6]> {
    let mut result: Vec<[u8; 6]> = Vec::with_capacity(50);
    for _ in 0..50 {
        let data: [u8; 6] = f();
        result.push(data);
    }
    result
}

fn example3(f: &dyn Fn() -> [u8; 6]) -> [[u8; 6]; 50] {
    let mut result: [[u8; 6]; 50] = [[0; 6]; 50];
    for i in 0..50 {
        let data: [u8; 6] = f();
        result[i] = data;
    }
    result
}
I am pretty sure example0 is the worst, since no size information is known, and a lot of allocations would have to be made.
example1 is probably a bit better, since the outer Vec knows it needs room for 50 elements, but the sizes of the inner elements are still not known.
example2 should be pretty decent, since it "knows" that it will hold 50 [u8; 6] elements.
Finally, example3 knows exactly how many bytes are needed, so it should be the most efficient.
Am I correct in my understanding? What are the main differences between 2 & 3, considering in both cases all sizes are known at compile time?
I think you are mostly correct in your understanding, but there are a couple more factors besides knowledge of size.
For example, in examples 0 and 1 you have a Vec<Vec<u8>>. This is relatively slow because you first load a pointer from the outer Vec and then follow it to load the bytes it points to. That extra indirection means your CPU caches won't be utilized very well.
As for example 2: yes, the sizes are known at compile time, but for Vec::with_capacity it doesn't make a difference whether the value is known at compile time or at runtime. The difference compared to a program without that line is that the Vec doesn't need to reallocate while your loop is running, because it has all the needed space from the beginning; it still has to allocate once, though.
That is also the difference from example 3: there, no allocation happens at all, so you simply do less work, which is always faster. That is also a big reason why the stack is faster: you don't need to call the allocator.
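A small sketch of what with_capacity changes in practice (the exact capacity values depend on Vec's growth strategy, so treat the comments as an assumption about typical behaviour):

fn main() {
    let mut grown: Vec<[u8; 6]> = Vec::new();
    let mut preallocated: Vec<[u8; 6]> = Vec::with_capacity(50);
    for _ in 0..50 {
        grown.push([0; 6]);
        preallocated.push([0; 6]);
    }
    // `grown` had to reallocate several times while its capacity grew
    // (roughly doubling); `preallocated` got all its space in one
    // allocation up front and never moved its buffer.
    println!("grown: len {} capacity {}", grown.len(), grown.capacity());
    println!("preallocated: len {} capacity {}", preallocated.len(), preallocated.capacity());
}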
example0 has about 55 heap allocations: 50 for the little vectors, ~5 for result as it grows dynamically.
example1 has 51 heap allocations: 50 for the little vectors, 1 for result
example2 has 1 heap allocation for result
example3 has 0 heap allocations. It is the only example where the data is on the stack.
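If you want to verify those numbers yourself, here is a rough sketch using a counting global allocator. CountingAlloc is my own name, not a standard library item, and it assumes the example functions from the question are in scope; growth reallocations are counted too, because GlobalAlloc's default realloc goes through alloc.

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // count every allocation, then defer to the system allocator
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    let _data = example2(&|| [1u8; 6]);
    let after = ALLOCATIONS.load(Ordering::Relaxed);
    println!("example2: {} allocations", after - before);
}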
example3 can also be written in a slightly cleaner way by using ArrayVec.
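Something like this, as a sketch of the ArrayVec variant (assuming the arrayvec crate with its 0.7-style const-generic API; example3_arrayvec is my own name). The data still lives on the stack, and you avoid the dummy [[0; 6]; 50] initialization:

use arrayvec::ArrayVec;

fn example3_arrayvec(f: &dyn Fn() -> [u8; 6]) -> [[u8; 6]; 50] {
    let mut result: ArrayVec<[u8; 6], 50> = ArrayVec::new();
    for _ in 0..50 {
        result.push(f()); // would panic if pushed past the capacity of 50
    }
    // into_inner succeeds here because the ArrayVec is exactly full
    result.into_inner().unwrap()
}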
Having data on the stack is not necessarily more efficient than having it on the heap, so it's not clear that example3 is more efficient than example2; it depends on what happens with the data later. When keeping the data on the stack you may have to move it between stack frames of different functions or into some other data structures, which is a performance hit. So typically it's more convenient to allocate relatively large data like that on the heap, as in example2. Heap allocation is not that slow.
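One hedged illustration of that trade-off (example4 is my own name, not from the thread): keep the fixed-size array type, but put it behind a single Box, so returning it only moves a pointer between stack frames instead of 300 bytes:

fn example4(f: &dyn Fn() -> [u8; 6]) -> Box<[[u8; 6]; 50]> {
    // one heap allocation for the whole 300-byte block
    let mut result = Box::new([[0u8; 6]; 50]);
    for i in 0..50 {
        result[i] = f();
    }
    result
}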
I would have answered it in simple words like this:
An array is just one contiguous block of storage, on the stack when it's a local variable, and this is also true for a nested array whose elements are arrays themselves. That compact block fits well in the cache.
So your example3 should be optimal.
A vector, by contrast, introduces an indirection: a struct on the stack with 3 fields, the length, the capacity, and a pointer to the actual data on the heap. The Vec needs to allocate its buffer and free it when it is no longer used. And all the inner vectors' buffers might be scattered across RAM, so cache efficiency might be bad.
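To make that concrete, a quick sketch comparing the sizes of the types involved (the 24 bytes assume a typical 64-bit target):

fn main() {
    // the nested array is one flat block of 300 bytes
    println!("{}", std::mem::size_of::<[[u8; 6]; 50]>()); // 300
    // a Vec header is 3 machine words (pointer, capacity, length);
    // the bytes it owns live in a separate heap buffer
    println!("{}", std::mem::size_of::<Vec<u8>>()); // 24
    // a Vec<Vec<u8>> stores fifty such 24-byte headers contiguously,
    // while each inner buffer is its own heap allocation
    println!("{}", std::mem::size_of::<Vec<Vec<u8>>>()); // 24
}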
example0 has about 55 heap allocations: 50 for the little vectors, ~5 for result as it grows dynamically.
Pretty interesting, so with_capacity() is definitely useful if the size is known beforehand.
When keeping the data on the stack you may have to move it between stack frames of different functions or into some other data structures which is a performance hit.
That makes sense, since instead of passing around 300 bytes, it's just a pointer. I guess this would also be beneficial when there are a lot of concurrent requests being served?
Yes, you are right for the Box case. What is even more important, my statement "A struct on the stack with 3 fields" is also not always true, as this struct can itself reside on the heap, e.g. when we have a vector whose elements are vectors.
To be clear, you are describing a Vec. A Vec is a struct (not an array). You are right: like any other struct, a Vec can live on the stack or on the heap. Consider:
let v1 = vec![1u32, 2, 3];            // Vec struct on the stack
let v2 = Box::new(vec![4u32, 5, 6]);  // Vec struct on the heap, inside the Box
println!("{}, {}", v1[1], v2[2]);     // prints "2, 6"
Now, the Vec of course keeps its actual data content in a buffer on the heap, no matter where the Vec struct itself is.
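A small sketch of that point (the printed addresses will of course differ from run to run): wherever the Vec struct lives, its element buffer is a separate heap block:

fn main() {
    let v1 = vec![1u32, 2, 3];
    let v2 = Box::new(vec![4u32, 5, 6]);
    // v1's struct sits in main's stack frame; v2's struct sits on the heap inside the Box
    println!("v1 struct at {:p}, v1 buffer at {:p}", &v1, v1.as_ptr());
    println!("v2 struct at {:p}, v2 buffer at {:p}", &*v2, v2.as_ptr());
}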