Strange (?) comment in raw_vec

I just noticed this comment, which I find a little strange

        // Allocators currently return a `NonNull<[u8]>` whose length
        // matches the size requested. If that ever changes, the capacity
        // here should change to `ptr.len() / size_of::<T>()`.

here in the standard library raw_vec source.

How can it know?

[ I got here because my own implementation of Vec didn't use any extra capacity returned, and I wanted to see what the standard library did ]

The default allocator behaves that way, and GlobalAlloc only lets you override it with a custom allocator that also behaves that way.

Ok, I didn't think of that, but people using custom allocators and nightly could be allocating with a different allocator?

The nightly-only Allocator trait does allow for defining such an allocator. I don't think it works with #[global_alloc], but it can be used as the A parameter to Box/Vec.

The linked code must work with any type that implements Allocator though?

That's right, so for unstable users of stdlib, this implementation is not optimal.

As an afterthought on this, plain std::Vec is currently inefficient because there is currently no way for Vec to use extra bytes typically allocated by an allocator.

This could be fixed for unstable by allowing a global allocator to be specified using the "new" Allocator trait.

When Allocator becomes stable, Vec would then become efficient. But I think stable performance can improve even before Allocator is stable, if the default Global allocator was specified this way?

I think this tends to affect Vecs where the element size is not a power of two, but of course it depends on the underlying allocator.

Note that this isn't that big a deal; when it tries to realloc, if the new size fits inside the existing super-sized allocation, it should end up being a no-op.

Well it will allocate earlier as a Vec grows, because it doesn't know the actual capacity that is available. That earlier allocation will (generally) involve copying the elements. It is (I think) a clear inefficiency.

For example this program will allocate just once, but using the commented out Vec instead, it will allocate twice. Even though the same strategy is being used ( start with 4, then double ). VecA discovers that after the first allocation of 4, there is actually space for 5.

use pstd::VecA;
use pstd::localalloc::*;

fn main()
{
   let mut v = VecA::<_,Perm>::new();
   // let mut v = Vec::new();
   for _ in 0..5
   {
       v.push((99,99,99));
       println!("len={} cap={}", v.len(), v.capacity());
   }
}

What happens when you push a sixth element, forcing a realloc anyways? Now your "efficient" version is actually slower, since you have one more element you have to copy over, and you end up using the same amount of heap anyways. By being too stingy with heap space, you ended up wasting CPU cycles. This is too case-specific to be "objectively" better.

And my point was that if you reallocate to a size already covered by the allocation you were given, that shouldn't require any copying. The allocator should essentially just return the allocation back as-is. So if the allocation behind the commented-out version really had room for ~8 elements (depending on the implementation details of how much extra capacity Vec allocates), then the "reallocation" should be essentially free. Yes, it technically still has the cost of a call to the global allocator, and the checks required by whatever that allocator happens to be, but when talking about heap allocation you can usually factor those out as rounding errors.

EDIT: here's a demonstration:

use pstd::VecA;
use pstd::localalloc::*;

fn main()
{
   println!("pstd::VecA at 5 elements");
   let mut v = VecA::<_,Perm>::new();
   for _ in 0..5
   {
       v.push((99,99,99));
       println!("len={} cap={}", v.len(), v.capacity());
   }

   println!("alloc::vec::Vec at 5 elements");
   let mut v = Vec::new();
   for _ in 0..5
   {
       v.push((99,99,99));
       println!("len={} cap={}", v.len(), v.capacity());
   }

   println!("pstd::VecA at 6 elements");
   let mut v = VecA::<_,Perm>::new();
   for _ in 0..6
   {
       v.push((99,99,99));
       println!("len={} cap={}", v.len(), v.capacity());
   }

   println!("alloc::vec::Vec at 6 elements");
   let mut v = Vec::new();
   for _ in 0..6
   {
       v.push((99,99,99));
       println!("len={} cap={}", v.len(), v.capacity());
   }
}

On my machine, that results in:

pstd::VecA at 5 elements
len=1 cap=5
len=2 cap=5
len=3 cap=5
len=4 cap=5
len=5 cap=5
alloc::vec::Vec at 5 elements
len=1 cap=4
len=2 cap=4
len=3 cap=4
len=4 cap=4
len=5 cap=8
pstd::VecA at 6 elements
len=1 cap=5
len=2 cap=5
len=3 cap=5
len=4 cap=5
len=5 cap=5
len=6 cap=10
alloc::vec::Vec at 6 elements
len=1 cap=4
len=2 cap=4
len=3 cap=4
len=4 cap=4
len=5 cap=8
len=6 cap=8

(a funny side-effect is that, since you double on 5 instead of 8, you also end up with two more unused capacity)

I suppose we can imagine that the unused part of the allocation is O(1) and relatively small in practice, but suppose that the allocator rounds up all allocations below some size to the nearest greater power of two, with enough constant per-allocation overhead that the allocation size used for a small-to-medium Vec consistently needs to be rounded up to the next power of two (but such that, with the constant overhead subtracted, the capacity available to the Vec is not quite double, so doubling its capacity will genuinely reallocate).

This would consistently leave nearly 50% (and up to nearly 75%) of the allocated heap space unused (for all but large Vecs, whatever the allocator considers “large”).

Of course, this could be seen as playing devil’s advocate; you could blame this poor performance on the allocator instead of on Vec.

That's indeed the problem scenario for a non-greedy collection type. In that degenerate case, you're wasting large amounts of memory, as well as CPU cycles. In fact, if I remember correctly, most modern allocators work with buckets, which leads to exactly the situation you describe with steadily increasing powers of two. However, with no numbers or proof beyond my own imagination, I don't think that's a real issue real programs face. If it were, why would allocators (which have been optimized for decades for realistic allocation patterns) work in such an adversarial way?

In any case, remember that, if you know the allocation is large enough for 5 elements, you can just do Vec::with_capacity(5) to tell the Vec that. This has the benefit of being tuneable per-allocation, and guarantees that you won't have to reallocate at 5 even if the allocator decides to hand out a properly-sized allocation. In my opinion, it's much better to just ask the allocator nicely, rather than use underhanded manipulation to get the allocator to maybe give you what you want. :slightly_smiling_face:

Yes, most allocators will allocate in sizes that are powers of two ( although it is certainly possible to imagine an allocator with denser or less dense size classes ).

If the element size is also a power of two then there is no problem, the requested allocation of Vec will always match the allocated size exactly.

But for other element sizes, there will be extra bytes allocated that are not used, resulting in early reallocation as the Vec grows ( assuming it grows one element at a time ).

So it causes extra allocations and uses more memory. Of course the overall effect is probably quite insignificant, but Rust generally tries to be as efficient as possible.

It doubles the capacity partly because it cannot (currently) rely on the allocator returning the amount available. Although for an allocator that only allocates the exact amount requested, that could cause reallocation every time an element is added, which would be very undesirable. But I think it could use a smaller muliplication factor, say 1.5, and rely on most allocators allocating only in powers of two. But that doesn't quite work currently (for all cases). If the allocator actually (to trade speed for less memory usage) allocates less, the allocator's preference will be respected (more). It is certainly quite a complex situation.

Oh, also when Vec shows a certain capacity, that is not necessarily the amount of memory allocated. I think in both cases the amount allocated is doubling 64, 128, 256... bytes, the difference is that Vec increases the allocation earlier.

The element size is 12 bytes ( 3 x i32s ).

4 elements requires 48 bytes, but that gets rounded up to 64 bytes, meaning there is space for an extra element.

10 elements needs 120 bytes, which is the next step up ( fits in 128 bytes ).

20 elements needs 240 bytes, and now there is space for 21. Etc.

I assume that the fixed-size buckets don’t need per-allocation footers or headers (since their most common purpose is to indicate the allocation size). It’s fixed-size power-of-two buckets with footers or headers that triggers the bad behavior.

That is correct. But still, the current situation is (slightly) sub-optimal. Also being able to depend on the true allocation being returned enables new possibilities (involving allocating less memory at the cost of more resizing, a typical trade-off ). But that is simply an argument for stabilising the Allocator trait ASAP, which is certainly desirable.

Turns out glibc malloc does, though other modern allocators don't. I'd be curious to see exactly how much of the performance difference comes from that vs. the other performance downsides attendant with headers.

What new possibilities? In any scenario where performance is critical enough to be impacted by this, you really, really shouldn't be relying on implementation details like these. You should be preallocating with known capacities.

To be clear, I'm not contending that the status quo doesn't waste space in the worst case. However, I am arguing that this isn't a real issue, and that changing it could have unintended performance regressions. If your types are large enough, having to copy a few more over could have real performance implications. I don't think that's a real issue either (again, preallocate), but it shows that the difference isn't an objective improvement, but rather a tradeoff. One that would need real benchmarks and real applications to properly evaluate.

I just considered the case where the element size is 9 bytes ( this would be a bit unusual, as typically there will be something with alignment of at least 2 or 4, but it is certainly possible ).

Cap 4 => 36 bytes. 64 bytes allocated, 28 bytes spare, enough for 3 extra, a total of 7.

Cap 8 = 72 bytes, 128 bytes allocated, 50 bytes spare, enough for 4 extra, a total of 12.

Vec will allocate 4 initially, and 8 when 5th element is pushed. It will allocate a 3rd time when 9th element is pushed.

VecA will allocate 4 initially, but gets 7.

It will increase the allocation when the 8th element is pushed.

Currently it asks for 7 * 2 = 14, which is a "mistake" as only 12 are available in 128 bytes.

So a better idea is to ask for less, say 1.5 times the current capacity, in this case 10, it will get 12.

Then it will allocate a 3rd time when 13th element is pushed.

I am considering amending the code to use 1.5 as the "capacity multiplier" rather than 2.

[ Edit: oh, the above is all nonsense, my arithmetic is wrong, 128 / 9 = 14 ! There is no"mistake, and never can be, doubling the capacity will never require MORE than twice the current allocation. ]

[ Edit 2: I think the idea of using a 1.5 times multiplier is still reasonable, if the allocator has smaller size class jumps, this will use them. This defers that decision to the allocator. If it doesn't, the result is the same as using 2. Still, I think I will leave it at 2. ]

Yes, and again, this might be a good thing in certain scenarios. You have fewer elements to copy to the new allocation, since you reallocated sooner. In fact, as you find ways to increase the wasted space, you (by definition) find places where CPU time was saved by minimizing copies. For some applications, it's really bad to waste that kind of space; for other applications, it's worse to waste that kind of time.

This argument doesn’t convince me. At best, for amortized O(1) push operations, you’re just shifting the latency spikes; no change in average performance. (Actually, possibly even a marginal improvement in average performance, due to marginally less frequent reallocations.) Sure, technically, any changes in the precise timing of higher-latency reallocations could harm something’s performance, but software that cares a lot about the latency of reallocation should presumably be managing it manually.