Allocators, future


#1

https://doc.rust-lang.org/1.12.0/book/custom-allocators.html

this looks awesome, great to see the ‘size-passed-into-free’.

has anyone considered the option of a hints word passed to the
lowest level of the allocator, e.g:-

  • use hints r.e. threading - thread-local vs inter-thread use
  • use hints r.e. lifetimes - frequent/short vs infrequent/long, grouping ('these things will be alloced/freed together)
  • use hints r.e. device use, GPU-able vs CPU only (see modern gpu unified memory)
  • other possible use cases on NUMA systems (embedded machines with scratchpads? supercomputers?)

maybe something like a general purpose 32bit hints field, which can’t change the semantics , but which can guide choices that could have benefits in these scenarios. the hints could be devised such that passing ‘0’ is a safe assumption for most programs.

in particular the use cases I had in the past involved a big dividing line between allocations r.e. lifetimes - persisting for multiple frames, level load, level load temporaries, intra-frame temporaries, texture-streaming buffer. I’m sure people in other domains will have had different requirements and issues.

IMO allocators per type as seen in C++ can be quite useful, e.g. in the above case you have ‘entities’ , ‘vertex data for the gpu’, ‘texture objects’ which are all distinct types with distinct allocation behaviour ,but you might be able to get better control than C++'s system allows.

( I know that custom buffers owning the objects is another way to control things , arguably better… and more useable at a high level today with ‘emplace back’; again perhaps that could back onto the same underlying hint word passed to the lowest level allocator )

(links from below … scenarios that might demand some unusual treatment memory
PGAS
chinese supercomputer
movidius
nvidia unified memory
SSI
thread local storage
)


#2

I think the “right” way to handle this is by allowing custom allocators in data structures. That way you, as the code author, can pass a tailor made allocator to a given data structure because you know the characteristics of how the objects managed there are (de)allocated.

Generic hint bits seems a bit unprincipled and would also slow down some allocators as they’d need to examine the bits and probably branch on them.


#3

ok so that maps onto my picture of keeping ‘entities’ in an ‘entity buffer’, etc
it also maps onto how we have per-type new/delete overload in C++, e.g. you could have given the ‘entity’ an operator new() that maps onto use of a specific pool. (which could just have been partitioned off a big chunk given by the default)

Generic hint bits seems a bit unprincipled and would also slow down some allocators as they’d need to examine the bits and probably branch on them.

it might seem like bloat but perhaps the unhinted versions remain, and these are assumed to be equivalent to passing a flag word of zero. (ignore the hints or generate by inlining?) That would circumvent any performance hit . but I certainly see what you’re saying about ‘allocators in data structures’.
Actually, might hints starting out as constants, passed as flags , be inlineable anyway (although I agree it might be better to avoid relying on such compiler magic)

I still wonder about the NUMA case, and there’s embedded machines with scratchpads. the chinese supercomputer is out there with such a design… its possible these things will appear for AI work, e.g. see movidius with it’s on chip 2mb area for sharing data between it’s worker cores.

Is it possible this could map quite closely to an idea of hints for lifetime of allocations and use between threads. (see also thread local storage. In the movidius/scratchpad case what you essentially have is permutations of single core use/ inter-core use, and long vs short lifetime. (the movidius case is like ‘shared but short lifetime’ , I see it a bit like an ‘L2 scratchpad’, whilst other scratchpads are more like ‘hardware thread-local storage’)

see also PGAS

Whether it’s embedded or data centres and supercomputers… these use cases do exist… there’s also things like SSI which might demand paging hints

I suppose most people would want the simplest possible management of scratchpad memory, one reasonable scheme is 2 opposing stacks (the callstack/locals grow up, a second allocation stack grows down)… even a stacklike hint bit might work ('these are allocations that will be unwound in a stack like way…) actually dual stacks (one on chip, one off chip) is another idea


#4

(so imagine
default assumption … ‘small size’= more likely to be local, short; ‘large size’=more likely to be global, long.

overide with 4 bits

0000 default guess based on size.
0001 short lifetime
0010 long lifetime
0100 thread local
1000 thread shared

default unmarked case uses 0000
passing a flag - the program may only care about one of these special cases explicitely; if that flag is found, call the alternate version, otherwise fallback to default.
… hope that this codepath inlines, if LLVM can figure out it’s a constant being passed

I note the allocator is using ‘no-mangle’ , c compatible calls; I guess that would rule out using a type-parameter. maybe too weird to define versions with every permuation of bits, but what if only one bit really matters, alloc_local alloc_global etc might be doable?
)


#5

I think placement-new is more relevant? In general, a type shouldn’t know/care where it’s being allocated - that’s decided by caller/user of the type because the right “arena” may depend on where the type is used. For arenas themselves, the allocator itself is important for the raw block storage characteristics (alloc/free behavior, NUMA affinity, etc).

While performance may not be impacted given sufficient inlining, constant folding and DCE, I think hint bits also don’t version well or generalize. There may certainly be some general hints that any allocator may care about, but there will be allocator-specific things that one might need (eg an allocator for a niche platform). Or some hints may only be applicable to a subset of platforms. Now you need to know what allocator you’re passing the hint bits to so that you don’t accidentally give it to another allocator that may misinterpret a bit as meaning something else. And I think similar brittleness and (semantic) type unsafety explodes as you get further into this.

Rust has a very strong type system, and allows for sprinkling types willy-nilly without incurring runtime performance hits (compile time will suffer but that’s a different beast altogether). I think it makes sense to encode functionality and semantics into the allocator types themselves and then use the type system to ensure the right allocators are used where desired.

I’ve not thought this through too much so might be missing something fundamental.


#6

r.e. allocator / types, my view is that a hint on the type can be useful, because sometimes types are specific to certain uses (e.g. in the game engine example, there are ‘entity/actor’ types, and graphics-resource types (vertices etc); these have different allocation and useage characteristics. or in a desktop app you have types representing components of the UI, other types associated with application data, etc

So you could have layers of specialisation…

  • start - defaults, completely general

  • then overide with per type information. (‘this is a graphics type, it’s likely to go in the graphics pools…’)

  • then allow optional override with situation specific information (you’re creating this type in a specific buffer…)

  • Or some hints may only be applicable to a subset of platforms.

if the hints-bits will end up as a compile time constant this might not matter ; i do envisage some subset being important ,and an allocator will really only be dealing with a small number of cases.

of course instead of hints bits we could do this with type-params , but I see that the low level allocator interface is #no_mangle and probably setup to interface directly with C like allocators.

if we had a 32bit flags word, i think 4-8 bits could express enough that is very useful for the existing range of devices (considering the special cases listed above - which is probably a superset of most use cases) - whilst leaving plenty of room for expansion.

I do like the idea of hinting it with type-parameters; what we really want is inlining (the flags never really exist) but for an ‘unmangled C abi interface’ we’d end up with an explosion of allocator interfaces __rust_allocate_L2_scratchpad() __rust_allocate_thread_shared() etc.

i suppose we could just look at something one level up, based on Rust-friendly type-parameters, that dispatches to a known subset of these… maybe calling platform specifics directly… so really we’d be bypassing this interface