Special heap allocator to track data structure memory

I just implemented Disk-based Hash Aggregation for PostgreSQL, which is in C, and I'm wondering how I might have done something similar in Rust. The basic idea is to aggregate groups in a hash table, and when the total size of the hash table and all its contents get to be too big, start writing some data to disk to be processed later. Note: swapping or similar is not effective here -- that would be far less efficient.

The details of that algorithm aren't important, but the key part is "when the hash table and all its contents get too big". In Rust, there's no way to get the size (in bytes of memory) of a HashMap and all its contents (including referenced data that is logically a part of the hash table but allocated separately).

It would be nice if standard data structures could accept some kind of custom heap allocator that would track the space that the structure is really using, and the algorithm could just ask the heap how much memory it's using.

Is there a way to do something like that currently? Does anyone have any suggestions?

I could also imagine a heap being passed around implicitly for every function call. If you don't pass a new one, it inherits the caller's heap. There are probably a lot of problems with that idea, but it seems like some language support might help for dealing with arenas, special allocators, etc.

Regarding the total memory used by the hash table:
Using mem::size_of and HashMap::capacity it should be possible to compute the memory the HashTable uses. This excludes memory from references as you already mentioned.
As far as I know there is no way in the std to recursively get the memory required for a type including its references.
You could implement this yourself, but keep in mind that you will have to do some kind of loop detection if your data structures could contain cycles.

Regarding the custom heap allocator:
I see where you come from and I thought about this too. The only data structure in the std I could find that had that kind of interface is the RawVec (I think that was its name).
But honestly I don't see how this solves your problem, as referenced data would not necessarily be allocated using the same allocator. This introduces some more problems:
Assume you have a some type that contains some heap allocated elements. Currently there is no (simple, efficient) way at runtime to know from which of your hypothetical allocators that allocation comes from, an information that is required when dropping the type.

Walking through the structure is not going to be an efficient way to just ask "is the structure too big yet", so that's not really a great solution.

FWIW, postgres is able to know which "memory context" (the postgres name for an instance of an allocator) allocated a particular chunk because it actually stores a pointer to the context in front of the data. That may not be a cost that Rust wants to pay, but perhaps there are ways of optimizing it for the normal case where it just uses the global allocator.

It seems like there should be some way to do this, and it seems like it could be great in a number of ways. For instance, if you know that a certain structure doesn't free very much, or it is short-lived, you can just use a bump allocator. Postgres has several allocator types to use in different situations: a normal allocator, an allocator optimized for freeing the oldest data first, and an allocator optimized for same-size chunks.

From a purely theoretical point of view it should be possible to do this efficiently.
If for example each allocator has a distinct memory range in which allocated memory is contained, and these could be controlled from the program (this implies that the program tell the os in which range to allocate memory when allocating memory), given carefully selected memory ranges, it would be possible to find out which allocator allocated a specific allocation using more or less a bit shift (this assumes there are only a fixed number of allocators known at compile time)

I don't think it's reasonable to assume that the number of instances of allocators are fixed. There may be a fixed number of types of allocators, but each data structure you create may need its own allocator with its own memory bound.

There may be some way we could tie the allocators to the ownership system, which would be very interesting. We may be able to know everything we need to know at compile time.

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.