Typesafe heterogeneous Vec container


#1

I would like to create a container for a mapping from keys that have an implicit associated value type T to objects of type Vec<T>. I have successfully created such a container for a mapping of such keys to objects of type T:
https://play.rust-lang.org/?gist=4cc003f82a2d53a60a1bae1436c84ad5&version=stable&mode=debug

So, to be specific, I would like to create a struct with methods like:

set_value<K: Key>(&mut self, index: usize, value: K::Value)

get_value<K: Key>(&self, index: usize) -> Option<&K::Value>

that would set/get values in slot ‘index’ of the associated Vec<K::Value> (assuming such a slot exists).

I’d also appreciate any suggestions for making this more idiomatic if the construction I’m proposing is not ideal. Thank you!

Edit: Further information - I have tried variations on what I assumed would be the natural extension from the gist above - to replace the map in Container with a HashMap<TypeId, Vec<_>> and trying various things in place of _. But, none of the things I have considered and attempted work (generally they won’t compile). I explicitly do not want to use a Boxed type inside of the Vec because of the overhead cost (this could have hundreds of millions of indexes, for example).


How to implement a hashmap of vectors of any type?
#2

You can make Container<'a> store references inside a Vec: Vec<&'a Any>. This means that the references it stores must outlive the container. Whether this will work or not depends on your usecases. If you want the container to own the trait objects, then you need the Box.


#3

Thank you for the suggestion that I hadn’t specifically considered. Having long(er) lifetimes wouldn’t be an insurmountable issue for my use case I believe (though I would prefer a situation where the Vec will own the values), but this will still create more overhead than I’m aiming for. For example, using the notation from the gist, in the entry with KeyTwo I would like to store a Vec<bool>, and with KeyOne I would like to store a Vec<usize>. If I used references everywhere I would incur a usize (32/64 bit) cost per entry even for a long vector of boolean values. It seemed like this should theoretically be possible, looking at the actual data structures involved (a Vec seems to just be a pointer and two size integral values, and so in a sense is implicitly boxing the data it points to). But, maybe I’m missing something important.

In general I’m more interested in storing primitive values than complex structs in each Vec, for what it’s worth, but I was trying to accomplish it all generically.


#4

It’s storing the data on the heap, but the values aren’t boxed individually because they’re all homogeneous (same type, same size, same alignment, same drop glue if any, etc). If you want to store heterogeneous data, then each value on the heap needs to be a fat pointer itself, which consists of the ptr to the data and another ptr to the vtbl. So there’s really no way around that in Rust.

You can consider using an enum that captures the different types you store, rather than trait objects. You’ll still pay the cost of a discriminant but you won’t need a box.

Alternatively, split the storage across multiple Vecs, one per specific integral type.


#5

This is what I am trying to do, I believe. So, for KeyTwo, in the container map I would want to associate a Vec<bool>, i.e. every entry in the vector would be of the same boolean type. For KeyOne, in the container map I would want to associate a Vec<usize>, and every entry in that vector would be of the same integral type. So the heterogeneity is only across different key-value entries in the map, and not within any single Vec.

Maybe there is a basic solution I have missed - is there anything that could go in place of the ? here that would reflect that the values in the HashMap are vectors? (This assumes the gist code above)

let mut map: HashMap<TypeId, ?> = HashMap::new();

let mut x = Vec::new();
x.push(1);
x.push(2);
map.insert(TypeId::of::<KeyOne>, x);

let mut y = Vec::new();
y.push(false);
y.push(true);
map.insert(TypeId::of::<KeyTwo>, y);

In general, the container could have any number of keys each with different corresponding vectors, of course. But, within each vector the data would all be of a homogeneous type (specified by the key).


#6

There’s no way to express (without type erasure) what you’re trying to do because it requires dynamic (runtime) information, whereas Rust generics want static (compile time) information.

One way you can model this is via an enum:

enum Key {
    One(Vec<i32>),
    Two(Vec<bool>),
    ...
}

Then you can simply create a value with any of these variants, and it’ll require that you pass in the right value type. This is like a “single key per type” map - when you pass a Key to some function, e.g., then it’ll need to match on it and look for variants of interest.

Otherwise, since KeyOne and KeyTwo are different types (in your example), you can’t store them and their values (assuming they’re different, like in your case) in a homogeneous generic struct.

Maybe you can expand a bit on the bigger picture of what you’re trying to do - there may be a better representation/alternative.


#7

Thank you for your patience and continued willingness to provide suggestions. I will think about the enum possibility. Here is the bigger picture:

I am trying to create a toolkit for supporting large-scale agent-based simulation (100s of millions of agents at least and potentially billions), inspired by an existing piece of work of mine in Java (and as you have noticed, my design pattern does indeed reflect that type erasure paradigm). Users of the toolkit assemble their models as a series of interacting components that can make plans (ask to be notified at some future point in time to take action), register to observe state changes, and mutate the simulation environment through an API. The primary simulation use case is modeling the spread and control of transmissible diseases in populations.

The container in this question is intended to hold properties about agents in the simulation. The specific list of properties (and their associated value types) is model-dependent and programmer-specified, but I was aiming for it to be specified at compile time and not run time. So, every agent might have multiple properties reflecting things like their age group, vaccination status, social group membership, location, etc. Given the scale of the simulation it is essential that overhead for storing this data be minimized.

In my existing design I store parameter values as arrays of the appropriate types with each array index corresponding to an individual agent in the simulation. The arrays are associated to keys with corresponding types in a typesafe heterogeneous container as in my gist. Parameter values can be get/set through the simulation API with a function signature that references the parameter name (Key object above), individual agent id, and when applicable an appropriate new value for that parameter.


#8

Thanks for the background. Sounds like an interesting and useful project!

So here’s an idea, which is probably closer to what you had in mind to begin with. The (type erased) map will store the various agent properties - those will be the keys. The values are also type erased, but wrap a strongly typed Vec. There will be an extra allocation per Vec, but total # of these allocations will be equal to the union of all attributes across all models. The actual values are stored unboxed. IIUC, the # of parameters shouldn’t be too many, certainly a lot less than values per agent. There’s a bit of memory indirection getting to the data because access goes through the Box first, but maybe you can amortize that out or it won’t be an issue at all in practice.

To that end, here is a playground gist of the above. I reused your initial playground because I’m on mobile and it was easier to start with that. The API would likely be a bit different, but the storage and retrieval idea should hopefully be clear.


#9

Also, given your footprint concerns, you may want to consider value interning as there’re likely to be agents with duplicate values for some of the attributes (I’m guessing based on the background you provided). Any value whose size is > usize may be better shared by agents, by having them maintain an index into a, eg, Vec that holds the value.


#10

Why are you avoiding heap allocation? That is the simple solution here, and until you’ve profiled and proven it won’t work you should go with the simple solution. Also, you say you’re moving from Java; Java is allocating all its objects on the heap.

Obviously you know your problem better than I do, and so you should do what you need, but I see a bunch of knee-jerk ‘avoid box’ and it confuses me. Box is not evil. Everything in Java is boxed and that works fine for most people.


#11

I’m not sure why you’re replying to me nor whether you read the thread. @jasonasher is exploring the design/impl space, and he said why he wants to reduce footprint - it’s all spelled out above.


#12

Apologies for replying to the wrong person. I’m on mobile at the moment and just responded to the bottom of the thread. I certainly didn’t mean to be negative about the helpful replies. Sorry.

I can also see exploring the design space as an ok thing. But the edit to the original post ruling out boxing because of the perceived overhead smells of premature optimisation to me. Of course, it’s their project. I’ll be quiet now.


#13

Ah ok, no worries then - I thought you were somehow addressing my reply and I was confused because “why not box” was discussed earlier.

There’s elaboration later in the thread on why he wants to avoid boxing. IMO, it’s totally the right thing to pursue given the background and description of what he’s doing. I don’t think it’s premature, certainly not to explore other/better alternatives. And boxing this type and amount of data in Java will kill performance there too :slight_smile:

No need to be quiet, of course - but let’s try to have a productive brainstorm.


#14

Since you mention interning, I’ll just plug my internment crate, which makes interning as easy as boxing a value. :slight_smile:


#15

Good plug! :slight_smile:

I’m not sure if @jasonasher will actually have a need for it because he said it’s primarily primitives being stored. But, it’s a possible impl consideration for any stray types that are larger than a ptr so I threw it out there.


#16

This looks very promising, and I really appreciate your help. I’ll give it a shot and let you know what I come up with!


#17

I am not aiming to avoid all boxing, I just want to avoid boxing every individual primitive value in a vector when this will occur billions of times. In the Java version I have to employ the same approach - using multiple arrays of primitives instead of hundreds of millions of individual agent objects. But, if I’m missing something I welcome your input.