Is there a way of limiting how much a HashMap or BTreeMap can grow due to repetitive calls to insert?
I'm implementing a sparse histogram based on BTreeMap that will run on a device with about 1 GiB of available memory, and I'd like to prevent a bad configuration or unexpected data from growing the histogram so much that it hard-crashes the device.
I've thought of using new_in and providing a custom allocator that would not allocate more than 1 GiB, but that would require nightly. And if I understand correctly, all the allocator can do is return AllocError when the BTreeMap tries to grow, at which point insert will panic and crash the device anyway.
I've also thought of checking the len of the map before every insert and returning an error if it reaches some threshold, but if possible I'd like to avoid unnecessary checks in a performance-critical part of the code. Ideally I'd like to check only when the map tries to grow.
insert already does a len < capacity check to know whether it needs to grow, so doing the same kind of check yourself should have negligible performance impact.
For HashMap, capacity and try_reserve give you fallible allocation on stable. You'll still have to check occasionally, of course, but if you know how much capacity you need before entering some hot loop, say, you can do the check up front.
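For example, a minimal sketch of that pattern (the function name and types here are just illustrative):

```rust
use std::collections::{HashMap, TryReserveError};

/// Count occurrences of each value, failing gracefully instead of
/// aborting if the map cannot grow enough to hold them all.
fn count_values(map: &mut HashMap<i64, u32>, values: &[i64]) -> Result<(), TryReserveError> {
    // Reserve for the worst case (every value is a new key) before the hot loop.
    map.try_reserve(values.len())?;
    for &v in values {
        *map.entry(v).or_insert(0) += 1;
    }
    Ok(())
}
```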
You could make a HashMapExt trait to give yourself checked_insert or whatever to make it more ergonomic.
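A rough sketch of what such a trait could look like (names are illustrative, and the cap is passed in explicitly here):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical extension trait; `checked_insert` refuses to add *new*
/// keys once the map already holds `cap` entries.
trait HashMapExt<K, V> {
    fn checked_insert(&mut self, key: K, value: V, cap: usize) -> Result<Option<V>, (K, V)>;
}

impl<K: Eq + Hash, V> HashMapExt<K, V> for HashMap<K, V> {
    fn checked_insert(&mut self, key: K, value: V, cap: usize) -> Result<Option<V>, (K, V)> {
        // Updating an existing key never grows the map, so it is always allowed.
        if self.len() < cap || self.contains_key(&key) {
            Ok(self.insert(key, value))
        } else {
            Err((key, value))
        }
    }
}
```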
If you have a known upper limit of memory you allow your Rust process to use, you can make that a hard limit with this:
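For instance, a capping global allocator can enforce such a limit; here is a minimal sketch using the `cap` crate (just one way to do it, with the 1 GiB figure from the question):

```rust
use cap::Cap;
use std::alloc;

// All allocations in the process go through the capped allocator;
// anything that would push total usage past 1 GiB fails to allocate.
#[global_allocator]
static ALLOCATOR: Cap<alloc::System> = Cap::new(alloc::System, 1024 * 1024 * 1024);
```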
This may be better than leaving it up to the operating system, because swap and various virtual memory tricks may mean your process won't actually run out of memory before causing other problems.
But how are you planning to limit the size without comparing it to the limit? That check is pretty much unavoidable, and a custom allocator would have to do the same thing at some point too. At most you wouldn't "feel" that it's being done directly, but the check has to be there somewhere, since it's exactly what you're trying to achieve.
The performance impact would very likely be negligible anyway: it's an integer comparison followed by a very predictable branch. A lot happens in a hash map insertion that could easily dominate such a trivial operation: computing the hash, probing the candidate slots of the map, or allocating memory if the map needs to grow (and copying over the entire map).
I wouldn't worry about the speed of the check. You could probably wrap the hash map in a newtype and provide a try_insert() method to ensure that unchecked HashMap::insert() is never called.
My first try after reading your answers was to just check the len of the HashMap before every call to insert. Benchmarking with criterion showed that the "capped" version of my histogram was about 5% slower than the uncapped version, which is maybe not bad.
I then switched to performing the check only if entry returned Entry::Vacant, and then there was indeed no significant difference!
For the record, here's a rather simplified version of the code I ended up with:
```rust
use std::collections::hash_map::{Entry, HashMap};

pub struct CappedSparseHist {
    buf: HashMap<i64, u32>,
    cap: usize,
}

impl CappedSparseHist {
    /// Increment the counter for `value`'s bin. Returns `false` if the
    /// bin does not exist yet and the histogram is already at capacity.
    pub fn increment(&mut self, value: i64) -> bool {
        let len = self.buf.len();
        match self.buf.entry(value) {
            // Existing bin: incrementing never allocates, so no check needed.
            Entry::Occupied(mut entry) => {
                *entry.get_mut() += 1;
                true
            }
            // New bin: only insert if we are still below the cap.
            Entry::Vacant(entry) => {
                if len == self.cap {
                    false
                } else {
                    entry.insert(1);
                    true
                }
            }
        }
    }
}
```
I'm still undecided whether increment should return a bool, a Result or an Option... I'll see how everything fits together in the final application.
That is interesting, thanks for the tip! I don't think it fits my current problem very well as it would still crash the program if HashMap tries to allocate too much memory. But I think I have use for it in other cases.
As an aside, at first I thought I wanted to use a BTreeMap because in the end I want the bins of the histogram (the keys) to be sorted. However, inserting into a BTreeMap turned out to be significantly slower than inserting into a HashMap! That probably makes a lot of sense, but the extent of the difference came as a surprise to me. So I'm now using HashMap and sorting the result at the end when I need it (and when performance is no longer so critical).
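For instance, the read-out could be something like this (a sketch building on the CappedSparseHist snippet above; the method name is just illustrative):

```rust
impl CappedSparseHist {
    /// Read out the histogram as (bin, count) pairs sorted by bin.
    /// Only called when the results are needed, off the hot path.
    pub fn sorted_bins(&self) -> Vec<(i64, u32)> {
        let mut bins: Vec<(i64, u32)> = self.buf.iter().map(|(&k, &v)| (k, v)).collect();
        bins.sort_unstable_by_key(|&(bin, _)| bin);
        bins
    }
}
```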
I might also experiment with replacing HashMap with IntMap since my keys are i64s.
When your keys are integers up to a certain point, you could also consider using a Vec where the index is the key. You could allocate a Vec of the size you need and pass it around as &mut [i64], which allows mutating the elements but not changing the length.
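For example, a dense variant could look roughly like this (the types and the choice to reject out-of-range bins are just one possibility):

```rust
/// Dense variant: one counter per bin, with the bin number as the index.
/// The slice length is fixed up front, so memory use is bounded by construction.
fn increment_dense(counts: &mut [u32], bin: usize) -> bool {
    if let Some(c) = counts.get_mut(bin) {
        *c += 1;
        true
    } else {
        // Out-of-range bin: reject instead of growing the buffer.
        false
    }
}
```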
I do use a Vec<u32> for a contiguous histogram, and it's really fast. By contiguous I mean that given a bin size I have counters for all possible bins in a certain range.
In this case, however, I want a sparse histogram where I only allocate counters for non-empty bins. It's slower, but much more memory-efficient when I know data will be clumped in one or more small regions (so I need a small bin size) but I don't know where those regions will be centered. A sparse implementation is also useful when the histogram is multidimensional (I need up to 4D), as allocating a Vec of N^4 mostly-empty bins quickly gets out of hand on an embedded device.