Hi everyone!
I'm working on my first program. Part of its job, as far as I've got, is to produce a vector of hashmaps (Vec<HashMap<&[u8], usize>>), which I call hash_vec in the code sample below.
The program then merges these hashmaps into a single hashmap and returns those key/value pairs. I want to find the best way to merge the hashmaps in this vector into a single hashmap that contains every unique key from the original hashmaps, with the values for each key summed. Other questions and answers, such as this one, have helped me get the program working by doing this:
// this compiles and does the job
hash_vec.into_iter().for_each(|hashmap| {
    for (key, value) in hashmap {
        if final_hash.contains_key(key) {
            final_hash.insert(key, value + final_hash[key]);
        } else {
            final_hash.insert(key, value);
        }
    }
});
I've learned that I could try the rustc-hash or fnv crates for faster hashers, and that the entry API would speed up my current merging step a bit:
hash_map.entry(key).and_modify(|v| *v += value).or_insert(value);
In an earlier question I posted about this project, helpful people commented on the IO-boundedness of what I'm trying to do. I don't think I've yet learned enough to fully appreciate what these knowledgeable folks were pointing out.
Now I'm trying to work out how to speed this up, given that a typical data sample in my case produces a vector of over 100 hashmaps. My best idea so far is to iterate through the vector in chunks of 2, merging pairs of hashmaps in a merge-sort-like fashion until a single hashmap remains, containing every unique key and the total count of its appearances in the data.
But is there some way to do this merging on the fly, instead of collecting the hashmaps into a vector in the first place?
The whole project at its current stage is here. Thanks for any help!