Help needed merging multiple hashmaps while updating/summing numerical values

Hi everyone!

I am working on a first program that as part of its job -- as far as I've got -- produces a vector of hashmaps ( Vec<HashMap<&[u8], usize>> ), which I call hash_vec in the code sample below.

The program then does the job of merging these hashmaps into a single hashmap and returns an output of those pairs.

I want to find the best way to merge the hashmaps in this vector, making a single hashmap that contains all the unique keys from all the hashmaps I start with, while adding together their values. Other questions and answers, such as this one, have helped me make the program work by doing this:

// this compiles and does the job

hash_vec.into_iter().for_each(|hashmap| {
    for (key, value) in hashmap {
        if final_hash.contains_key(key) {
            final_hash.insert(key, value + final_hash[key]);
        } else {
            final_hash.insert(key, value);
        }
    }
});

I've learned that I can try rustc-hash or fnv (crates) for faster hashers, and that this would speed my current merging solution up a bit:

hash_map.entry(key).and_modify(|v| *v += value).or_insert(value);

In an earlier question I posted about this project helpful people commented on the IO-boundedness of what I'm trying to do. I still haven't learned whatever it is I need to know in order to fully appreciate what these knowledgable folks are pointing out, I think.

Now I'm trying to think about how to speed this up, given a typical data sample in my case produces a vector with over 100 hashmaps. My best idea so far is to iterate through my vector of hashmaps in chunks of 2, and merge the hashmaps in a kind of merge sort algorithm that finally produces a single hashmap of all the unique keys and the sum of their appearance in the data.

But is there not some way to do this merging somehow on the fly instead of collecting the hashmaps into a vector in the first place?

The whole project at its current stage is here. Thanks for any help!

A couple of things to try

Where you use par_iter(), instead of using map try using fold then reduce. With rayon, fold will let you merge the data into HashMaps in parallel. Then reduce will take those maps and let you merge them into a single one.

If you want to avoid the Vec altogether, you should be able to call par_bridge directly on the records() result instead of calling collect (then call fold and reduce). par_bridge creates a parallel iterator from a regular iterator.

1 Like

Thanks very much for your suggestions.

I just looked at what @drewkett suggested after doing quite a bit of trying to make fold and reduce work separately because I didn't grasp what you were saying at first I now realize. NOW I get an idea of first using fold, then using reduce.

The code below compiles and seems to be doing the job. I'd love to get any comments on how I'm using fold and reduce here because I'm only just getting how I did this. The big challenge for someone relatively new to coding like me is figuring out documentation for oneself.

Thanks again to @drewkett for their helpful suggestions!

I came up with a mini version of what I'm trying to do (there's pretty bad style in here that cargo clippy picks up but it's just rough):

use std::collections::HashMap;
use reduce::Reduce;
use rayon::prelude::*;
// ...
fn main() {
    let h: HashMap<&str, usize> = [("ACATG", 1), ("CATAG", 1), ("AAATT", 1)].iter().cloned().collect();
    let i: HashMap<&str, usize> = [("ACATG", 2), ("CATAG", 2), ("AAATT", 2)].iter().cloned().collect();
    let j: HashMap<&str, usize> = [("ACATG", 3), ("CATAG", 3), ("AAATT", 3)].iter().cloned().collect();
    let k: HashMap<&str, usize> = [("ACATG", 4), ("CATAG", 4), ("AAATT", 4)].iter().cloned().collect();
    
    let mut v: Vec<HashMap<&str, usize>> = Vec::new();

    v.push(h);
    v.push(i);
    v.push(j);
    v.push(k);

    let hbomb = v
	.par_iter()
	.fold(||HashMap::new(), |mut a: HashMap<&str, usize>, b| {
	    eprintln!("initial a: {:?}", &a);
	    eprintln!("initial b: {:?}", &b);
	   
	    a.extend(b.into_iter());
	    
	    eprintln!("a: {:?}", &a);
	    eprintln!("b: {:?}", &b);
	    a
	}).reduce(||HashMap::new(),
		  |mut a, b| {
		      eprintln!("initial red a: {:?}", &a);
		      eprintln!("initial red b: {:?}", &b);
		      for (k, v) in b {
			  if a.contains_key(k) {
			      a.insert(k, v + a[k]);
			  } else {
			      a.insert(k, v);
			  }
		      }
		      eprintln!("final a: {:?}", &a);
		      a
		  });
    
    eprintln!("{:?}", hbomb);
}
// Output:
// initial a: {}
// initial b: {"ACATG": 3, "CATAG": 3, "AAATT": 3}
// a: {"ACATG": 3, "CATAG": 3, "AAATT": 3}
// b: {"ACATG": 3, "CATAG": 3, "AAATT": 3}
// initial red a: {}
// initial red b: {"ACATG": 3, "CATAG": 3, "AAATT": 3}
// final a: {"CATAG": 3, "ACATG": 3, "AAATT": 3}
// initial a: {}
// initial b: {"AAATT": 1, "CATAG": 1, "ACATG": 1}
// a: {"AAATT": 1, "CATAG": 1, "ACATG": 1}
// b: {"AAATT": 1, "CATAG": 1, "ACATG": 1}
// initial red a: {}
// initial red b: {"AAATT": 1, "CATAG": 1, "ACATG": 1}
// final a: {"AAATT": 1, "ACATG": 1, "CATAG": 1}
// initial a: {}
// initial b: {"ACATG": 2, "AAATT": 2, "CATAG": 2}
// initial a: {}
// initial b: {"CATAG": 4, "ACATG": 4, "AAATT": 4}
// a: {"AAATT": 4, "ACATG": 4, "CATAG": 4}
// b: {"CATAG": 4, "ACATG": 4, "AAATT": 4}
// initial red a: {}
// initial red b: {"AAATT": 4, "ACATG": 4, "CATAG": 4}
// final a: {"CATAG": 4, "ACATG": 4, "AAATT": 4}
// initial red a: {"CATAG": 3, "ACATG": 3, "AAATT": 3}
// initial red b: {"CATAG": 4, "ACATG": 4, "AAATT": 4}
// final a: {"CATAG": 7, "ACATG": 7, "AAATT": 7}
// a: {"AAATT": 2, "CATAG": 2, "ACATG": 2}
// b: {"ACATG": 2, "AAATT": 2, "CATAG": 2}
// initial red a: {}
// initial red b: {"AAATT": 2, "CATAG": 2, "ACATG": 2}
// final a: {"AAATT": 2, "CATAG": 2, "ACATG": 2}
// initial red a: {"AAATT": 1, "ACATG": 1, "CATAG": 1}
// initial red b: {"AAATT": 2, "CATAG": 2, "ACATG": 2}
// final a: {"AAATT": 3, "ACATG": 3, "CATAG": 3}
// initial red a: {"AAATT": 3, "ACATG": 3, "CATAG": 3}
// initial red b: {"CATAG": 7, "ACATG": 7, "AAATT": 7}
// final a: {"AAATT": 10, "ACATG": 10, "CATAG": 10}
// {"AAATT": 10, "ACATG": 10, "CATAG": 10}   This is what I wanted! Just not sure if I know how I'm getting here... 

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.