Highly concurrent insert-only hashmap

  1. I have a data structure that looks something like this:

pub struct CHMap {
    next_id: u32,
    map: HashMap<String, u32>
}

impl CHMap {
    pub fn get_id(&mut self, s: &str) -> u32 {
        if let Some(idx) = self.map.get(s) {
            return *idx;
        }
        let nid = self.next_id;
        self.next_id = self.next_id + 1;
        self.map.insert(s.clone(), nid);
        return nid
    }
}
  1. It allows me to encode strings to u32s. Don't worry about u32. Assume we have less than 2^32-1 unique strings.

  2. Now, I'm going to be processing ~100GB of text, so I would prefer to run through the text only once. I'm also going to be doing this with 24 threads.

  3. Question: What is the best way to make this CMap concurrent? I don't want all 24 threads fighting over one lock for every string lookup.

I also don't care about th precise assignment of String -> u32 as long as each string gets a different u32, and same string gets same u32.

Intuitively, it seems there should be a "two level" solution. We have a read only "slightly out out of date cache", along with a mutable CMap.

Each thread tires to lookup in the slightly out of date cache first. If it finds it there, it's good. If not, it has to fight over CMap for the lock. Occasionally, CMap updates the "slightly out of date cache."

I'm sure this is a common problem and I'm not the first one to deal with it. What is the standard practice for solving this problem?

Each thread only does:

  1. read contents from disk (fast)
  2. conversion f String -> u32 (bottleneck)

so the faster we get this String -> u32, the better

Something else I want to stress: we never update the value of any entry. We do lookups. We do inserts for non-existent strings. However, once we assign a u32 to a string, we never change it.. I feel like we should somehow be able to exploit this fact in caching to make things run faster.

HashMap internally does update buckets on inserts, so it can't be trivially concurrent (something map-like could be concurrent, but not this particular implementation).

You could have one HashMap per thread if you can partition your data or pay cost of N times more lookups.

Have you tried the two primary ways to optimize HashMaps in Rust?

  1. Pre-allocate space with reserve() or with_capacity(), so that you don't pay cost of resizing

  2. Use a faster hasher. The default hash is protecting you against DoS attacks, but if the strings you're inserting are not controlled by an attacker, you don't need that and can use a faster hasher.

I'm not convinced we are worrying about the same thing.

  1. I don't think the cost of resizing is a big deal. It's the cost of locknig / unlcoking on every get_id call that seems expensive.

  2. I'm not sure how to test if this is indeed the bottleneck.

  3. Splitting words across threads is unlikely to work, as different threads may have ot process the same word, and both threads need to agree on the same id.

//your other code
let times = Vec::<Duration>::new();
//initialize the parallelism here
//and insert this into it
let start = Instant::now();
let lock = data.lock().unwrap();
let end = Instant::now();
times.lock().unwrap().push(end.duration_since());
//Other code

This won't be 100% accurate as there's another lock in there, but that's about as good as I can come up with on the fly.

You guess you could roll your own HashMap with a lock per bucket:

Vec<RwLock<Vec<(String, u32)>>>

and to perform a lookup:

map[key.hash() % map.len()].read().unwrap().find(|(k,_)| k == key);

This way lookups would be mostly uncontended, and even multiple reads could coexist.