Highly concurrent insert-only hashmap

zeroexcuses · February 24, 2019, 8:10pm

I have a data structure that looks something like this:


pub struct CHMap {
    next_id: u32,
    map: HashMap<String, u32>
}

impl CHMap {
    pub fn get_id(&mut self, s: &str) -> u32 {
        if let Some(idx) = self.map.get(s) {
            return *idx;
        }
        let nid = self.next_id;
        self.next_id = self.next_id + 1;
        self.map.insert(s.clone(), nid);
        return nid
    }
}

It allows me to encode strings to u32s. Don't worry about u32. Assume we have less than 2^32-1 unique strings.
Now, I'm going to be processing ~100GB of text, so I would prefer to run through the text only once. I'm also going to be doing this with 24 threads.
Question: What is the best way to make this CMap concurrent? I don't want all 24 threads fighting over one lock for every string lookup.

I also don't care about th precise assignment of String -> u32 as long as each string gets a different u32, and same string gets same u32.

Intuitively, it seems there should be a "two level" solution. We have a read only "slightly out out of date cache", along with a mutable CMap.

Each thread tires to lookup in the slightly out of date cache first. If it finds it there, it's good. If not, it has to fight over CMap for the lock. Occasionally, CMap updates the "slightly out of date cache."

I'm sure this is a common problem and I'm not the first one to deal with it. What is the standard practice for solving this problem?

zeroexcuses · February 24, 2019, 8:12pm

Each thread only does:

read contents from disk (fast)
conversion f String -> u32 (bottleneck)

so the faster we get this String -> u32, the better

zeroexcuses · February 24, 2019, 8:18pm

Something else I want to stress: we never update the value of any entry. We do lookups. We do inserts for non-existent strings. However, once we assign a u32 to a string, we never change it.. I feel like we should somehow be able to exploit this fact in caching to make things run faster.

kornel · February 24, 2019, 9:19pm

HashMap internally does update buckets on inserts, so it can't be trivially concurrent (something map-like could be concurrent, but not this particular implementation).

You could have one HashMap per thread if you can partition your data or pay cost of N times more lookups.

Have you tried the two primary ways to optimize HashMaps in Rust?

Pre-allocate space with reserve() or with_capacity(), so that you don't pay cost of resizing
Use a faster hasher. The default hash is protecting you against DoS attacks, but if the strings you're inserting are not controlled by an attacker, you don't need that and can use a faster hasher.

zeroexcuses · February 24, 2019, 10:32pm

I'm not convinced we are worrying about the same thing.

I don't think the cost of resizing is a big deal. It's the cost of locknig / unlcoking on every get_id call that seems expensive.
I'm not sure how to test if this is indeed the bottleneck.
Splitting words across threads is unlikely to work, as different threads may have ot process the same word, and both threads need to agree on the same id.

OptimisticPeach · February 24, 2019, 10:54pm

//your other code
let times = Vec::<Duration>::new();
//initialize the parallelism here
//and insert this into it
let start = Instant::now();
let lock = data.lock().unwrap();
let end = Instant::now();
times.lock().unwrap().push(end.duration_since());
//Other code

This won't be 100% accurate as there's another lock in there, but that's about as good as I can come up with on the fly.

kornel · February 24, 2019, 11:26pm

You guess you could roll your own HashMap with a lock per bucket:

Vec<RwLock<Vec<(String, u32)>>>

and to perform a lookup:

map[key.hash() % map.len()].read().unwrap().find(|(k,_)| k == key);

This way lookups would be mostly uncontended, and even multiple reads could coexist.

Topic		Replies	Views
Need a maintained, stable concurrent / lockless hashmap help	5	451	February 12, 2023
Concurrency: Designing a TaintableMap help	5	613	September 3, 2020
Locking only one entry in HashMap help	5	1073	February 22, 2023
Lockfree map example help	3	524	June 16, 2023
Concurrent Persistent Hashmap	7	2557	January 12, 2023

Highly concurrent insert-only hashmap

Related Topics