I am reading/streaming a large csv input file and want to count to frequency of each word in it. To do that, my idea is to use a hashmap mapping X
to its frequency u32
, where X
is either a) &str
or b) String
. I would like this to be as efficient and fast as possible.
If go with a) &str
, I get a lifetime problem because the hashmap outlives the borrowed string (which was tied to the lifetime of the buffer that was just read in).
let frequency: HashMap<&str, u32> = HashMap::new();
for word in word_iterator { // word is a &str
frequency.entry(word).or_insert(0) += 1; // word does not live long enough
}
println!("{:?}", frequency);
Ideally, I would like to do the following. If frequency already contains the particular key word
, just increase the count. If not, somehow create a new String
by copying word
, but assigning it the lifetime of frequency
, so that we can place its borrowed value into frequency. Is that possible?
If I go with b) String
, I get loads of unnecessary copies using the entry API
let frequency: HashMap<String, u32> = HashMap::new();
for word in word_iterator { // word is a &str
frequency.entry(word.to_string()).or_insert(0) += 1; // every word is copied
}
println!("{:?}", frequency);
or I perform two searches using match (this does not even pass the borrow checker)
let frequency: HashMap<String, u32> = HashMap::new();
for word in word_iterator { // word is a &str
match frequency.get_mut(word) { // first search
Some(v) => *v += 1,
None => word_map.insert(word.to_string(), 1), // second search
}
}
println!("{:?}", frequency);
I would think that this solution is the best out of the three available, if it passed the borrow checker. But not only does it not, it is not optimal from a performance standpoint, either.
So instead I tried combining the approaches. String
is great because frequency
can claim full ownership over its keys and &str
is great because we can do fast compares/lookups. What both have in common is the fact that you can hash both and get the same result. My idea is to have the u64
hash be the key and the tuple (String, u32)
be the value.
let frequency: HashMap<u64, (String, u32)> = HashMap::new();
for word in word_iterator { // word is a &str
frequency.entry(hash(&word)).or_insert((word.to_string(), 0)).1 += 1;
}
println!("{:?}", frequency);
...
fn hash<T: Hash>(t: &T) -> u64 {
let mut s = SipHasher::new();
t.hash(&mut s);
s.finish()
}
This seems to have all of the advantages, but now I worry about two things. First of all, complexity. We need to remember to use .0
and .1
instead of meaningful names. That sounds like a recipe for disaster when revisiting the code in a few months. Secondly, this does not take care of hash collisions in any way, so it might not actually produce the correct result.
Have any of you run into similar challenges? Any ideas or input on which is the fastest/most efficient solution, while being correct and maintainable?