Efficient String HashMaps for a frequency count


#1

I am reading/streaming a large csv input file and want to count to frequency of each word in it. To do that, my idea is to use a hashmap mapping X to its frequency u32, where X is either a) &str or b) String. I would like this to be as efficient and fast as possible.

If go with a) &str, I get a lifetime problem because the hashmap outlives the borrowed string (which was tied to the lifetime of the buffer that was just read in).

let frequency: HashMap<&str, u32> = HashMap::new();
for word in word_iterator { // word is a &str
    frequency.entry(word).or_insert(0) += 1; // word does not live long enough
}
println!("{:?}", frequency);

Ideally, I would like to do the following. If frequency already contains the particular key word, just increase the count. If not, somehow create a new String by copying word, but assigning it the lifetime of frequency, so that we can place its borrowed value into frequency. Is that possible?

If I go with b) String, I get loads of unnecessary copies using the entry API

let frequency: HashMap<String, u32> = HashMap::new();
for word in word_iterator { // word is a &str
    frequency.entry(word.to_string()).or_insert(0) += 1; // every word is copied
}
println!("{:?}", frequency);

or I perform two searches using match (this does not even pass the borrow checker)

let frequency: HashMap<String, u32> = HashMap::new();
for word in word_iterator { // word is a &str
    match frequency.get_mut(word) { // first search
        Some(v) => *v += 1,
        None => word_map.insert(word.to_string(), 1), // second search
    }
}
println!("{:?}", frequency);

I would think that this solution is the best out of the three available, if it passed the borrow checker. But not only does it not, it is not optimal from a performance standpoint, either.

So instead I tried combining the approaches. String is great because frequency can claim full ownership over its keys and &str is great because we can do fast compares/lookups. What both have in common is the fact that you can hash both and get the same result. My idea is to have the u64 hash be the key and the tuple (String, u32) be the value.

let frequency: HashMap<u64, (String, u32)> = HashMap::new();
for word in word_iterator { // word is a &str
    frequency.entry(hash(&word)).or_insert((word.to_string(), 0)).1 += 1;
}
println!("{:?}", frequency);

...
fn hash<T: Hash>(t: &T) -> u64 {
  let mut s = SipHasher::new();
  t.hash(&mut s);
  s.finish()
}

This seems to have all of the advantages, but now I worry about two things. First of all, complexity. We need to remember to use .0 and .1 instead of meaningful names. That sounds like a recipe for disaster when revisiting the code in a few months. Secondly, this does not take care of hash collisions in any way, so it might not actually produce the correct result.

Have any of you run into similar challenges? Any ideas or input on which is the fastest/most efficient solution, while being correct and maintainable?


#2

It’s not possible to ‘assign a lifetime’ to a value at all - the string is deallocated when its deallocated, and you can’t extend the lifetime further. Lifetimes are descriptive, not prescriptive.

However, I notice that the code snippets you have posted aren’t complete. It seems surprising to me that the strings yielded by the word_iterator do not live for as long as the hashmap. Surely the strings being iterated over were allocated prior to your hashmap being created.

I suspect that with a bit more context, we could solve that problem, so that you could calculate your frequency using the HashMap<&str, u32>, as would be ideal. Here’s a playpen example to demonstrate that this could work:

use std::collections::HashMap;

fn main() {
    let words = vec![String::from("alpha"), String::from("beta"), String::from("delta"), String::from("alpha")];

    let mut frequency: HashMap<&str, u32> = HashMap::new();
    for word in &words { // word is a &str
        *frequency.entry(word).or_insert(0) += 1;
    }
    println!("{:?}", frequency);
}

#3

What if you’re just loading the hashmap from a file for later processing?


#4

Is this what you mean? You have a function with a signature like this?

fn frequencies(file: File) -> HashMap<&str, u32>

So the file is read and closed during the execution of that function, and you want to return a HashMap of the frequencies.

In that case, you do need to return a HashMap<String, u32>, because the HashMap should have ownership of the String data that is allocated during the course of that function call.

But in that case, the issue is that an Iterator that yields &strs is not the right iterator for this use case, you should be using an Iterator that yields Strings.

An alternative approach, if the file isn’t too large for holding in memory, is to read the whole file into a single buffer & then run all processing on the file while that buffer is in scope, so that all the uses of the file’s data can just be slices into that buffer.


#5

The point that @bounce brought up in his original message is that while String seems to be right thing type here, i.e.

fn frequencies(file: File) -> HashMap<String, u32>

the problem is that he has to either

  • check whether a &str is already in the hash map after allocating a String (frequency.entry(word.toString())), which is a wasteful allocation when the string is already in the HashMap; or
  • perform two searches (frequency.getMut(word) followed by a conditional insert with word.toString()), which is wasteful computationally.

What HashMap seems to be missing is

fn entry_efficient<Q: ?Sized>(&mut self, key: &Q) -> Entry<K, V>
    where
        K: Borrow<Q>,
        Q: Hash + Eq + Into<K>

where, with Q = &str and K = String, it uses a &str parameter to look for the entry and only if the entry is missing, it allocates a String with (&str)::into() and insert into the HashMap.


#6

I understand that. My point is that you have an Iterator<Item = String>, you can use the existing entry API, which will be more efficient than what you are suggesting because you’ll never create a new string as part of performing this frequency check.

For example, the lines iterator on a BufReader yields Result<String>, not Result<&str>; any iterator over that file which yields string slices is maladapted to this purpose.


#7

That’s true about file input — it’s probably the wrong example. Consider the &strs are borrowed from some other temporary data structure then. :wink:


#8

I think you should restructure your code so that the strings exist as long the hashmap.

I don’t know if the API you’re proposing would be a good idea / would be accepted, but I do think this is evolutionary pressure from the compiler to refactor your code to avoid what should be unnecessary allocations. My experience, after getting over the learning curve and grokked the error messages, is that the borrow checker is a great friend for advising me how to avoid allocations.

(Probably the correct signature here is Q: ToOwned<Owned = K>, not Q: Into<K>, by the way).


#9

FYI, there’s actually an RFC under discussion with a quite similar proposal that was initiated two weeks ago.


#10

That RFC sounds like exactly what I’d need.

Until then, what might be the best alternative? Am I right in assuming that my final, hash-based idea is prone to hash-collisions? How could one fix the match based approach that is currently not allowed by the borrow-checker?

To clarify, the iterator comes from BufReader, which in turns reads from a file too large to fit into memory. So we only get &str to work with, causing this cascade of issues.


#11

This cannot possibly work.
The slices returned by BufReader point into an internal buffer. That buffer is reused and thus the slices would soon be invalidated.
If you keep the slice, you will also keep a borrow on the BufReader which prevents you from reading any further.
There’s a reason why the iterator returned BufReader::lines() yields String not &str.


#12

I doubt it can be done; after all, surely this would be the very reason why the Entry api was added in the first place! (So of course, it is a shame that it suffers from this critical design flaw…)

Edit: Whoops, I was hasty with the above. I somehow thought you were attempting to use get_mut in a way that would only cause a single search! (as for why I would expect that specific concern to inevitably cause a run-in the borrow-checker… in hindsight, I have no idea!)


As I see it, you currently have no choice but to settle for something less than optimal. Your best friend at a time like this is cargo bench.


#13

@bounce In the meantime, you do need to construct a HashMap<String, u32>, if your file is too large to hold in memory. You can do this in a couple ways and benchmark the results:

  • Have the iterator yield String and use the entry API.
  • Have the iterator yield &str and to_owned them if they’re not present, performing 2 lookups. You can perform the double lookup, but its somewhat ugly:
let has_word = match table.get_mut(word) {
    Some(count) => { *count += 1; true }
    None => false,
};
if !has_word {
    table.insert(word.to_owned(), 1);
}

Not only can it be done, it is somewhat being done (or shortly will be). The concept is called “non-lexical lifetimes,” and its an improvement to the borrow checker that was very difficult to perform with how the compiler was written around 1.0. One of the main goals of the refactor to introduce MIR was to enable the borrow checker to accept examples like this.


#14

A slightly more idiomatic version (via Option's helpers) would be:

table.get_mut(word)
    .map(|count| { *count += 1; })
    .unwrap_or_else(|| { table.insert(word.to_owned(), 1); });

#15

That’s a great solution until the RFC is accepted and implemented. Thank you!
Last nit pick, I read somewhere that now that .to_owned() and .to_string() do the same thing on &str, it is more idiomatic/readable to use .to_string(). Any opinions on that?


#16

There are 4 reasonable ways to convert a str to a String (there are more ways that are less reasonable, usually):

  • str.to_string()
  • str.to_owned()
  • str.into()
  • String::from(str)

It doesn’t really matter which you use, but which one I would use depends on context. For example, I always convert string literals to Strings using String::from, because it reads like a constructor.

In this context, I think to_owned makes the most sense, because the reason you’re performing the conversion is that you need an owned type. Your mileage may vary though, and I could see the argument for to_string here, or really any of these.


#17

What about using a HashMap<Cow<str>, u32>?


#18

I don’t thinks that helps here.
The problem ist the lifetime annotation, not cloning.
Cow has still the same lifetime restrictions as a plain reference.

What you’d need is some kind of weak reference that removes the lifetime restrictions and is converted to owned when the borrowed value goes out of scope. “Clone on Invalid”, not “Clone on Write”. I don’t think this is easily possible and probably not worth the trouble.