Fastest StrMap?

When using a HashMap for simple string lookups, e.g. HashMap<String, u32>, is it necessary to change the default hasher implementation in order to gain better performance out of the box?

Let’s say the string keys are usually in the range of like 6-12 characters or so.

Better compared to what? :slight_smile: You’ll only get good comparable data by benchmarking different approaches with your actual workload.

As a first step, check out the hashbrown crate. It uses FxHash by default (works well for small keys) and is generally assumed to be faster than the current implementation in std. In fact, the plan is to make this the new implementation in std. It already has the same API so changing to is most likely just replacing std::collections::HashMap with hashbrown::HashMap.

But, what do you actually want to do? There are a lot of string-specific data structures out there, and depending on your use case a hash map might not be the best choice. For example, fst also gives you a type mapping from strings to integers but makes quite different trade-offs – you need to insert keys in lexicographic order but gain a very efficient, serializable representation as well as performant fuzzy matching.

1 Like

Here’s a few more details:

  • I’m willing to trade slightly longer insertion times, in exchange for faster lookup times.
  • I’m willing to trade a bit more memory consumption, in exchange for faster lookup times.
  • The keys or strings are not predictable - but it’s very unlikely they will be longer than ~32 chars. If it really helps, I can know the max length, before creating the HashMap (but still only at runtime)
  • The values or integers are usually going to be on the smaller side, if necessary I can do some juggling to test for this before the HashMap’s creation. If it helps to know that they are in a predictable order, there are some cases where that’s true (but others where it isn’t)

Does that help?

More specifically - the use-case is basically caching different things about shaders that are known when the shader is compiled and expensive to lookup later (uniform locations, attribute locations, etc.)

On Thursday it would be part of stable

2 Likes

Does that mean that now, (at least on nightly - maybe even stable), for my use case I’m pretty much doing the right thing by just using HashMap as-is?

Beta and current nightly should have hashbrown included.
But, I suppose stdlib HashMap uses DDoS protection,
so its hash function not optimal if you don’t care about DDoS,
then you can use stdlib HashMap with more fast hash function: https://docs.rs/rustc-hash/1.0.1/rustc_hash/type.FxHashMap.html

Plus may be something like https://github.com/rust-analyzer/smol_str may be more suitable for you instead of String (or may be not).

I apologize but I’m still not totally clear on this…

If I use HashMap from std, will it actually be the hashbrown implementation (on nightly) - or must I explicitly opt-in to hashbrown?

If I must explicitly opt-in to hashbrown, do I do that via importing the external crate, or is it now going to be sitting somewhere in the standard library?

Everything on std will use hashbrown, but still with the former DoS-resistant hasher by default. If you want FxHash or similar, you’ll have to opt into that explicitly.

And FxHash is not part of the standard library - rather it’s https://github.com/cbreeden/fxhash ?

There are no alternate hashers in std. I think hashbrown bundles its own internal implementation of the hasher, but AFAIK it’s the same as the one you found.

Hehe I feel funny asking such questions that we’d knock out in like 10 seconds over a drink… but…

Everything on std will use hashbrown
+
I think hashbrown bundles its own internal implementation of the hasher
+
There are no alternate hashers in std

Aren’t we stuck in a loop?

The hashbrown crate provides a fast hash function and uses it by default. But that hash function can be changed, and std configures it HashMap with a different hasher, that provides additional security when hashing untrusted keys at the expense of run-time performance. Does that help?

I mean… I get the concept (as of a few posts ago)… but…

  1. If the hashbrown crate is no longer needed because it’s being folded into std

  2. and std changes the default hasher for hashbrown to be something else (SipHash or whatever)

  3. and I want to use FxHash instead

  4. then if I try to pull it in from hashbrown I hit a block because - see step 1

As naive as this might seem, hope it makes sense why I’m stuck in a loop?

To unravel this though - I think the confusion is because std isn’t actually going to be using hashbrown, rather it is using the algorithm that was originally written for hashbrown - but it’s not actually bringing the crate itself into std.

So the solution, if I want to use FxHash, is actually to still depend on hashbrown, or more specifically hashbrown::FxHash, even though hashbrown::HashMap (and only HashMap - or at least not the extraneous hashers) is now at std::collections::HashMap ?

May be you missed the point that, std::collections::Hash(Map|Set) is generic,
and one parameter of this generic is HashBuilder.
For example https://docs.rs/rustc-hash/1.0.1/rustc_hash/index.html is crate that uses std collections,
it just set another HashBuilder by default.

No, see libstd uses hashbrown as crate and reexport it in libstd

Just use crate like https://docs.rs/rustc-hash/1.0.1/rustc_hash/index.html

ok… so in terms of a practical solution - I should probably be using FxHashMap ?

Yes.
Or may be if in your particular case you can invent better hash function you can just insert it as the third parameter into std::collections::HashMap.

2 Likes