HashMap performance

This post is about optimizing the performance of counting the occurrences of the values in a collection of u32 values.
Using the Rust standard library, this can be implemented as a map from each value to a counter. The standard library provides two kinds of maps: std::collections::HashMap and std::collections::BTreeMap.

Using the first one, I wrote the following Rust program:

fn main() {
    let mut m = std::collections::HashMap::<u32, u32>::new();
    for i in 0..200_000_000 {
        *m.entry(i % 100_000).or_insert(0) += 1;
    }
    println!("{}", m[&40000]);
}

It prints "2000".
If "HashMap" is replaced by "BTreeMap", an equivalent program is obtained.
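For completeness, here is that BTreeMap variant spelled out; only the type name changes:

fn main() {
    let mut m = std::collections::BTreeMap::<u32, u32>::new();
    for i in 0..200_000_000 {
        *m.entry(i % 100_000).or_insert(0) += 1;
    }
    println!("{}", m[&40000]);
}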
Then I wrote the following equivalent C++ program, which also uses a GCC-only library extension from the "__gnu_pbds" namespace:

#include <iostream>
#include <ctime>
#include <map>
#include <unordered_map>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;

template<typename T>
void f() {
    clock_t start = clock(); 
    T m;
    for (int i = 0; i < 200000000; ++i) {
        auto j = i % 100000;
        auto pos = m.find(j);
        if (pos == m.end()) m[j] = 1;
        else ++pos->second;
    }
    cout << m[40000] << " " << float(clock() - start)
        / CLOCKS_PER_SEC << " s" << endl;    
}
    
int main() {
    f<map<unsigned int, unsigned int>>();
    f<unordered_map<unsigned int, unsigned int>>();
    f<__gnu_pbds::cc_hash_table<unsigned int, unsigned int>>();
}

The running times I got on one computer are the following:

  • std::collections::HashMap: 8.9 seconds;
  • std::collections::BTreeMap: 12.6 seconds;
  • std::map: 20.2 seconds;
  • std::unordered_map: 2.61 seconds;
  • __gnu_pbds::cc_hash_table: 0.83 seconds.

It appears that the Rust program using BTreeMap outperforms the C++ program using map (which is implemented as a red-black tree), but also that the fastest Rust program, the one using HashMap, takes more than 3 times as long as the program using unordered_map, and more than 10 times as long as the program using cc_hash_table.
To show that this is not a defect in the test program but an inefficiency of the library, I wrapped the C++ collections in a C++ library, and then called that library from a Rust program through FFI.
Here is the C++ source code of the library:

#include <unordered_map>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
static unordered_map<unsigned int, unsigned int> um;
static __gnu_pbds::cc_hash_table<unsigned int, unsigned int> ccht;
extern "C" void um_create() { }
extern "C" void ccht_create() { }
extern "C" void um_destroy() { um.clear(); }
extern "C" void ccht_destroy() { ccht.clear(); }
extern "C" void um_increment_count(unsigned int i) {
    try {
        auto pos = um.find(i);
        if (pos == um.end()) um[i] = 1;
        else ++pos->second;
    }
    catch (...) { }
}
extern "C" void ccht_increment_count(unsigned int i) {
    try {
        auto pos = ccht.find(i);
        if (pos == ccht.end()) ccht[i] = 1;
        else ++pos->second;
    }
    catch (...) { }
}
extern "C" unsigned int um_get_count(unsigned int i) {
    try {
        return um[i];
    }
    catch (...) { }
    return 0;
}
extern "C" unsigned int ccht_get_count(unsigned int i) {
    try {
        return ccht[i];
    }
    catch (...) { }
    return 0;
}

Here is the Rust source code of the program using that library:

use std::time::Instant;
#[link(name = "stdc++", kind = "static")]
#[link(name = "hash", kind = "static")]
extern {
    fn um_create();
    fn ccht_create();
    fn um_destroy();
    fn ccht_destroy();
    fn um_increment_count(i: u32);
    fn ccht_increment_count(i: u32);
    fn um_get_count(i: u32) -> u32;
    fn ccht_get_count(i: u32) -> u32;
}

fn main() {
    let t0 = Instant::now();
    unsafe {
        um_create();
        for i in 0..200_000_000 {
            um_increment_count(i % 100_000);
        }
        println!("{}", um_get_count(40000));
        um_destroy();
    }
    let d1 = t0.elapsed();
    println!("{}", d1.as_secs() as f64 + d1.subsec_nanos() as f64 / 1e9);

    let t1 = Instant::now();
    unsafe {
        ccht_create();
        for i in 0..200_000_000 {
            ccht_increment_count(i % 100_000);
        }
        println!("{}", ccht_get_count(40000));
        ccht_destroy();
    }
    let d2 = t1.elapsed();
    println!("{}", d2.as_secs() as f64 + d2.subsec_nanos() as f64 / 1e9);
}

And here is the Bash script to build the library and the program, and to run the generated program:

g++ -std=c++11 -fPIC -c -O3 hash2.cpp -o hash2.o && \
ar rcs libhash.a hash2.o && \
rustc -O -L. hash2.rs && \
./hash2

The resulting times are:

  • std::unordered_map: 2.69 seconds;
  • __gnu_pbds::cc_hash_table: 0.83 seconds.

The resulting times are the same as those of the C++-only program.
I don't know why the HashMap collection of the Rust standard library performs so poorly compared to unordered_map, and why unordered_map performs so poorly compared to cc_hash_table, but I think these differences should be taken seriously, either to improve the performance of the HashMap collection or to provide an alternative Rust collection.

8 Likes

My understanding is that by default Rust uses a pretty slow hash function which provides good protection from DoS attacks. More about it in the FAQ: https://www.rust-lang.org/en-US/faq.html#why-are-rusts-hashmaps-slow.

It would be interesting to know how ::std::collections::HashMap and unordered_map compare for the same hash function.

3 Likes

And I'd like to add that BTreeMap is faster than map mainly because the STL requirements for map forbid implementing it as a tree with a large branching factor. So we can't really compare Rust and C++ here, because they have different APIs. And while we are at it, here is an awesome long read about Rust's BTreeMap: Rust Collections Case Study: BTreeMap :slight_smile:

4 Likes

Using safe hashing by default is acceptable, but the Rust standard library needs to offer a really simple and compact way to say "don't use the safe hashing in this case".

Do benchmark FnvHasher and see if it's fast enough for your use cases, but there's more to hash table performance than just the hash function. Your hash table is 2 MB: 131,072 slots (the smallest power of two greater than 100k * 11/10) times (8-byte hash value + 4-byte key + 4-byte data). Rust's hash tables lay out the hashes, the keys, and the data in 3 separate subarrays, so every lookup requires a minimum of 3 L2 cache misses (assuming your L2 cache is smaller than 2 MB).
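For reference, a minimal sketch of the counting loop using FnvHasher, assuming the external fnv crate and its FnvHashMap alias (not part of std):

// Cargo.toml dependency assumed: fnv = "1" (external crate)
use fnv::FnvHashMap; // alias for HashMap with an FNV-based BuildHasher

fn main() {
    let mut m: FnvHashMap<u32, u32> = FnvHashMap::default();
    for i in 0..200_000_000u32 {
        *m.entry(i % 100_000).or_insert(0) += 1;
    }
    println!("{}", m[&40_000]); // should still print 2000
}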

2 Likes

You can already swap out the hashing algorithm rather easily. Do you mean you think there should be a second hasher in the standard library?

@carlomilanesi what's your rust version?

There is a recent improvement of hashing:

And there is another significant improvement that's possible:

After those two changes Rust hashing could be "fast enough". But when I tried it, swapping out hashing strategies in Rust was rather laborious, badly documented, and not easy for me (and I was just trying to hash a tuple of two f64, I think).

2 Likes

Here are my results with the latest nightly. "Simple" in this case is the same strategy used by unordered_map: just return the integer. We're actually faster.

Code: hash.rs · GitHub

2000
SIP13 Duration { secs: 6, nanos: 709803720 }
2000
SIP24 Duration { secs: 7, nanos: 654518957 }
2000
FNV Duration { secs: 5, nanos: 179056737 }
2000
Simple Duration { secs: 0, nanos: 897325662 }
And the output of the C++ program (./a.out):
2000 map 18.505 s
2000 unordered 1.9361 s
1 Like

It seems like you mean you were trying to implement a special hasher? This is a rather different problem from trying to swap out the hashing algorithm in a hash map.

If you have a hashing algorithm implemented using the interface defined in std, all you have to do to use it with HashMap is construct the map with the HashMap::with_hasher constructor.
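To make that concrete, here is a minimal sketch of a deliberately trivial (and DoS-unsafe) integer hasher plugged into HashMap through with_hasher; the IdentityHasher name and its byte-folding fallback are my own choices, not anything from std or from the gist linked above:

use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A deliberately trivial, DoS-unsafe hasher: the hash of a u32 is the value
// itself, roughly what libstdc++'s std::hash<unsigned int> returns.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        // Generic fallback: fold the bytes into the state
        // (keeps only the last 8 bytes, which is fine for a sketch).
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }
    fn write_u32(&mut self, i: u32) {
        // u32 keys take this fast path.
        self.0 = u64::from(i);
    }
}

fn main() {
    let mut m: HashMap<u32, u32, BuildHasherDefault<IdentityHasher>> =
        HashMap::with_hasher(BuildHasherDefault::default());
    for i in 0..200_000_000u32 {
        *m.entry(i % 100_000).or_insert(0) += 1;
    }
    println!("{}", m[&40_000]);
}

The key piece is std::hash::BuildHasherDefault, which turns any Default + Hasher type into the BuildHasher value that with_hasher expects.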

1 Like

C++ does even more trickery with its templates, and for integers it won't even store the hash, making the table more compact in memory.

The final purpose was to have a fast hash map where the keys were a pair of f64 values. So I needed to do both things.

As a side note, rustc -O corresponds to rustc -C opt-level=2. It likely won't have much of an impact on this code, but you may want to experiment with -C opt-level=3.

Just out of curiosity, what was the challenge in defining an implementation of the Hasher trait?

Take a look at the "simple" module code in the link above:

That's very bad, and that's the basic version. If you want smarter hashing and you have some f64s, it gets worse.

I don't see what is so bad about that. You have to define default, write, and finish. These seem like the basic building blocks of a hasher; I don't see how they could be made simpler. What practical improvement to this API would you make?

I am not able to use that API: it's too complex and hard to use for me, it's too badly documented, and I think it requires too much user code.

This feedback isn't very easy to put into action. :frowning: What user code do you think is required, but superfluous?

The simple hasher in your example has code to convert a byte array into a u64; I think that should be a function in the standard library, but it's not really related to hashing per se. And it doesn't apply to your (f64, f64) tuple example anyway.
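As an aside, more recent Rust versions do ship that building block: u64::from_le_bytes converts a byte array into a u64, so the conversion can be written as a tiny helper (the bytes_to_u64 name below is just for illustration):

// Fold up to the first 8 bytes of a slice into a u64 (little-endian).
fn bytes_to_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    let n = bytes.len().min(8);
    buf[..n].copy_from_slice(&bytes[..n]);
    u64::from_le_bytes(buf)
}

fn main() {
    assert_eq!(bytes_to_u64(&[1, 0, 0, 0]), 1);
    println!("{}", bytes_to_u64(b"abcdefgh"));
}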

Yes, let's drop this discussion, I'll just keep using the default hashing for now.

But in the Rust docs I'd like (much) better documentation of how to write code like that, with a few complete, longer examples: how to swap out the DoS-attack-resistant hashing for something terribly DoS-unsafe but with reasonable performance, how to hash all the built-in types (including the floating-point ones), and so on.
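To illustrate the floating-point case, here is one common workaround, sketched with a hypothetical PointKey wrapper (not from this thread) and assuming a Rust recent enough to have f64::to_bits: hash and compare the raw bit patterns, so that the Eq and Hash implementations stay consistent with each other.

use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// f64 implements neither Eq nor Hash (because of NaN and -0.0),
// so wrap the pair and work on the raw bit patterns instead.
struct PointKey(f64, f64);

impl PartialEq for PointKey {
    fn eq(&self, other: &Self) -> bool {
        // Bitwise equality: NaN equals itself, but -0.0 differs from 0.0.
        self.0.to_bits() == other.0.to_bits() && self.1.to_bits() == other.1.to_bits()
    }
}

impl Eq for PointKey {}

impl Hash for PointKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly the bits that eq() compares, keeping the Eq/Hash contract.
        self.0.to_bits().hash(state);
        self.1.to_bits().hash(state);
    }
}

fn main() {
    let mut m: HashMap<PointKey, u32> = HashMap::new();
    *m.entry(PointKey(1.5, -2.0)).or_insert(0) += 1;
    *m.entry(PointKey(1.5, -2.0)).or_insert(0) += 1;
    println!("{}", m[&PointKey(1.5, -2.0)]); // prints 2
}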

4 Likes

I sympathize with you that the HashMap documentation is a bit light on the default hashing algorithm used. It would be nice if it pointed to some alternative Hasher implementations, like rust-fnv. Though it probably shouldn't focus on how to implement Hasher.

1 Like