Non-blocking Hash Table in Rust


#1

Hi everyone,

I’ve been working on the side on a new hash table design and recently posted the Rust implementation. As far as I know, it provides best-in-class performance (in any language) for both read-heavy and write-heavy workloads, though you can probably do better in read-only scenarios. It is a chaining hash table, and it supports resizing in a way that doesn’t block readers or writers.

Some Links:

  • The code (see src/sync/hash_table.rs).
  • The writeup focuses on the algorithm and less on the Rust specifics, but the pseudo-code is pretty similar to the Rust implementation.

I don’t think this code is production ready yet, as:

  • The code (currently a crossbeam fork) should be cleaned up.
  • There are some low-level optimizations (see Appendix D of the writeup) that will improve single-threaded performance, which isn’t great right now.
  • To my knowledge, this design is new, so more work is needed to verify its correctness.

I’m posting here because I think some of y’all will find this interesting, and I’d love some feedback!


#2

In Stamp::delete, why don’t you use an atomic OR operation?


#3

I should indeed be using an atomic OR there. I don’t think fetch_or showed up in my vim autocompletion when I was writing that method, so I did the load/store thing instead. Thanks for catching that!
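For readers following along, here is a minimal sketch (not the actual Stamp::delete code) of why the load/store version is a bug under concurrency while fetch_or is not: the two-step version has a window between the load and the store in which another thread’s update can be silently overwritten.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Two separate atomic operations: another thread can modify the word
// between the load and the store, and its update is then lost.
fn set_bit_racy(word: &AtomicUsize, bit: u32) {
    let old = word.load(Ordering::Acquire);
    word.store(old | (1 << bit), Ordering::Release);
}

// One atomic read-modify-write (a LOCK OR instruction on x86):
// there is no window in which a concurrent update can be lost.
fn set_bit_atomic(word: &AtomicUsize, bit: u32) {
    word.fetch_or(1 << bit, Ordering::AcqRel);
}

fn main() {
    let w = AtomicUsize::new(0b0001);
    set_bit_atomic(&w, 2);
    assert_eq!(w.load(Ordering::SeqCst), 0b0101);
    // Single-threaded the racy version gives the same answer;
    // the lost-update bug only bites under contention.
    set_bit_racy(&w, 1);
    assert_eq!(w.load(Ordering::SeqCst), 0b0111);
    println!("ok");
}
```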


#4

This looks very promising, great work!

Did you perhaps compare performance to junction, TBB, and Folly? I’d be very interested to see how it fares against those hash table implementations.


#5

I just took a look at junction. Junction’s Leapfrog algorithm is related to the hopscotch hash table I used in the performance comparisons for the writeup, so I would expect Grampa and Leapfrog to be roughly in the same ballpark as the hopscotch measurements, potentially doing better.

I spent the last hour trying to run the junction benchmarks on my own machine, and all my runs show those tables not scaling past 5Mops/second, which is much lower than the numbers in the post you linked to. My guess is that my attempts to get cmake to build with optimizations were futile (I am a cmake novice). I’ll try to give the junction benchmarks another go soon.

On a slightly more philosophical note: my hash table claims a stronger set of progress guarantees than the ones you linked to above. Those range from blocking to lock-free, whereas I believe mine is wait-free (at least on x86, which has an atomic fetch-add instruction).
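To illustrate the distinction (this is a generic sketch, not code from my table): a single fetch_add compiles to one LOCK XADD on x86, so every thread finishes in a bounded number of steps, whereas a CAS loop is only lock-free, since an unlucky thread can lose the race an unbounded number of times.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Wait-free: one atomic instruction, bounded steps per call
// regardless of how many other threads are hammering the counter.
fn claim_slot(next: &AtomicUsize) -> usize {
    next.fetch_add(1, Ordering::AcqRel)
}

// Lock-free but not wait-free: some thread always makes progress,
// but any particular thread can retry the CAS indefinitely.
fn claim_slot_cas(next: &AtomicUsize) -> usize {
    loop {
        let cur = next.load(Ordering::Acquire);
        if next
            .compare_exchange_weak(cur, cur + 1, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return cur;
        }
    }
}

fn main() {
    let n = AtomicUsize::new(0);
    assert_eq!(claim_slot(&n), 0);
    assert_eq!(claim_slot_cas(&n), 1);
    assert_eq!(n.load(Ordering::SeqCst), 2);
}
```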


#6

warning: interpretation of microbenchmarks is an imprecise business.

Following up on this, I was able to reproduce some results for junction’s Grampa map. Using the default benchmark settings I got around 160 Mops/s at 80% reads on 16 hardware threads. That’s a lot better than the numbers for the current version of my hash table. However, if you increase the number of operations the benchmark performs, it drops closer to 40 Mops/s, which is worse. I’ll try to drill deeper into the benchmark code to see if I can get closer to an apples-to-apples comparison, and update the writeup accordingly.

I’ll try to do the same for Folly and TBB as well, though it appears that Folly’s hash table only allows integer keys and limits how much you can grow the table. That’s obviously still a very useful data structure, but it is meaningfully different from supporting arbitrary (hashable) key types and arbitrary resizing.