New string interning crate: `symbol_table`

mwillsey · May 6, 2022, 9:00pm

I made a new string interning crate: symbol_table.

It includes 3 features that I've needed in a couple of situations:

It (optionally) includes a built-in global symbol table, so you can just say GlobalSymbol::from("foobar") to intern things into the global table.
It's well-suited to concurrent access; a symbol table is sharded to reduce lock contention. In my limited benchmarking, it's the fastest string interner under medium/high contention, and reasonably fast in no/low contention.
It hands out stable &'a strs, so the built-in global table can give out &'static strs. This implementation is based in part on matklad's blog post.
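To make the three points above concrete, here is a minimal std-only sketch of a sharded interner with u32 symbols, a global table, and stable &'static strs. It is not the crate's actual implementation or API: the names `Interner`, `Symbol`, `global`, and `SHARD_BITS` are assumptions made for illustration, and it uses `Mutex<HashMap>` shards and leaked boxes where the real crate uses hashbrown and its own storage.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Mutex, OnceLock};

const SHARD_BITS: u32 = 4; // hypothetical: 16 shards
const N_SHARDS: usize = 1 << SHARD_BITS;

/// Hypothetical symbol: shard index in the low bits, per-shard index above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Symbol(u32);

#[derive(Default)]
struct Shard {
    ids: HashMap<&'static str, u32>,
    strings: Vec<&'static str>,
}

pub struct Interner {
    shards: Vec<Mutex<Shard>>,
}

impl Interner {
    pub fn new() -> Self {
        let shards = (0..N_SHARDS).map(|_| Mutex::new(Shard::default())).collect();
        Interner { shards }
    }

    /// Choose a shard from the high bits of a fixed-key hash.
    fn shard_of(s: &str) -> usize {
        let mut h = DefaultHasher::new();
        s.hash(&mut h);
        (h.finish() >> (64 - SHARD_BITS)) as usize
    }

    pub fn intern(&self, s: &str) -> Symbol {
        let shard_idx = Self::shard_of(s);
        // Only one shard is locked, so threads interning strings that land
        // in different shards do not contend with each other.
        let mut shard = self.shards[shard_idx].lock().unwrap();
        let existing = shard.ids.get(s).copied();
        let local = match existing {
            Some(i) => i,
            None => {
                // Leak the string so the reference stays valid forever.
                let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
                let i = shard.strings.len() as u32;
                shard.strings.push(leaked);
                shard.ids.insert(leaked, i);
                i
            }
        };
        // A real implementation would need to guard against overflow here.
        Symbol((local << SHARD_BITS) | shard_idx as u32)
    }

    pub fn resolve(&self, sym: Symbol) -> &'static str {
        let shard_idx = (sym.0 as usize) & (N_SHARDS - 1);
        let shard = self.shards[shard_idx].lock().unwrap();
        shard.strings[(sym.0 >> SHARD_BITS) as usize]
    }
}

/// Lazily initialized global table.
pub fn global() -> &'static Interner {
    static GLOBAL: OnceLock<Interner> = OnceLock::new();
    GLOBAL.get_or_init(Interner::new)
}
```

With this sketch, `global().intern("foobar")` returns the same `Symbol` every time, and `global().resolve(sym)` hands back a `&'static str` because the backing allocation is never freed.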

Maybe it's useful for you too!

RustyYato · May 7, 2022, 12:13am

How does this compare to ustr?

I've tried my hand at string interning and found it hard to beat.

mwillsey · May 7, 2022, 12:32am

Ah, I wasn’t aware of ustr, looks great! I’ll include it in my benchmarks next time.

Some quick differences:

My symbols are u32s, ustr’s are pointers.
I use hashbrown; that might be faster in some cases.

scottmcm · May 7, 2022, 2:50am

Note that std's hashmap is also hashbrown, so unless you're using something that the std container doesn't support, it shouldn't make a difference.

mwillsey · May 7, 2022, 3:31am

Yes, but at first glance, I believe they are rolling their own hashtable somewhere in there. Also, I use raw_entry, which isn’t available in std.

RustyYato · May 7, 2022, 5:30pm

Yeah, u32 symbol ids are nice, but cached hashes are nicer, especially if that lets you use the Ustr in hash maps. They even have a UstrMap and UstrSet. But this is more dependent on the use case.
But just comparing the raw interning step, ustr works really well even under contention. Looking forward to your updated benchmarks.

mwillsey · May 9, 2022, 9:16pm

Wow, ustr's performance is definitely best-in-class! It's 1.5-2.0x faster at interning!

Unfortunately, u32 is too good to give up for my domain, since creating strings is ultimately not a bottleneck, but working with them might be. It would be really interesting to see if you could get the best of both worlds!

CAD97 · July 8, 2022, 9:45pm

Without any backing proof, I theorize that ustr finds room for this primarily by taking advantage of negative feature space, namely that

giving out pointers to the cache entry means they don't have to maintain a lookup sidetable or go through one for lookup
each interned string is its own separate allocation rather than aggregating them
the OIIO hashtable could be more specialized to the append-only use case than just sharding hashbrown

Two other small differences I found while scanning:

you're using ahash::RandomState as your BuildHasher, whereas ustr is using a fixed-key ahash or hashcity
ustr uses the high hash bits to shard; you're using the low hash bits
- you want to use separate bits to shard between maps and to bucket inside the map to avoid inducing an increase in hash collision rate
- dashmap chooses to use the high bits after skipping the highest 7 (apparently used by hashbrown's SIMD tag)
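The two ways of picking a shard can be sketched as below. This is an illustration rather than code from ustr, dashmap, or symbol_table: the function names and `TAG_BITS` are assumptions, and the high-bits variant only approximates the scheme described for dashmap. The background is that hashbrown derives the bucket index from the low bits of the hash and a 7-bit tag from the top bits, so sharding on the low bits reuses bits the inner map also depends on.

```rust
/// Top hash bits to skip, assumed here to be the ones hashbrown
/// uses for its per-entry tag.
const TAG_BITS: u32 = 7;

/// Shard from the high bits, after skipping the top `TAG_BITS`.
/// `shard_bits` is log2 of the shard count, in 1..=(64 - TAG_BITS).
pub fn shard_high(hash: u64, shard_bits: u32) -> usize {
    ((hash << TAG_BITS) >> (64 - shard_bits)) as usize
}

/// Shard from the low bits. Every key in a given shard then shares
/// these bits, which the inner table also uses to choose a bucket.
pub fn shard_low(hash: u64, shard_bits: u32) -> usize {
    (hash & ((1u64 << shard_bits) - 1)) as usize
}
```

With 16 shards (`shard_bits = 4`), `shard_high` reads bits 53..=56 of the hash, leaving both the tag bits and the low bucket bits untouched.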

When using u32 symbols, hashing the symbol is just hashing the u32, which should be extremely quick. Not quite as fast as an identity hash, but still a lot faster than hashing the full string.
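As a rough illustration of the difference, here is a u32 symbol used as a map key with an identity hasher. `Symbol`, `IdentityHasher`, and `SymbolMap` are hypothetical names, not types from either crate. One caveat: an identity hash leaves the top bits of the hash at zero for small ids, so whether it actually beats a cheap mixing hash in a hashbrown-style table would need measuring.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical u32 symbol. The derived `Hash` feeds the inner
/// value to the hasher through `write_u32`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Symbol(pub u32);

/// Hypothetical identity hasher: the hash is the u32 itself.
#[derive(Default)]
pub struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-u32 input; not expected for `Symbol` keys.
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    fn write_u32(&mut self, n: u32) {
        self.0 = u64::from(n);
    }
}

/// Map keyed by symbols that skips string hashing entirely.
pub type SymbolMap<V> = HashMap<Symbol, V, BuildHasherDefault<IdentityHasher>>;
```

A `SymbolMap<V>` is created with `SymbolMap::default()` and otherwise behaves like any other `HashMap`; lookups never touch the string data.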

I'm back^[1] and comparing interners again, apparently. While I don't think you can match ustr's performance ceiling while giving out a smaller-than-usize key, you could provide a precomputed hash by just adding another sidetable.
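A pre-hash sidetable could look roughly like the sketch below: a second vector, indexed by symbol id, that stores each string's hash at interning time. This is a single-threaded illustration with assumed names (`Interner`, `hash_str`, `precomputed_hash`), not the design of any existing crate, and it stores each string twice for simplicity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Symbol(u32);

#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
    /// Sidetable: `hashes[i]` is the hash of `strings[i]`.
    hashes: Vec<u64>,
}

/// Hash a string once with a fixed-key hasher.
pub fn hash_str(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

impl Interner {
    pub fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&id) = self.ids.get(s) {
            return Symbol(id);
        }
        let id = self.strings.len() as u32;
        self.ids.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        // Pay for the hash once, at interning time.
        self.hashes.push(hash_str(s));
        Symbol(id)
    }

    pub fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }

    /// Look up the cached hash without rehashing the string.
    pub fn precomputed_hash(&self, sym: Symbol) -> u64 {
        self.hashes[sym.0 as usize]
    }
}
```

The trade-off is an extra 8 bytes per interned string and one more indexed load per hash lookup, in exchange for never rehashing the string contents.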

I might end up providing a sharded version of my research string interner and add a pre-hash sidetable to it as well.

[1] I wrote a string interner comparison in 2020 that compared existing interners' allocation patterns to a maximally compact interner using matklad's described technique.

