I made a new string interning crate:
It includes 3 features that I've needed in a couple of situations:
- It (optionally) includes a built-in global symbol table, so you can just say GlobalSymbol::from("foobar") to intern things into the global table.
- It's well-suited to concurrent access; a symbol table is sharded to reduce lock contention. In my limited benchmarking, it's the fastest string interner under medium/high contention, and reasonably fast in no/low contention.
- It hands out stable &'a strs, so the built-in global table can give out &'static strs. This implementation is based in part on matklad's blog post.
Maybe it's useful for you too!
How does this compare to ustr?
I've tried my hand at string interning myself and found ustr hard to beat.
Ah, I wasn’t aware of ustr, looks great! I’ll include it in my benchmarks next time.
Some quick differences:
- My symbols are u32s, ustr’s are pointers.
- I use hashbrown; that might be faster in some cases.
Note that std's hashmap is also hashbrown, so unless you're using something that the std container doesn't support, it shouldn't make a difference.
Yes, but at first glance, I believe they are rolling their own hashtable somewhere in there. Also, I use raw_entry, which isn’t available in std.
u32 symbol ids are nice, but cached hashes are nicer, especially if they let you use Ustr as a map key. They even have a UstrSet. But this is more dependent on the use case.
But just comparing the raw interning step, ustr works really well even under contention. Looking forward to your updated benchmarks.
ustr's performance is definitely best-in-class! It's 1.5-2.0x faster at interning!
u32 symbols are too good to give up for my domain, though, since creating strings is ultimately not a bottleneck, but working with them might be. It would be really interesting to see if you could get the best of both worlds!
Without any backing proof, I theorize that ustr finds room for this primarily by taking advantage of negative feature space, namely that
- giving out pointers to the cache entry means they don't have to maintain a lookup sidetable or go through one for lookup
- each interned string is its own separate allocation rather than aggregating them
- the OIIO hashtable could be more specialized to the append-only use case than just sharding hashbrown
Two other small differences I found while scanning:
- you're using ahash::RandomState as your BuildHasher, whereas ustr is using a fixed-key ahash or CityHash
- ustr uses the high hash bits to shard; you're using the low bits
- you want to use separate bits for sharding between maps and for bucketing inside each map, to avoid inducing an increase in the hash collision rate
dashmap chooses to use the high bits after skipping the highest 7 (apparently used by hashbrown's SIMD tag)
With u32 symbols, hashing a symbol is just hashing the u32, which should be extremely quick. Not quite as fast as an identity hash, but still a lot faster than hashing the full string.
I'm back and comparing interners again, apparently. While I don't think you can match ustr's performance ceiling while giving out a smaller-than-usize key, you could provide a precomputed hash by just adding another sidetable.
I might end up providing a sharded version of my research string interner and add a pre-hash sidetable to it as well.