I'm trying to learn the "right" (or idiomatic) way to do some things in Rust, after a long long time using only languages with garbage collection. I've been trying to translate some Python code as an exercise, and I've across something that has left me wondering how to do it.
I have a program that fetches and processes a large amount of data - usually tens of millions of entries. Each entry has a tag (a small string), and there are hundreds of thousands of possible tags (it's a fixed set). But in this program, we never process data having more than like 50 different tags each time. So what I started think was: instead of duplicating these strings millions of times, could I have some kind of "global" cache? So that I could just have references to them, and avoid moving these strings all around as I move the entries?
I say global, but I don't necessarily mean exactly global, but maybe something inside some struct that lives for the duration of the whole program, which could have a HashSet with all the strings observed so far, and we could just reference these strings (a reference or a Rc pointer).
Since I know that my tags are quite small (less than 10 ascii characters most of the time), I'll probably just copy them, shouldn't be the bottleneck in my program. But what if they were a bit larger? Would it make sense to do something like I said?
Also, I know that since I'm parsing external data, I'll actually be parsing and creating these duplicated strings all the time, so just keeping it instead of discarding it wouldn't make much of a difference in terms of performance, but maybe I could reduce memory usage a lot by not duplicating these for each entry (if for example I'm keeping a lot of these in memory).
Sorry for the long text with lots of hypothetical ideas, I'm just more and more curious about memory management the more I learn about Rust!
This is called "string interning", and there are several crates dedicated to it. You can just search crates.io for this expression.
If you want to do it yourself, you will likely want to start with a once_cell::sync::Lazy<HashSet<String>>.
Copying (i.e., the number of bytes) is almost never the slow part unless the strings are unusually long (say, 10s of megabytes at least). Usually, the slow part of copying a string is allocating a new heap buffer. You can also google for "small string optimization" to learn more about how this problem is usually dealt with.
I never actually implemented SmartBeef IIRC, it stayed just a design experiment. (As I said then, it barely has a use case beyond intellectual curiosity.)
But surprisingly, despite doing a significant amount of size optimization for SSO types and string interners, I never actually considered combining the two approaches... Perhaps it's time to dust off strena again.
A cursory consideration suggests that a 2×usize layout with 100% inline length (16 bytes on 64bit) should be fairly simple to achieve:
and comparison remains a simple equivalence check if you ensure short strings are never interned.
The downside to traditional interning is that traditional interning uses 1×usize (or even smaller) keys and a single shared contextual reference to the interner.
I'm not awake enough to napkin math when which approach is more space optimal. SSO wins when you have “enough” unique small strings to prove outweigh the larger key size, but interners are generally used for when there are a (relatively) smaller number of unique strings and a (relatively) larger number of copies of the keys. “Enough” depends on the overhead of your interner. I think SSO winning on size is extremely niche, so the advantage would instead be avoiding calling out to the interner when resolving a string or interning a new string... which is also a relatively rare operation in a normal interner workload.
A 24 bit key could be stored as u32 inlining strings up to 4 bytes. A 56 bit key could be stored as u64 inlining strings up to 8 bytes. Is it worth the complexity to store those truly tiny strings inline? Actually maybe, since tiny strings have the most overhead to intern. (They keys also could maybe get another bit or two for more complexity accessing an interned string.)
I remember that when I used some sort of global cache for a task that I executed on multiple cores using rayon, the locking (at least I think it was the locking) had such a performance impact that I decided to remove the cache and rather keep copies around.
Not sure if I did something wrong though in my implementation. Nonetheless, I believe that performance can be an issue.