Lifetime puzzle: Eliminate allocations for hashmap keys

The following is a fragment of a larger program in which intern_string is used to stash 10,000–1,000,000 word-sized strings in the storage pool, after which the index is thrown away and the rest of the program works exclusively with the pool and the Substr instances that refer to it.

use std::collections::HashMap;

#[derive(Clone, Copy, Debug, Hash, Eq, PartialEq)]
pub struct Substr {
    pub start: usize,
    pub len: usize,
}

pub fn intern_string(
    storage: &mut String,
    index: &mut HashMap<String, Substr>,
    s: &str
) -> Substr {
    if let Some(ss) = index.get(s) {
        return *ss;
    }
    let start = storage.find(s).unwrap_or(storage.len());
    if start == storage.len() {
        storage.push_str(s);
    }
    let ss = Substr { start, len: s.len() };
    if index.insert(s.to_owned(), ss).is_some() {
        panic!("should never get here due to get() above");
    }
    ss
}

It seems like it ought to be possible to avoid allocating a second copy of each unique string for the hashmap keys, instead somehow referring to the copy in storage. But I can't figure out how. index cannot hold &strs pointing into storage because they would get invalidated by the push_str. And it cannot just be HashSet<Substr> because (AFAIK) that would necessitate Substr having two different Hash and Eq impls, one of which would have to take an extra parameter (namely storage).

The algorithm works if you remove index entirely and just look up all the strings in storage, but it becomes quadratic in the number of words, and it's expected that the input contains many duplicates of each word (which is why we're bothering to intern them in the first place). Replacing the entire mess with a HashSet<String> would make the rest of the program have quadratic time costs, for reasons I'd rather not get into.

What are my options for eliminating the extra allocations?

You can accomplish this with a hashbrown::HashTable<Substr>, because it lets the caller handle all hashing and equality with closures, so you can temporarily borrow the storage each time.

Another option since you're using String as a linear allocator would be to instead use a data structure intended for that purpose:

use bumpalo::Bump;
use std::collections::HashSet;

pub fn intern_string<'a>(
    storage: &'a Bump,
    index: &mut HashSet<&'a str>,
    s: &str
) -> &'a str {
    if let Some(ss) = index.get(s) {
        return *ss;
    }
    let new_val = storage.alloc_str(s);
    index.insert(new_val);
    new_val
}

This works because Bump::alloc_str takes the self parameter by shared reference. If allowing the string entries to overlap was an important feature (I'd think you wouldn't want that for interning strings, but I don't know the full use case), then you could implement a custom linear allocator of your own which works similarly and only hands out shared references to the underlying buffer.
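As a rough illustration of the "only hands out shared references" idea (without overlap support; each string gets its own heap allocation, so later allocations never move earlier ones), here is a minimal sketch. StrArena and its API are illustrative names, not from any library:

```rust
use std::cell::RefCell;

/// Minimal append-only string arena: allocating takes `&self` and
/// returns a `&str` that stays valid as long as the arena lives.
#[derive(Default)]
pub struct StrArena {
    chunks: RefCell<Vec<*mut str>>,
}

impl StrArena {
    pub fn alloc_str(&self, s: &str) -> &str {
        // Each string gets its own boxed allocation, so pushing more
        // strings never moves existing ones.
        let ptr: *mut str = Box::into_raw(s.into());
        self.chunks.borrow_mut().push(ptr);
        // SAFETY: the allocation is freed only in Drop, and the borrow
        // checker keeps every returned &str from outliving the arena.
        unsafe { &*ptr }
    }
}

impl Drop for StrArena {
    fn drop(&mut self) {
        for &ptr in self.chunks.get_mut().iter() {
            // SAFETY: each pointer came from Box::into_raw above.
            drop(unsafe { Box::from_raw(ptr) });
        }
    }
}
```

bumpalo's Bump does essentially this, but with large chunks instead of one allocation per string, which is far friendlier to the allocator.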

Here's a demo of using HashTable:

use hashbrown::HashTable;
use std::hash::BuildHasher;
use std::hash::RandomState;

#[derive(Clone, Copy, Debug, Hash, Eq, PartialEq)]
pub struct Substr {
    pub start: usize,
    pub len: usize,
}

impl std::ops::Index<Substr> for str {
    type Output = str;
    fn index(&self, k: Substr) -> &str {
        &self[k.start..][..k.len]
    }
}
impl std::ops::Index<Substr> for String {
    type Output = str;
    fn index(&self, k: Substr) -> &str {
        &self[..][k]
    }
}

#[derive(Default, Debug)]
pub struct Index {
    hasher: RandomState,
    table: HashTable<Substr>,
}

pub fn intern_string(storage: &mut String, index: &mut Index, s: &str) -> Substr {
    let hash = index.hasher.hash_one(s);

    if let Some(ss) = index.table.find(hash, |&k| storage[k] == *s) {
        println!("{s} already known");
        return *ss;
    }
    print!("inserting {s} new");
    let start = storage.find(s).unwrap_or(storage.len());
    if start == storage.len() {
        print!("  (at end of storage)");
        storage.push_str(s);
    }
    println!();
    let ss = Substr {
        start,
        len: s.len(),
    };
    index
        .table
        .insert_unique(hash, ss, |&k| index.hasher.hash_one(&storage[k]));
    ss
}

fn main() {
    let mut s = String::new();
    let mut i = Index::default();
    let mut interned = vec![];
    for x in [
        "foobar", "bar", "foo", "baz", "bar", "qux", "foo", "q", "azqu", "q", "foo", "wow",
    ] {
        interned.push(intern_string(&mut s, &mut i, x));
    }
    dbg!(&i);
    drop(i); // index no longer needed
    println!("\nstorage: {s:?}\n");
    for &x in &interned {
        println!("{}", &s[x]);
    }
}

(playground)

In case it's not clear why the original lifetime problem exists:

  • You presumably can't put the original slices into the HashMap, because their lifetime is too short.

  • And you can't put slices of the storage into the HashMap, since that would make the storage immutable, and for good reason: mutating it could reallocate the heap buffer, moving its address and invalidating the slices in the HashMap.

And std::collections::HashMap doesn't take outside context, so you can't put Substr into it either: without access to storage, the relevant slice can't be looked up during hashing and comparison. hashbrown has a better API, and bumpalo offers storage that doesn't move.

For completeness, and because I wanted to see how it looks: it is possible to use the standard HashMap too, if you're happy to put a reference to the storage into every Substr instance. That makes it strictly less efficient (on top of the overhead of the RefCell), and it's painful to actually borrow the slice; I left in the beginnings of a SubstrGuard that I started writing before realizing that the example usage code only uses Display, which doesn't need to return access. So I'm not suggesting this approach! But it shows that if you were required to use the standard HashMap, it could be made to work: Rust Playground

PS. It also necessitated changing Substr into an enum to allow for searching, since the standard HashMap requires the lookup key to have the same type as the stored keys (hashbrown does better here). Another source of inefficiency.

One could also use scoped_tls to inject the pool in the Hash without needing extra fields (nor RefCell), like this:

scoped_tls::scoped_thread_local! {
    static STRING_POOL: String
}

use std::borrow::Borrow;
use std::collections::HashSet;
use std::fmt::Display;
use std::hash::Hash;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
pub struct Substr {
    start: usize,
    len: usize,
}

impl std::ops::Index<Substr> for str {
    type Output = str;
    fn index(&self, k: Substr) -> &str {
        &self[k.start..][..k.len]
    }
}
impl std::ops::Index<Substr> for String {
    type Output = str;
    fn index(&self, k: Substr) -> &str {
        &self[..][k]
    }
}

/// newtype around [`Substr`], for new Hash implementation
//
// The simple PartialEq & Eq are still
// logically correct if interning works as intended.
#[derive(PartialEq, Eq)]
struct SubstrInPool(Substr);

impl SubstrInPool {
    /// # Panics
    /// Will panic if used without initialized `STRING_POOL` context
    fn with_str<R>(&self, callback: impl FnOnce(&str) -> R) -> R {
        STRING_POOL.with(|pool| callback(&pool[self.0]))
    }
}

/// # Panics
/// Will panic if used without initialized `STRING_POOL` context
impl Hash for SubstrInPool {
    /// # Panics
    /// Will panic if used without initialized `STRING_POOL` context
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.with_str(|s| s.hash(state));
    }
}

/// # Panics
/// Will panic if used without initialized `STRING_POOL` context
impl Display for SubstrInPool {
    /// # Panics
    /// Will panic if used without initialized `STRING_POOL` context
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        self.with_str(|s| f.write_str(s))
    }
}

impl std::fmt::Debug for SubstrInPool {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let mut d = f.debug_struct("SubstrInPool");
        d.field("start", &self.0.start);
        d.field("len", &self.0.len);

        if STRING_POOL.is_set() {
            self.with_str(|s| d.field("value", &s));
        } else {
            d.field(
                "value",
                &std::fmt::from_fn(|f| f.write_str("(unknown outside STRING_POOL context)")),
            );
        }

        d.finish()
    }
}

// helper trait to deal with lack of `Equivalent` in `HashSet` API
trait WithStrDyn {
    fn with_str_dyn(&self, callback: &mut dyn FnMut(&str));
}
impl WithStrDyn for SubstrInPool {
    fn with_str_dyn(&self, callback: &mut dyn FnMut(&str)) {
        self.with_str(callback);
    }
}
impl WithStrDyn for &str {
    fn with_str_dyn(&self, callback: &mut dyn FnMut(&str)) {
        callback(self)
    }
}
impl<'a> Borrow<dyn WithStrDyn + 'a> for SubstrInPool {
    fn borrow(&self) -> &(dyn WithStrDyn + 'a) {
        self
    }
}
impl Hash for dyn WithStrDyn + '_ {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.with_str_dyn(&mut |s| s.hash(state));
    }
}
impl PartialEq for dyn WithStrDyn + '_ {
    fn eq(&self, other: &Self) -> bool {
        let mut r = false;
        self.with_str_dyn(&mut |s1| other.with_str_dyn(&mut |s2| r = s1 == s2));
        r
    }
}
impl Eq for dyn WithStrDyn + '_ {}

// newtype to prevent unwanted (panicky) access outside `intern_string`
// with local pool unset for users of `intern_string`
#[derive(Debug, Default)]
pub struct Index(HashSet<SubstrInPool>);

pub fn intern_string(storage: &mut String, index: &mut Index, s: &str) -> Substr {
    if let Some(ss) = STRING_POOL.set(storage, || index.0.get::<dyn WithStrDyn + '_>(&s)) {
        return ss.0;
    }

    let start = storage.find(s).unwrap_or(storage.len());
    if start == storage.len() {
        storage.push_str(s);
    }

    let ss = Substr {
        start,
        len: s.len(),
    };
    STRING_POOL.set(storage, || {
        if !index.0.insert(SubstrInPool(ss)) {
            panic!("should never get here due to get() above");
        }
    });

    ss
}

fn main() {
    let mut storage = String::new();
    let mut index = Index::default();
    let mut interned = vec![];
    for x in [
        "foobar", "bar", "foo", "baz", "bar", "qux", "foo", "q", "azqu", "q", "foo", "wow",
    ] {
        interned.push(intern_string(&mut storage, &mut index, x));
    }

    STRING_POOL.set(&storage, || {
        dbg!(&index);
    });

    drop(index); // index no longer needed
    println!("\nstorage: {storage:?}\n");
    for &x in &interned {
        println!("{}", &storage[x]);
    }
}

(no playground link due to lack of scoped_tls in playground)

Whoa! A word-sized string is what, 8 bytes 99% of the time? Your Substr is already 16 bytes. You're losing space.
You should be clear about why you're doing this. If it's to regain space, here's what I did in a similar case: a custom storage, limited to 16 MB, with strings limited to 255 bytes. Now your Substr is only 4 bytes. For the lifetimes you have 2 options:

  • the "clean" one where you keep the lifetime on the reference to your storage everywhere, but it gets ugly fast
  • the "dirty" one where you Box::leak() your storage once it's populated, so you can drop the (now 'static) lifetime
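As a concrete illustration of that 4-byte handle (a sketch under the stated limits; PackedSubstr is my name, not code from the post), packing a 24-bit start with an 8-bit length:

```rust
/// 4-byte substring handle: 24-bit start (pool limited to 16 MB)
/// plus 8-bit length (strings limited to 255 bytes).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct PackedSubstr(u32);

impl PackedSubstr {
    pub fn new(start: usize, len: usize) -> Self {
        assert!(start < (1 << 24), "pool limited to 16 MB");
        assert!(len <= 255, "strings limited to 255 bytes");
        PackedSubstr(((start as u32) << 8) | len as u32)
    }
    pub fn start(self) -> usize {
        (self.0 >> 8) as usize
    }
    pub fn len(self) -> usize {
        (self.0 & 0xFF) as usize
    }
    /// Resolve the handle against the pool it was created from.
    pub fn get(self, pool: &str) -> &str {
        &pool[self.start()..][..self.len()]
    }
}
```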

I should really have thought of that. The words are often longer than 8 bytes, but almost never longer than 16 bytes. Here's a histogram:

 3973│       ●●
 3764│      ●●●
 3555│      ●●●
 3346│      ●●●●
 3137│      ●●●●
 2928│     ●●●●●
 2719│     ●●●●●●
 2510│     ●●●●●●
 2301│     ●●●●●●
 2092│     ●●●●●●
 1882│     ●●●●●●●
 1673│     ●●●●●●●
 1464│    ●●●●●●●●
 1255│    ●●●●●●●●●
 1046│    ●●●●●●●●●
  837│    ●●●●●●●●●●
  628│    ●●●●●●●●●●
  419│   ●●●●●●●●●●●●
  210│   ●●●●●●●●●●●●●
    1│●●●●●●●●●●●●●●●●●●●●●●●●●●●●    ●     ●          ●
     └──────────────────────────────────────────────────
      0         1         2         3         4
      01234567890123456789012345678901234567890123456789

I should probably look into "German strings" or similar.
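For reference, here is a much-simplified sketch of the "German string" idea: short strings live inline in the handle itself, longer ones live in the pool with a cached prefix for cheap comparisons. GermanStr, the 12-byte inline threshold, and the layout only loosely follow the classic Umbra design; the names are illustrative:

```rust
#[derive(Clone, Copy, Debug)]
pub enum GermanStr {
    /// Up to 12 bytes stored directly in the handle, no pool access needed.
    Inline { len: u8, buf: [u8; 12] },
    /// Longer strings live in the pool; the prefix lets many
    /// comparisons fail fast without touching the pool at all.
    Pooled { prefix: [u8; 4], start: u32, len: u32 },
}

impl GermanStr {
    pub fn new(pool: &mut String, s: &str) -> Self {
        if s.len() <= 12 {
            let mut buf = [0u8; 12];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            GermanStr::Inline { len: s.len() as u8, buf }
        } else {
            let start = pool.len() as u32;
            pool.push_str(s);
            let mut prefix = [0u8; 4];
            prefix.copy_from_slice(&s.as_bytes()[..4]);
            GermanStr::Pooled { prefix, start, len: s.len() as u32 }
        }
    }
    pub fn get<'a>(&'a self, pool: &'a str) -> &'a str {
        match self {
            GermanStr::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            GermanStr::Pooled { start, len, .. } => {
                &pool[*start as usize..][..*len as usize]
            }
        }
    }
}
```

With your histogram, almost every word would take the Inline path and never touch the pool at all.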

However, the distribution of word repeats looks more like an L-shape. Of ~30,000 words in my test corpus, 60% of all the words are unique, 10% are repeated at least ten times; only 260 are repeated 100 times or more, and 29 are repeated 1000 times or more. So you can see that a string pool is still going to save quite a bit of space -- as long as I use compressed indices of some form, anyway.

With such short strings I'm not sure that would help; compact_str, for example, is 24 bytes. I would instead look at a string interning library where the key is only 32 bits: you might not need more than that for unique words. I have used lasso to great effect in the past, but it seems it is now unmaintained. Other options exist, but I cannot make specific recommendations.

You could also build your own interner, I did this for a project where I needed zero copy deserialisation of the data. I basically stored a vector of bytes and a 32 bit index into that. At that index would be stored (length, string-data) as an u16 followed by the bytes of the string. From that you can use a single index to reconstruct a &str on demand. Sounds like you could even store the length as a u8 in your case. I wrote about it in this blog post (which is about a larger project that this interner was a part of): Filkoll - The fastest command-not-found handler
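A minimal sketch of that layout, assuming the u16 length prefix described above (ByteInterner is an illustrative name, not the actual code from the blog post):

```rust
/// Length-prefixed interner: all strings in one byte buffer, each
/// preceded by a little-endian u16 length. A single u32 offset is
/// enough to reconstruct a &str on demand.
#[derive(Default)]
pub struct ByteInterner {
    bytes: Vec<u8>,
}

impl ByteInterner {
    /// Append a string and return its offset handle.
    pub fn push(&mut self, s: &str) -> u32 {
        let offset = self.bytes.len() as u32;
        let len = u16::try_from(s.len()).expect("string too long for u16 length");
        self.bytes.extend_from_slice(&len.to_le_bytes());
        self.bytes.extend_from_slice(s.as_bytes());
        offset
    }
    /// Reconstruct the &str stored at `offset`.
    pub fn get(&self, offset: u32) -> &str {
        let i = offset as usize;
        let len = u16::from_le_bytes([self.bytes[i], self.bytes[i + 1]]) as usize;
        std::str::from_utf8(&self.bytes[i + 2..][..len]).expect("valid utf8")
    }
}
```

Note this sketch does no deduplication; in a real interner you would pair it with a lookup table (e.g. the hashbrown::HashTable approach above) keyed by the stored bytes.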

This is helpful, and the reference to rkyv in your post is also helpful: a longer-term goal is to be able to dump the output of the next (expensive) stage of processing, after the one I originally posted, to disk in a form that can be read back in very efficiently. I'll probably wind up putting together a custom interner of some sort, and not worry so much about transient allocations during initialization.

I'm probably not going to end up actually using hashbrown::HashTable, as several people have suggested options that make more sense in the context of the larger problem, but this is the most direct answer to the question I originally asked, so I'm marking it as the solution.

Happy you found my blog useful. Note that rkyv is zero-copy deserialisation but not zero-copy serialisation. The only format I know of that is zero-copy both ways (unless you roll your own with repr(C) structs) is https://capnproto.org/, but like protobuf you generate code from definition files with it. If you don't care about multi-language support, that is far clunkier than derives.

Also, I haven't tested Cap'n Proto myself; that's just based on what I've read about it. I would expect a Rust-first library to have better support for complex types, for example.