Feedback On Simple String Interner With RefCell And Unsafe

Hi, I've been learning Rust' interior mutability and I'm trying to implement a string interner. Below are my codes. I use RefCell and unsafe blocks to make the intern method accepts immutable reference of self (&self), because I feel like the interner should be able to be shared. Kind of like an Arena.

I have some doubts about this implementation:

  1. Does it make sense to accepts shared reference to intern a string?
  2. Does my implementation use RefCell and unsafe safely?
    I feel it should be safe because I never remove an entry and every String is stored using owned String type which allocate the str in heap, and thus it has stable address (CMIIW though). And also, I never returns any Ref back to the user so the RefCell's borrow and borrow_mut will never panic.
  3. Is there a better way to implement this? I know we can use some kind of trie data structure to reduce the space complexity, but I'm not interested in that for now.
use indexmap::IndexMap;
use std::cell::RefCell;

#[derive(Default, Hash, PartialEq, Eq, Debug, Clone, Copy)]
pub struct SymbolId(usize);

pub struct SymbolContext {
    symbols: RefCell<IndexMap<String, SymbolId>>,
}

impl SymbolContext {
    pub fn new() -> Self {
        Self {
            symbols: RefCell::new(IndexMap::new()),
        }
    }

    pub fn intern<T>(&self, symbol: T) -> SymbolId
    where
        T: AsRef<str>,
    {
        let symbol = String::from(symbol.as_ref());
        let mut sym = self.symbols.borrow_mut();
        let new_id = SymbolId(sym.len());
        let id = sym.entry(symbol).or_insert(new_id);
        *id
    }

    pub fn resolve(&self, symbol_id: &SymbolId) -> &str {
        let s: *const str = self
            .symbols
            .borrow()
            .get_index(symbol_id.0)
            .map(|(s, _)| s.as_str())
            .expect("invalid state, lookup invalid symbol");

        // TODO: check if this is ok!!!
        unsafe { &*s }
    }
}

fn main() {
    let interner = SymbolContext::new();
    let a = interner.intern("string1");
    let b = interner.intern("string2");
    let c = interner.intern("string1");
    let x = interner.intern("string3");
    let y = interner.intern("string4");
    let z = interner.intern("string3");
    println!("{:?} {:?} {:?}", a, b, c);
    println!("{:?} {:?} {:?}", x, y, z);
    println!("{}", interner.resolve(&a));
    println!("{}", interner.resolve(&b));
    println!("{}", interner.resolve(&c));
    println!("{}", interner.resolve(&x));
    println!("{}", interner.resolve(&y));
    println!("{}", interner.resolve(&z));
}

(Playground)

Output:

SymbolId(0) SymbolId(1) SymbolId(0)
SymbolId(2) SymbolId(3) SymbolId(2)
string1
string2
string1
string3
string4
string3

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 0.80s
     Running `target/debug/playground`

Thank you.

The unsafe seems fine.

There's two blog posts I point people to about string interners:

(Disclaimer: the latter is my article)

1 Like

Those won't, but the expect() in resolve() will. The current API offers no way to ensure that you can't give resolve() a symbol obtained from a different SymbolContext instance.

Also, I believe that this is technically undefined behavior: SymbolContext::intern() creates a mutable reference into symbols, but SymbolContext::resolve() also returns an immutable reference into symbols. So if you interleave these in the way it's demonstrated in the playground above, so that the liveness regions of the two references overlap, then you just got yourself an aliasing &mut, of which the mere existence is instant UB.

2 Likes

Really, it's depending on the implementation detail of String that moving it or taking a mutable reference to it does not invalidate pointers into its internal buffer. To avoid this assumption from ever becoming incorrect, I'd probably create my own Symbol wrapper over a definitely-valid raw pointer (Rust Playground):

struct Symbol(*mut str);

unsafe impl Send for Symbol {}
unsafe impl Sync for Symbol {}

impl Symbol {
    fn new(value: &str) -> Self {
        Self(Box::into_raw(Box::from(value)))
    }
}

impl Drop for Symbol {
    fn drop(&mut self) {
        drop(unsafe { Box::from_raw(self.0) });
    }
}

impl Deref for Symbol {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        unsafe { &*self.0 }
    }
}

impl PartialEq for Symbol {
    fn eq(&self, other: &Self) -> bool {
        **self == **other
    }
}

impl Eq for Symbol {}

impl Hash for Symbol {
    fn hash<H>(&self, state: &mut H)
    where
        H: Hasher,
    {
        (**self).hash(state)
    }
}
1 Like

No-no, that's not what I'm talking about. I understand that the stored string/byte array doesn't move when the String is mutated (without reallocation) or moved.

My concern is that the provenance of the returned &str is from the IndexMap (through the locking Ref, and then the RefCell), so it ultimately immutably borrows the IndexMap. But when intern() is called, it also creates a mutable reference to the same IndexMap, and the co-existence of a shared and a mutable borrow to the same object is UB. This is one of the reasons why the lifetime of the reference returned by Ref is tied to that of the guard itself, and not to the RefCell (or the value wrapped in it). The code in the OP violates this condition.

This is accurate... but while the lifetime of the returned &str is tied to that borrow, the validity of the borrow isn't necessarily.

The airtight solution is to store some custom type wrapping raw pointers. Anything else is relying on implementation details of String and formally undecided properties of the dynamic borrowing rules.

But this usage is extremely likely to be okay. (Though not 100% guaranteed!) The actual provenance of the returned &str is unlikely to be entangled with the borrowing of the IndexMap or even String owning it. Dynamically, the &str borrows from the heap pointer that String manages, and is invalidated only by further use of that pointer. (Consider if e.g. the String instead held &'static str; giving out a longer-lived reference than the &String would then obviously be okay.)

Caveat! Remember, I said this relies on String implementation details. Only ever using the String by-ref isolates you from the majority of them, but it's perfectly allowable for moving the String to cause the invalidation of any derived &str pointers. This actually happens with Box today.

I say it's unlikely that String will do this, but it theoretically could[1], so you're implicitly relying on it not doing so.


  1. Sketch: record your address each time a &self method is called by self as usize and shared mutability. When the address changes, you know that all previous borrow lifetimes you handed out have necessarily ended, and thus can justify a mutable reborrow through the pointer that will invalidate previously derived references. ↩ī¸Ž

2 Likes

Yeah, I read the first one already. Yours is really nice though, but at this point, performance and memory usage is not my primary concern. I'm still struggling with the borrow checker and using unsafe safely.

Also, I want to make a version that accepts &self instead of &mut self. It seems every string interner article I found (Including yours) always use &mut self to intern a string.

Ah yes, that's right. I probably need to return Option<&str> isntead. Thank you for pointing this out.

Interesting, this seems quite hard to spot.
Not sure if this is true, though. I understand the returned &str is "technically" borrowing the IndexMap. But, I tricked it already by using unsafe.

But, let's say it is UB, how do I solve this? I guess I could change it into IndexMap<*const str, ...> and then create custom drop implementation to drop it. But, isn't this the same thing? Basically the returned &str will still "technically" borrow the IndexMap. Or do you have other suggestion?.

Here is my updated code to store pointer instead of String: Rust Playground

Yeah, that's true. I think to be really safe, I need to use Box as you suggested.

Important no, using Box<str> is worse than String, and probably would cause UB.

You want a custom struct Symbol(*mut str); which stores Box but into_rawed, to remove any borrowing behavior beyond purely the raw pointer.

strena does because it's necessary, since it uses a single continuous String buffer.

Other interners which do a similar trick of putting multiple interned strings into the same allocation often have a similar limitation, even if they don't ever reallocate strings; pushing to a string does invalidate previously derived references for sure.

I have another published interner which takes &self on interning and also contains a chart comparing most of the notable string internment options, called simple-interner

Of course, the simple, easy, and safe way to make an interner is to use Arc<str>.

1 Like

Oh sorry, I meant using Box::into_raw to create raw pointer *mut str, not a Box<str>.

This one looks promising, thanks. I'll check this.

This seems to run into a broader problem, namely that safe Rust has few methods to talk about:

  • splitting a reference to a container into references to the shape and to the element values, and
  • reasoning about what container operations will invalidate existing references to container elements.

Whether this is a limitation of the type system being unable to express these things, or just the ecosystem not having enough interest in the problem, I don't know. But it leads to the same pattern being re-implemented many times over (I don't mean the string interning pattern specifically, but the application of these concepts to arbitrary containers).

The second problem is perhaps the simpler one. It would appear that IndexMap never invalidates references unless you remove the particular element. But it also does not appear to make any such guarantees, so when you assume that the reference returned by resolve is valid for the lifetime of SymbolContext, you're relying on implementation details of IndexMap. If it does make this guarantee, it isn't expressed in the type system, so necessarily requires unsafe.

The first problem seems to be intractable. I don't think the type system can help here. The mutable alias rules prevent us from even forming multiple mutable references, from which the same location in memory may be reachable, even if we we know our "aliasing" references in fact are never used to reach an identical location. simple-interner does it via an unsafe wrapper around a pointer, same as suggested in this thread:

/// A wrapper around box that does not provide &mut access to the pointee and
/// uses raw-pointer borrowing rules to avoid invalidating extant references.
///
/// The resolved reference is guaranteed valid until the PinBox is dropped.
struct PinBox<T: ?Sized> {
    ptr: NonNull<T>,
    _marker: PhantomData<Box<T>>,
}

impl<T: ?Sized> PinBox<T> {
    #[allow(unsafe_code)]
    unsafe fn as_ref<'a>(&self) -> &'a T {
        self.ptr.as_ref()
    }
}

pub struct Interner<T: ?Sized, S = RandomState> {
    arena: RwLock<HashMap<PinBox<T>, (), S>>,
}

I know this is a code review post, and my response is not exactly in that spirit. However, take it to mean - this problem (of answering 'is my unsafe sound here') is in fact Very Hard.

As it happens, I've recently been working on a crate to help with these problems, based on a simple unsafe primitive that allows the address of a heap object to be moved around even while its referent is being borrowed. The main hard part is trying to wrap it up into an ergonomic API; I'm currently thinking of doing something based on qcell. If everything works out, it should be sufficient for things like this interner, as well as the guest_cell crate which originally motivated it.

1 Like