When Does Interior Mutability Make Sense?

I often work in programming language with GC where there is no mutable/shared reference concept. I'm just starting Rust, and come across various problems where I need to use interior mutability with RefCell a lot.

For example, I'm writing a very simple compiler, and I have StringInterner that used to assign id to strings (so that I can just pass an integer instead of &str). The API looks like this:

struct StringInterner {}

impl StringInteterner {
    pub fn intern(&mut self, s: String) -> SymbolId { // SymbolId is basically just a wrapper around usize.
        ... 
    }
    pub fn get(&self, symbol_id: &SymbolId) -> Option<&str> {
        ...
    }
}

The thing is, I need to pass in this StringInterner mutabiy into various places. For example, I have a AstLoader and PackageLoader, PackageLoader depends on AstLoader, and both of them depend on StringInterner, and I ended up having something like this:

struct PackageLoader<'a> {
    string_interner: &'a mut StringInteter,
    ast_loader: &'a mut AstLoader<'a>, // I need a reference because AstLoader is used somewhere else as well.
}
struct AstLoader<'a> {
    string_interner: &'a mut StringInterner,
}

As you can see, the dependency graph probably looks like this:

           ------------------------- StringInterner
          |                                |
          v                                |
PackageLoader <----- AstLoader <------------

And this doesn't compile because we can't borrow StringInteger for both PackageLoader and AstLoader. The way I workaround this is by changing the API of StringInterner into this:

impl StringInterner {
    pub fn intern(&self, s: String) -> SymbolId { ... }
    pub fn get(&self, symbol_id: &SymbolId) -> Option<&str> { ... }
}

Now, I no longer have to use &mut StringInterner. I can just use &StringInterner. Internally, StringInterner will use some kind of interior mutability with unsafe block. But since StringInterner is append only, I can the address of every string stable (thus, calling intern after get won't invalidate get's returned reference).

My questions are,

  1. when does it ok to have interior mutability? I feel like I'm using it a lot, and I'm not sure whether it's an antipattern or not. Additionally, I need to use unsafe quite often and I'm not sure if I don't break any rule.
  2. Where should I put the interior mutability? Should the StringInterner itself handle the interior mutability or the user of the StringInterner that should wrap it?
1 Like

You may want to look into gc crate for GC implementation.

When dealing with potentially cyclic data structures (and when you implement a programming language, the user at any point could have a cyclic data structure) having to use inner mutability is normal.

As for unsafe, it's ideally to be avoided. Runtime cost of checking borrow flag in GcCell shouldn't affect performance too badly.

Whenever you need mutable access from multiple aliases. Your use-case seems to fit this description.

It's not an antipattern. But it can be hard to wrap your head around it sometimes.

That is somewhat alarming. I guess you rely on RefCell::as_ptr a lot? I wouldn't do that. Try sticking to the safe API RefCell offers, e.g. borrow/borrow_mut.

Personally I like to bury interior mutability (and similar access patterns, like synchronized access in multithreaded situations) as deep down as I can and as close to the corresponding data as possible, which makes for a better API in my opinion. So I'd put it inside StringInterner[1], rather than having RefCell<StringInterner> inside PackageLoader and AstLoader.


  1. Which is just a wrapper around a HashMap, I'm assuming? ↩︎

4 Likes

Not really as_ptr per se. But sometimes, I need to get a shared reference (&T) of the type inside the Ref. The way I do it is by converting it into pointer and dereference it back into reference. Since the pointer is inside a Box, I assume it has stable address and thus ok to be converted like this. Here is the code example:

#[derive(Hash, PartialEq, Eq, Debug, Clone, Copy)]
pub struct SymbolId(usize);

pub struct SymbolContext {
    internal: RefCell<SymbolContextInternal>,
}

impl SymbolContext {
    pub fn new() -> Self {
        Self {
            internal: RefCell::new(SymbolContextInternal::new()),
        }
    }

    pub fn intern<T>(&self, symbol: T) -> SymbolId
    where
        T: AsRef<str>,
    {
        self.internal.borrow_mut().intern(symbol)
    }

    pub fn get<'a>(&'a self, symbol_id: &SymbolId) -> Option<&'a str> {
        let internal = self.internal.borrow();
        let Some(s) = internal.get(symbol_id) else {
            return None;
        };

        let s: *const str = s;

        // Safety: it is safe to convert `s` into a reference since we know
        // s has a stable address because it's behind an Rc. And we also know
        // s' lifetime is 'a.
        unsafe { Some(&*s) }
    }
}

struct SymbolContextInternal {
    symbols: HashMap<Rc<str>, usize>,
    index: Vec<Rc<str>>,
}

impl SymbolContextInternal {
    pub fn new() -> Self {
        Self {
            symbols: HashMap::new(),
            index: Vec::new(),
        }
    }

    pub fn intern<T>(&mut self, symbol: T) -> SymbolId
    where
        T: AsRef<str>,
    {
        if let Some(symbol_id) = self.symbols.get(symbol.as_ref()) {
            SymbolId(*symbol_id)
        } else {
            let new_id = self.index.len();
            let symbol: Rc<str> = symbol.as_ref().into();
            self.symbols.insert(symbol.clone(), new_id);
            self.index.push(symbol);
            SymbolId(new_id)
        }
    }

    pub fn get(&self, symbol_id: &SymbolId) -> Option<&str> {
        self.index.get(symbol_id.0).map(|sym| sym.as_ref())
    }
}

You're circumventing the fundamental "1 exclusive XOR 1..n shared" invariant enforced by RefCell at runtime by letting a reference escape the borrow like that. It is only sound as long as you never ever call borrow_mut on the same RefCell while any such reference lives. And because it's the caller of get who has to bear the burden of maintaining the invariant, SymbolContext::get is not a safe function to call and should be marked as unsafe.

2 Likes

Note that this is largely a seperate problem from interior mutability. You need interior mutability to implement an interface which takes only &self, so that you can share the reference in many places and call any method from any one place. This is fine and exactly the purpose of interior mutability.

However to have your interner give out references like this, no operations on your StringInterner may cause the backing container to reallocate. The rust ecosystem simply has no language to talk about these sorts of facts. There is Pin but you need unsafe to exploit any of its properties so it isn't useful for your case. Your usage of unsafe here is probably fine, and necessary, but fundamentally relies on an assumption not expressed by the type system (which is why it needs unsafe) - that Rc gives stable references to its contents; and that you must not remove any elements from your container.

You should be careful of passing the responsibility of managing the RefCell borrows to the caller. In your case it's possible to hide the cell under an interface which will never panic (due to failing to borrow the cell). Hiding the cell here greatly simplifies the contract of your type.

In particular, for this kind of signature, suppose that the AsRef implementation holds a reference to the same cell and tries to recursively borrow it (unlikely but technically possible). Bad things will happen! Better to hide this method from the interface as you've done.

I see, that makes sense. Thank you.

Actually, I cloned the whole string in the symbol, lol. So, the actual string that is stored in my string interner has a its own address. So, I guess that's fine? :thinking:

I see, got it. It just feels weird for me that this kind of problem seems trivial and occured quite a lot in my code, but somehow I need to use unsafe just to make it work.

Not sure that's true, though :thinking: . The intern method doesn't return the mutable reference, so there will never be more than 1 mutable borrow. And get is also the same, it doesn't give the Ref to the caller, so by the time get returned, there is no shared borrow. And the reference returned in get is safe to use because it has stable address (because it's behind an Rc) and never deallocated.

The Rc that's behind the RefCell is shielding you. If you only used Box, say, there would be no UnsafeCell between your temporary RefCell-generated &mut and the strs potentially being aliased by the caller at the same time.

Here's a two-line change to insert and a subsequent example of definite UB (in the playground):

+        self.symbols.clear();
+        self.index.clear();
         if let Some(symbol_id) = self.symbols.get(symbol.as_ref()) {

Note that my change used no unsafe. But Rust programs can only hit UB due to invalid or improperly contained uses of unsafe [1]! The only way you wouldn't need unsafe for this is if Rust didn't have the guarantees that it does [2]. Your code is only sound when intern meets the safety requirements in get, and that's something the compiler can't prove on its own.

Hence, unsafe: the ability to do things the compiler can't prove are sound, but it's up to you to ensure the requirements are met.


  1. or a compiler bug ↩︎

  2. or if it grew global reasoning and threw out the concept of the function API being the contract ↩︎

3 Likes

This doesn't really require a stable address, but instead a non-mutable-aliasing guarantee. If you're only setting each slot once, then you can get this by using OnceCell¹ instead of RefCell. I put together a (possibly over-complicated) implementation of your API that doesn't require any unsafe:

Edit: Simplified example

use once_cell::unsync::OnceCell; // 1.17.0
use std::hash::{Hash,Hasher,BuildHasher};
use std::collections::hash_map::RandomState;

#[derive(Copy,Clone,Debug,Eq,PartialEq)]
pub struct SymbolId(usize);

#[derive(Debug)]
pub struct SymbolTable<T, const CAP:usize> {
    state: RandomState,
    symbols: [OnceCell<T>; CAP],
}

impl<T:AsRef<str>, const CAP:usize> SymbolTable<T,CAP> {
    pub fn new()->Self {
        SymbolTable {
            state: RandomState::new(),
            symbols: std::array::from_fn(|_| OnceCell::new()),
        }
    }
    
    pub fn intern(&self, val:T)->SymbolId {
        let mut hasher = self.state.build_hasher();
        val.as_ref().hash(&mut hasher);
        let base = (hasher.finish() % (CAP as u64)) as usize;
        for i in 0..CAP {
            let i = (base+i) % CAP;
            let slot = &self.symbols[i];
            match slot.get() {
                Some(x) if val.as_ref() == x.as_ref() => {
                    return SymbolId(i);
                }
                None => {
                    assert!(slot.set(val).is_ok());
                    return SymbolId(i);
                }
                Some(_) => { continue; }
            }
            #[allow(unreachable_code)] { unreachable!() }
        }
        panic!("Symbol table is full!");
    }
    
    pub fn find(&self, sym:&str)->Option<SymbolId> {
        let mut hasher = self.state.build_hasher();
        sym.hash(&mut hasher);
        let base = (hasher.finish() % (CAP as u64)) as usize;
        for i in 0..CAP {
            let i = (base+i) % CAP;
            let slot = &self.symbols[i];
            if slot.get()?.as_ref() == sym {
                return Some(SymbolId(i));
            }
        }
        return None;
    }
}

impl<T,const CAP:usize> std::ops::Index<SymbolId> for SymbolTable<T,CAP> {
    type Output=T;
    fn index(&self, i:SymbolId)->&T {
        self.symbols[i.0].get().unwrap()
    }
}

fn main() {
    let symbols: SymbolTable<&'static str, 10> = SymbolTable::new();
    
    let a = dbg!(symbols.intern("a"));
    let b = dbg!(symbols.intern("b"));
    let c = dbg!(symbols.intern("c"));
    let d = dbg!(symbols.intern("d"));
    let e = dbg!(symbols.intern("e"));

    assert_eq!(a, symbols.intern("a"));
    assert_eq!(b, symbols.intern("b"));
    assert_eq!(c, symbols.intern("c"));
    assert_eq!(d, symbols.intern("d"));
    assert_eq!(e, symbols.intern("e"));

    assert_eq!(Some(a), symbols.find("a"));
    assert_eq!(Some(b), symbols.find("b"));
    assert_eq!(Some(c), symbols.find("c"));
    assert_eq!(Some(d), symbols.find("d"));
    assert_eq!(Some(e), symbols.find("e"));
    assert_eq!(None, symbols.find("f"));

    dbg!((symbols[a], symbols[b], symbols[c], symbols[d], symbols[e]));

    dbg!(symbols);
}
Original, more-complicated example
use once_cell::unsync::OnceCell; // 1.17.0
use std::hash::{Hash,Hasher,BuildHasher};
use std::collections::hash_map::RandomState;

#[derive(Copy,Clone,Debug,Eq,PartialEq)]
pub struct SymbolId(usize);

#[derive(Debug)]
struct Table<T>([OnceCell<Vec<OnceCell<T>>>;64]);

impl<T> Default for Table<T> {
    fn default()->Self {
        Self::new()
    }
}

impl<T> Table<T> {
    fn new()->Self {
        Table(std::array::from_fn(|_| OnceCell::default()))
    }
    
    fn get_subtable(&self, subtable: usize)->&[OnceCell<T>] {
        self.0[subtable].get_or_init(|| {
            let mut v = Vec::with_capacity(1<<subtable);
            for _ in 0..v.capacity() {
                v.push(OnceCell::default());
            }
            v
        })
    }
}

impl Table<()> {
    fn encode_index(subtable: usize, slot:usize)->usize {
        let prefix = 1usize << subtable;
        assert!(slot < prefix);
        prefix | slot
    }
    
    fn decode_index(idx:usize)->(usize,usize) {
        let subtable = idx.ilog2();
        (subtable as usize, idx &! (1<<subtable))
    }
}

#[derive(Default,Debug)]
pub struct SymbolTable<T> {
    state: RandomState,
    symbols: Table<T>,
}

impl<T:AsRef<str>> SymbolTable<T> {
    pub fn intern(&self, val:T)->SymbolId {
        let mut hasher = self.state.build_hasher();
        val.as_ref().hash(&mut hasher);
        for st in 0..64 {
            st.hash(&mut hasher);
            let st_ref = self.symbols.get_subtable(st);
            let idx = (hasher.finish() % (st_ref.len() as u64)) as usize;
            let slot = &st_ref[idx];
            match slot.get() {
                Some(x) if val.as_ref() == x.as_ref() => {
                    return SymbolId(Table::encode_index(st,idx));
                }
                None => {
                    assert!(slot.set(val).is_ok());
                    return SymbolId(Table::encode_index(st,idx));
                }
                Some(_) => { continue; }
            }
            #[allow(unreachable_code)] { unreachable!() }
        }
        unreachable!()
    }
}

impl<T> std::ops::Index<SymbolId> for SymbolTable<T> {
    type Output=T;
    fn index(&self, i:SymbolId)->&T {
        let (st,idx) = Table::decode_index(i.0);
        self.symbols.get_subtable(st)[idx].get().unwrap()
    }
}

fn main() {
    let symbols: SymbolTable<&'static str> = SymbolTable::default();
    
    let a = dbg!(symbols.intern("a"));
    let b = dbg!(symbols.intern("b"));
    let c = dbg!(symbols.intern("c"));
    let d = dbg!(symbols.intern("d"));
    let e = dbg!(symbols.intern("e"));

    assert_eq!(a, symbols.intern("a"));
    assert_eq!(b, symbols.intern("b"));
    assert_eq!(c, symbols.intern("c"));
    assert_eq!(d, symbols.intern("d"));
    assert_eq!(e, symbols.intern("e"));

    dbg!((symbols[a], symbols[b], symbols[c], symbols[d], symbols[e]));

    dbg!(symbols);
}

Rust Playground


¹ Currently an external crate, but is in the process of being added to std.

Not sure I got what you're saying.
Are you saying, if I were use Box, it's unsafe because I'm aliasing a box, and aliasing a Box is considered UB?

I think there is a difference between my implementation and yours. Mine doesn't have capacity limit, thus I need stable address. Because otherwise, the vector and the hashmap will rellocate my string into some random address and invalidate the references I returned on get method.
Your implementation actually also require stable address I think, but you handle it by using preallocated array instead of using Rc.
And another difference is, the complexity of intern and get method in your implementation is O(N) where N is the number of entries. If we want to make it O(1), we need hashmap, but that means we need to have stable address for the string so that we can share it with the hashmap and array.

Imho you're overcomplicating things with your unsafe. It's a rather tricky question, whether you can rely on reference stability for heap-allocated data. That could maybe make sense if you tried to store your interned strings inside of Boxes or some contiguous storage, but you already put each of them behind a Rc, which can be cloned cheaply. Just change your signature to directly return Rc<str>, like this:

    pub fn get(&self, symbol_id: &SymbolId) -> Option<Rc<str>> {
        let internal = self.internal.borrow();
        internal.get(symbol_id).cloned()
    }
2 Likes

I might actually try a very different approach of just passing a &mut StringInterner, &mut PackageLoader, &mut AstLoader, etc to whatever function call uses them. Possibly build some sort of Context<'a> struct that combines these &'a muts if they're commonly passed together. You seem to be trying to structure your program in a very Java-ish object-oriented way with long-living graphs of pointers, and I suspect that just isn't a very niceness-making architecture.

Aliasing memory reachable through a &mut is UB, with some exceptions like "unless it's in an UnsafeCell".

1 Like

Yeah, true. That was what my API initially look like. I returned the Rc and carry it everywhere.
But, it feels unidiomatic and have a little overhead if the caller ended up cloning the Rc many times. Even though in my case it is not a big deal, I want to learn to make it more idiomatic and less overhead.

I tried this actually. But the thing is. PackageLoader needs &mut StringInterner as well. So, I can't put it in the Context, because it will look like this:

    ----------------------- StringInterner
   |                                |
   v                                |
Contex <----- PackageLoader <-------

it won't compile because both Context and PackageLoader needs &mut StringInterner.

I'm also thinking about passing &mut StringInterner, &mut PackageLoader, &mut AstLoader one by one to the functions who need them. But it is very tedious. So, I'm trying different alternatives. But, this approach will be my last choice if I don't find any other solution.

And yes, this is my first time writing actual code in non-GC language. I'm still trying to figure out the pattern. In GC'd language, there are no restriction on cyclical reference, self referential struct, or even multiple exclusive reference. So, the pattern in Rust is really different from them.

I see. In my implementation, I uses RefCell outside the Rc. I believe RefCell uses UnsafeCell internally. So, that's ok, right?

It's certainly more idiomatic than writing manual unsafe. Also, Rc is super cheap to clone, it's just a single non-atomic integer increment. Even Arc is cheap to clone in most contexts.

The more practical issue is that, with shared object ownership, you no longer have a clear understanding when the object's lifetime ends and its memory is reclaimed. This may be an issue in more complex applications, and a source of memory leaks.

1 Like

My implementation is an append-only hash map, and I believe you have made a mistake in your analysis:

  • Index::index, my equivalent to your get(), is pretty clearly O(1) as it contains no looping or recursion; it’s just a straight table lookup: self.symbols[i.0].get().unwrap()
  • intern and find are O(N) in the worst case, but O(1) on average assuming a good hash function and a sufficiently low load factor— The internal loop will stop at latest when it finds an empty slot; if the load factor is 50%, this will average out to two slot inspections per call.

For a variable-capacity version, see my original example (hidden at the bottom of my previous post); it never resizes the buffers after initialization and so doesn’t run into the reallocation problem.

When I say you don’t need stable addresses, I mean that you don’t need the addresses to be stable beyond what’s already guaranteed by the existence of an &self reference. Your API allows the StringInterner to be moved or passed by value which will invalidate all of the references you’ve given out, which is the normal way that Rust references work— If you really needed stable addresses, you’d see lots of discussion of Pin in this thread.

Instead, what you really need is a container that can append through a shared reference so that you don’t run into the inner &mut forcing you to invalidate shared references before you want to. OnceCell is generally the primitive you want to use if you have a write-once-and-then-use-shared-references type of problem, and can be used to write these append-only structures.

1 Like

Actually, I think you're right.
I'm planning to make a cached version of this as well. And I definitely need Rc or Arc for that.

OH, that's right. I just realized you implemented a hash table.

Wow, that's a really smart technique to avoid reallocation. I think I get it. Yeah, I think that'll work.

I see. Yeah it seems the OnceCell is quite handy here.
I can't use RefCell because it returns Ref<T> instead of &T
And I can't use Cell because it needs T to be Copy
OnceCell seems perfect for my case.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.