Memoize a function returning a borrow with the same lifetime as `self`

Hello,

I'm relatively new to Rust, so please bear with me if this thread is too basic. Consider a trait like this:

trait Mapper
{
    fn map<'a, 'b>(&'a self, key: &'b str) -> &'a str;
}

I won't bother you with the details of my concrete use case, but the basic idea is that this returns the address of a given service, identified by key in this example. Our first implementation just stored a HashMap prefilled with the address of all existing services, but that quickly became a burden to maintain. This implementation made it easy, since we can return a borrow for the address and rest assured that it will survive for as long as the Mapper itself.

The thing is that the the address of every service can be systematically built (either reading an environment variable, or constructed from the service name using some convention). So instead of pre-fill a HashMap, we want to switch to an "on-demand"/memoized version. We'd like to keep the signature for compatibility. The projects makes heavy use of async code, so we need some sort of synchronization mechanism (we choose RwLock for concurrent reads).

In this context, a naive implementation fails since you borrow the data from a lock object which gets dropped, so you get the "cannot return value referencing local variable" error.

I found two possible solutions:

(1) leak the address to make it 'static. The leak is mostly not a problem since the object instance usually lives for the whole program execution, but might become a problem if somebody creates short live Mapper objects.

struct MemoizedMapperLeak
{
    list: Arc<RwLock<HashMap<String, &'static str>>>,
}


impl Mapper for MemoizedMapperLeak
{
    fn map<'a, 'b>(&'a self, key: &'b str) -> &'a str {
        // Check if exists
        {
            let read = self.list.read().unwrap();
            
            if let Some(value) = read.get(key)
            {
                return value;
            }
            
            // Drop read lock
        }
        
        // Insert it
        {
            let value_owned = format!("computed-for: {}", key);
            let value_static = Box::leak( value_owned.into_boxed_str() );
            
            let mut write = self.list.write().unwrap();
            write.insert(key.to_string(), value_static);
            
            value_static
        }
    }
}

And (2) use unsafe transmutation to trick the borrow checker. This should not cause problems because the already inserted addresses are never changed, so borrows should be good for the entire lifetime of the object ('a). However, I do not know if this could backfire me in some way.

struct MemoizedMapper
{
    list: Arc<RwLock<HashMap<String, String>>>,
}


impl Mapper for MemoizedMapper
{
    fn map<'a, 'b>(&'a self, key: &'b str) -> &'a str {
        // Check if exists
        {
            let read = self.list.read().unwrap();
            
            if let Some(value) = read.get(key)
            {
                // This is safe because values are never mutated once inserted,
                // so they are guaranteed to exists for the whole 'a lifetime
                return unsafe {
                    std::mem::transmute::<_, &'a str>(value.as_str())
                };
            }
            
            // Drop read lock
        }
        
        // It didn't exist, so insert it
        {
            let value = format!("computed-for: {}", key); // Just placeholder for the real implementation
            
            // This is safe because values are afterwards moved to the list,
            // where they are never mutated nor dropped in the self ('a) lifetime
            let output = unsafe { std::mem::transmute::<_, &'a str>(value.as_str()) };
            
            let mut write = self.list.write().unwrap();
            write.insert(key.to_string(), value);
            
            output
        }
    }
}

See playground

Both leaks and unsafe are kinda spooky, so I wonder if there's some other more idiomatic approach.

Of course, a pragmatic solution could be change the signature and return an owned (cloned) string; and let the compiler tell us all the compatibility problems. There would be only a tiny performance loss due to the clone. But more important than that, as a learning opportunity I would like to know whether there is something fundamentally wrong with my candidate approaches (leak and unsafe transmute), and whether there is a less safe and clone-free option.

I don't think this problem is solvable with safe Rust. The concurrency isn't even the problem here, the &self is.

Think about is more generally: you have a HashMap<K, V>. You want to derive a &V from the hashmap (.get()), and that returns you a &'hashmap V. But you also want to mutate the hashmap (.insert()), and do that with a shared reference, which allows you to still hold a reference to some &'hashmap V.

There are two problems here:

  1. You may replace the previous value while you insert, effectively dropping the original V.
  2. Even if you never replace, the hashmap may need to move things around, and thus invalidating your reference.

The suggested to both of them is the same: either return a smart reference keeping the guarantees at runtime (RwLockReadGuard in your case), or .clone() the value. And if cloning is expensive (like it can be in your case), use Rc<V> or Arc<V>. Both prevent the first problem (because even if the hashmap will drop the value, the Arc will keep it alive), and the second (because Arc has an indirection, so moving it does not move the value). However, while Arc is possible here, it will require you to change the signature to return Arc<str>.

In your case, you don't replace values, neither if the String will be moved this will invalidate the &str reference (again, because there is an indirection), but the compiler doesn't know it.

The usual solution in this cases is either to drop to a less efficient approach (leaking, in your case), or, if not possible, to unsafe. I was unable to find a crate that does the unsafe for you, and I doubt I will, since this is a case that cannot be easily generalized to other value types (you need to specify that the type has an indirection so that moving it will not invalidate the reference. This can be implemented using an unsafe trait, but I doubt someone did that).

I would suggest to go with the leaking approach. In case it is not possible, while you definitely can make this sound (for example, by designing a custom type around str), I don't know if the current transmute() is. It sums down to the question whether String promises that an &str achieved through it will be valid even if it will move. I don't know the answer, so I'll leave it to experts :smiley:

Edit: BTW, the lifetimes are redundant - lifetime elision allows you to write fn map(&self, key: &str) -> &str;.

1 Like

It doesn’t support thread safety, but I managed an approach using an arena and a self-referential struct.

use std::{
    cell::RefCell,
    collections::HashMap,
    rc::Rc,
};

// This trait is fixed by "powers that be"
trait Mapper {
    fn map<'a, 'b>(&'a self, key: &'b str) -> &'a str;
}

use ouroboros::self_referencing;
use typed_arena::Arena;

#[self_referencing]
#[derive(Default)]
pub struct MemoizedMapper {
    arena: Box<Arena<String>>,
    #[borrows(arena)]
    #[not_covariant]
    list: Rc<RefCell<HashMap<String, &'this str>>>,
}

trait Lt<'l> {}
impl<'l, T: ?Sized> Lt<'l> for T {}

impl MemoizedMapper {
    fn map<'a, 'b>(&'a self, key: &'b str) -> &'a str {
        fn f<'a>(
            key: &str,
        ) -> impl Lt<'_> + FnOnce(ouroboros_impl_memoized_mapper::BorrowedFields<'a, '_>) -> &'a str
        {
            move |this| {
                let mut list = this.list.borrow_mut();
                match list.get(key) {
                    Some(value) => value,
                    None => {
                        let value = format!("computed-for: {}", key);
                        let value = this.arena.alloc(value);
                        list.insert(key.to_string(), value);
                        value
                    }
                }
            }
        }

        self.with(f(key))
    }
}

fn main() {
    let mapper = MemoizedMapper::default();

    assert_eq!("computed-for: a", mapper.map("a"));
    assert_eq!("computed-for: b", mapper.map("b"));
    assert_eq!("computed-for: c", mapper.map("c"));
    assert_eq!("computed-for: a", mapper.map("a"));
    assert_eq!(3, mapper.with_list(|list| (**list).borrow().len()));
}

Unless I’m missing a different crate that offers an appropriate arena type, making the above thread-safe is blocked on

https://github.com/SimonSapin/rust-typed-arena/issues/28

because one would need something like a SyncTypedArena.


Edit: Here’s a signature for f that works, too, and doesn’t require the Lt trait

        fn f<'a: 's, 's>(
            key: &'s str,
        ) -> impl 's + FnOnce(ouroboros_impl_memoized_mapper::BorrowedFields<'a, '_>) -> &'a str
2 Likes

:rofl: When I saw this, I immediately thought "well, this probably can't be solved with Rust, however if we had self-referential structs... No, self referential structs are unsafe always". I know about the crates for self referential types, they just haven't came into my mind. Apparently, I haven't used them in real projects :smiley_cat:

Thanks @chrefr, for the insight on the problem. It was very useful.

Thanks @steffahn. I think I understand the concept: the arena provides the storage for the strings, linked to the lifetime of self; and the ouroboros crate provides a way to store a reference to the string in list field. This gave me the idea of using a string interner. The lasso crate offers a thread safe variant, and there's also the arc-interner crate. Not sure if It'd work, but may be worth the try.

1 Like

So, for a bit of context / to try and understand the "intuition vs. Rust reality" mismatch: if you returned a &'a String instead of a &'a str, for instance, your intuition would be wrong: when inserting extra elements into the map, the other entries in the map may get moved, so the &String address is not guaranteed to be stable.

That being said, thanks to the extra level of indirection String ≈ Box<str>, the address to the sequence of UTF-8 bytes themselves (the str) is indeed fixed, provided each stored String itself is not mutated nor dropped, which it isn't, since you never touch an old entry.

Hence why that usage of unsafe ought to be sound, but it is also why it is so hard to express this in Rust: there is no direct / clean way to express that if something (such as a String) is moved, then the str pointee, itself, doesn't move. Worse, because of high-level and very strong aliasing (or rather non-aliasing) guarantees, it is technically UB to get hold of a &str when the Box<str> that owns the str is being moved around! (this is what makes the self-referential crates out there, such as ::owning-ref, and maybe ::ouroboros as well, technically UB).


There are some possible workarounds for these issues:

  1. the first workaround is to use a collection which guarantees references to previously-stored elements not to dangle after further reallocations, featuring one of the two following kind of APIs:

    • either (&mut Collection<'coll>, Item) -> &'coll mut Item,

    • or (&'coll Collection, Item) -> &'coll mut Item (through shared mutability).

    Such special collections do exist, and their very specific design has a name: arenas (in practice, the second API is the one widespread, since it's a bit simpler to use, lifetime-wise).

  2. The "second" workaround is a variation of the first one, regarding which you are totally right, @glueball: interners, which are quite similar to arenas, but for often yielding a magical index / key rather than an actual Rust reference (there is a generalization of this pattern in ::slab).

  3. A more primitive variation of the above things would be to store Arc<str>s rather than Strings (you can .into() from the latter to the former) so as to be able to simply Arc::clone() the guard innards and yield that instead. Simple, cheap, and handy.

  4. Another workaround is not directly answering your OP, but is a nice XY solution to keep in mind sometimes: the type &'lt Referee can often be "ducktyped" by a impl 'lt + Deref<Target = Referee>; it will just require that call-sites sometimes sprinkle a few extra &* here and there.

    And at that point you can start yielding lock-holding guards, which enables a whole set of solutions. In this instance, usage of ::dashmap, an efficient RwLock<HashMap<…>>, of sorts, could be interesting.

  5. Finally, another option is to replace the usage of guards (and with them, the need to sprinkle &*) with the CPS (Continuation-Passing Style) pattern, that is, to have the following API:

    - fn get_thing (&'_ self) -> &'_ Item
    - fn get_thing (&'_ self) -> impl '_ + Deref<Target = Item>
    + fn with_thing (&self, yield_: impl FnOnce(&Item))
    // that is:
    + fn with_thing (&self, yield_: impl for<'local> FnOnce(
    +     &'local Item
    + ))
    
    • A slightly better signature

      One which allows the callback to return the result of some computation:

      fn with_thing<R> (
          self: &'_ self,
          yield: impl for<'local> FnOnce(
              &'local Item /* yielded value */
          ) -> R,
      ) -> R
      

    so that instead of doing:

    let thing = coll.get_thing();
    let thing = &*thing;
    stuff…
    

    or, directly:

    match &*coll.get_thing() { thing => {
        stuff…
    }}
    

    with the with_thing() API, you'd write:

    coll.with_thing(|thing| {
        stuff…
    })
    

    The implementation is then quite simple:

    fn with_thing<R> (&self, yield_: impl FnOnce(&Item) -> R) -> R
    {
        let guard: Ref<'_, Item> = { … };
        yield_(&*guard)
    }
    

    Both of these things can be made more ergonomic through:

  • #[with('local)]
    fn thing (&self) -> &'local Item
    {
        let guard: Ref<'_, Item> = { … };
        &*guard
    }
    
    #[with]
    fn some_call_site ()
    {
        #[with]
        let thing: &Item = coll.thing();
        stuff…
    }
    
7 Likes

My favourite solution to these kinds of problems is @Yandros (2), where you don't offer the lifetime guarantee of a reference, and instead give an object that, given the collection, knows how to find the element (these objects can be very small). You lose the runtime guarantee that the collection is still around, and so have an extra fail condition, but you can just say that if that condition is hit it is a programmer error and panic.

A very similar strategy is @Yandros (3). The advantage here is that if you grab a reference you know that the collection item will live as long as your reference. The trade-off is that cloning is slightly more expensive (re. the token) and depending on your application you could use more memory, which may or may not be a problem. If you want to invalidate your references when the collection is dropped, then you can use a rc::Weak or sync::Weak, and you're functionally equivalent to (2) again.

2 Likes

If your string values don't need to be mutated, what about storing an Arc<str> instead of String internally? That way you can bump the reference count and return Arc<str> instead of &'a str, side-stepping the lifetime issues and also allowing people to hold onto the string for as long as they need to.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.