Downgrade Mutable Borrow to Immutable Borrow

Hey everybody,

another lifetime question I'm afraid, any help is appreciated.
Basically I'm trying to implement a caching functionality in my struct, meaning that data is only requested once and then stored, any subsequent requests should only return the already cached data. I want this cache to own the data, and the callee just receives a reference.

A MWE would look like this:

struct Cache {
    data: Option<i32>, // init with None
}

impl Cache {
    fn get(&mut self) -> &i32 {
        if self.data.is_none() {
            let new_data = request();
            self.data = Some(new_data);
        }

        &self.data.as_ref().unwrap()
    }
}

fn request() -> i32 {
    32
}

The problem arises, when I access the data twice:

let mut c = Cache { data: None };
let data1 = c.get();
let data2 = c.get(); // borrow as mutable more than once at a time
println!("{} {}", data1, data2);

I do understand why this does not compile. The question is how to resolve this. Obviously I only need the mutable reference during the update of the local data memory, the returned value works with an immutable reference.

Following solutions I am considering (given that I'm quite new to rust):

  1. Missing lifetime annotations to achieve what I just described. To my understanding this would require having &'a mut self and &'b self as parameters, which doesn't work.
  2. My approach to implementing a cache is wrong. I don't see how to implement it otherwise, meaning how to split the mutable and immutable part, or how I would structure it differently
  3. & references are not enough. Although I'm not familiar with these yet, maybe I need sth like Ref or RefCell?
  4. My concept of ownership is wrong. Given that rust revolves around ownership and borrowing, I keep a close eye that data is not duplicated unnecessarily (I'm working a lot with C#, meaning I'm used to pass-by-reference calls instead of duplication). Meaning, I try to conceptualize which struct should own the data and who gets references. Perhaps, that's also not the way to go about it.

In my code the requested data is not a simple i32 but some kilobytes coming from a http request, hence this caching idea.

Thanks,
Plexian

Rust does not offer the option of downgrading a mutable borrow; it would require a distinct type of borrow with more complex lifetime rules to support this, because the current behavior of mutable reference lifetimes is actually required for soundness. Even if it did, though, this couldn't work because while data1 exists, you can't call c.get() again because the function signatures don't (and can't) tell the compiler that “because this was already called, the second call won't actually mutate anything”. So the compiler has to assume that data1 is invalidated.

So, forget about the idea of downgrading; the question is how to write your cache without using mutable references. The answer to that is OnceCell, a cell that can be written to exactly once, so filling it does not require &mut and it can always return an & reference to its contents.

use once_cell::sync::OnceCell;

struct Cache {
    data: OnceCell<i32>,
}

impl Cache {
    fn get(&self) -> &i32 {
        self.data.get_or_init(request)
    }
}

fn request() -> i32 {
    32
}

fn main() {
    let c = Cache {
        data: OnceCell::new(),
    };
    let data1 = c.get();
    let data2 = c.get();
    println!("{} {}", data1, data2);
}
10 Likes

@kpreid Beautiful, thanks a lot! So I wasn't that far off by suggesting RefCell.

A few follow-up questions:

  • Given that OnceCell now is not some Rust syntax but a struct within std, I could've replicated this myself if I hadn't known about this. After looking at the docs and the source code, I understand that such behaviour (less-restrictive borrowing) is only possible by using unsafe code blocks and explicit pointers. Is that correct?
  • Generally speaking about the std::cell module, do I have to worry about where my data is? I just assume now, the data is stored once in memory and owned by my struct, and any callees receive a immutable borrow with the get method, correct?
  • I seem to compare the concepts of pointers in C (or references in C#) and borrowing in Rust too much (given both use the ampersand). For now I feel like immutable borrows are like pointers, which the restriction of having no L-value (can't write to it) and an automatic cleanup t the end of its lifetime. I'm not sure if that's even the right way to think about that, any suggestions?
  • OnceCell is still a nightly feature. Intuitively, I had thought I won't need any of such features in the near future, but the more I read about rust the more common it seems to use such features. Is it just common by now to use nightly? Seems like its using the stable version anyway, unless any feature is explicitly activated with crate attributes.

Thanks again!

It depends on what you mean by "less restrictive". If you want to implement shared mutability, that's only possible using unsafe.

Worry in what sense? Data in a Cell/RefCell is just like any other data.

You are confusing many different things here.

  • First, both mutable and immutable borrows are pointers. They indirectly refer to some value in memory.
  • However, they are smarter than C pointers because the compiler can reason about their validity at compile-time, based on their lifetime parameter. The direct equivalent of C pointers in Rust would be the raw pointer types *const T and *mut T.
  • References (and raw pointers) don't cause the referred value to be dropped when the reference goes out of scope. That only happens when the owner of the value goes out of scope (e.g. a variable that you assigned a value). If references dropped their referent, that would be unsound. It would lead to use-after-free errors.

Did you check the link above? @kpreid suggested that you use the once_cell crate, not the nightly std-based OnceCell. That works fine with stable.

Generally speaking, you shouldn't need to use nightly for simple things like this. Stick to stable unless you really can't.

2 Likes

I guess I can't even respond to that. When I took my first steps with rust I didn't use any borrows at all, and learned that data is copied if needed (works because the types I used already implemented the copy trait). Now that my program handles some bigger data I wanted to avoid that, but probably just got overly cautious. To my understanding, as long as I don't slap the derive Copy attribute on my structs, nothing unwanted happens anyway. So no question, just a misunderstanding from my side.

I like the idea of "smart pointers", thanks! I saw that the OnceCell implementation uses such raw pointers, which makes sense to me how this is even possible to implement in Rust.

I have not, my bad. Just saw the keyword and googled myself. That means my intuition about nightly is correct, seemed weird to use preview features for such a simple thing.

One last thing: If I were to implement some sort of reset (cache invalidation in that sense), meaning that during runtime I set the data field back to OnceCell::new(), all previous references to the data behind the cell are now invalid. In a simple case like above, I would type

let mut c = Cache {
    data: OnceCell::new(),
};
let data1 = c.get();
let data2 = c.get();
c.data = OnceCell::new();
println!("{} {}", data1, data2);

This won't compile for obvious reasons, but can I assume that this is generally the case? Or might there be a scenario where I'd have to make sure as a developer, that all references are then reset as well, since the borrow checker can't find that. I'm unsure due to the usage of these raw pointers and unsafe sections.

In general, you can't create dangling pointers in safe Rust. If you manage to, then at least one of the unsafe abstractions you are using (either 3rd-party code or your own code) is/are unsound.

If you never use unsafe yourself, and only trust unsafe-using 3rd-party code that has been thoroughly tested and audited, then you will never need to worry about use-after-frees in Rust. (The unsafe in once_cell is most probably fine and sound.)

1 Like

OnceCell is safe — that is, you don't have to call any unsafe functions to use it. Therefore, you can expect that you cannot use it in a way that would cause invalid references to exist. (If that were possible, we would say that OnceCell is unsound, and recommend against using it until such time as the bug were fixed.)

In this particular situation, c.data = OnceCell::new(); requires mutable access to c (just like in the situation without any OnceCell, because at this level it is an ordinary assignment) and therefore you cannot perform that assignment while any borrows of the contents of the OnceCell exist.

If you wanted to be able to reset the cache without this conflict arising, then there are several possible paths:

  1. Use a more general interior mutability primitive that allows multiple writes, such as RefCell (or RwLock for thread safety). But this necessarily means you do not get to take a simple & borrow of the contents, because it's not okay for the contents to be borrowed while they might be modified, or modified while they might be borrowed.
  2. Don't take a & reference to the cache.
    1. Copy/clone the data out, instead of taking a reference. Thus, the data can outlive the current state of the cache, and is not tied to it by lifetime.
    2. Use Rc or Arc: Store an Rc<MyCachedData> inside the cache (i.e. OnceCell<Rc<i32>>, though that's wasteful in the simple case of an i32) so that the Rc can be cloned to keep the value that is or was in the cache alive independent of the cache.
  3. Combine 2.1 and 2.2 by using a RefCell<Rc<MyCachedData>> or Mutex<Arc<MyCachedData>> — this allows arbitrary changes to the cache and allows hanging onto the value the cache had at any time.
3 Likes

@H2CO3 @kpreid
Well that sounds great, thanks! Regarding the more advanced solutions, I doubt I'll need them. If I update the cache, I do not want any left-over borrows to the data, so having the compiler warn (error) me about that sounds just right.

Unfortunately, I got excited a little to fast; I made my MWE too simple, sorry. My cache does not have a single member, but rather a list of them, due to multiple request types I want to cache. Technically I can have a HashMap which maps the request parameter to the OnceCell, but that would mean the HashMap has to be initialized with all possible request parameter values to avoid a mutable reference in the get method. This works, but only because my request type is an enum and finite.

use std::collections::HashMap;
use once_cell::sync::OnceCell;

#[derive(PartialEq, Eq, Hash)]
enum RequestType { One, Two, Three }

struct Cache {
    data: HashMap<RequestType, OnceCell<i32>>,
}

impl Cache {
    fn get(&self, rt: RequestType) -> &i32 {
        self.data.get(&rt).unwrap().get_or_try_init(|| request(rt)).unwrap()
    }
}

fn request(rt: RequestType) -> Result<i32, String> {
    Ok(match rt {
        RequestType::One => 32,
        RequestType::Two => 129,
        RequestType::Three => -3,
    })
}

pub fn main() {
    let c = Cache {
        data: vec![  // this looks wrong to me
            (RequestType::One, OnceCell::new()),
            (RequestType::Two, OnceCell::new()),
            (RequestType::Three, OnceCell::new()),
        ] .into_iter().collect(),
    };

    let data1 = c.get(RequestType::One);  // Request
    let data2 = c.get(RequestType::Two);  // Request
    let data3 = c.get(RequestType::One);  // Cache
    println!("{} {} {}", data1, data2, data3);
}

But any other way, I need to have mutable access to the contents of whatever cell I use, which leaves me back at the initial problem again. I tried using a RefCell, but this understandably doesn't compile (cannot return value referencing temporary value):

struct Cache {
    data: RefCell<HashMap<RequestType, i32>>,
}

impl Cache {
    fn get(&self, rt: RequestType) -> &i32 {
        self.data.borrow_mut().get(&rt).unwrap()
    }
}

That sounds like you should take my option 3, then: store Rcs in the cache.

use std::cell::RefCell;
use std::collections::{hash_map::Entry, HashMap};
use std::rc::Rc;

struct Cache {
    data: RefCell<HashMap<RequestType, Rc<i32>>>,
}

impl Cache {
    fn get(&self, rt: RequestType) -> Rc<i32> {
        match self.data.borrow_mut().entry(rt) {
            Entry::Occupied(oe) => oe.get().clone(),
            Entry::Vacant(ve) => {
                let new_value = Rc::new(request(ve.key()).unwrap());
                ve.insert(new_value.clone());
                new_value
            }
        }
    }
}

Playground

This way, the cache is only ever borrowed (in any sense) while it is queried, not while the data from it is being used (because ownership of the data is shared, via Rc, between the cache and its clients), and you can add new entries whenever you like. Note that:

  • Rc<i32> is pointless; if your data is that small then you should just copy it out of the cache. I assume that i32 is a stand-in for some larger, more expensive structure for which the Rc makes sense.

  • If you need this cache to be used by multiple threads, substitute Arc for Rc and Mutex for RefCell.

3 Likes

That feels overly complicated for what I want, but I guess you are right.
I tried using data: HashMap<RequestType, OnceCell<i32>> as a type, and it works fine too, with the drawback of having to initialize the hashmap with all possible request types which becomes a little tedious, so yours looks better as well.

I understand the RefCell is now used because the contents are continuously updated due to the hashmap, and borrow_mut is not available on OnceCell (makes sense, thats the distinction between OnceCell and RefCell).
However I'm not quite sure about the usage of Rc here. Normal references can't be used, because the lifetime couldn't be guaranteed of the return value, as I said in my previous post (cannot return value referencing temporary value). I guess the Rc is just a way to circumvent this, and we can safely use clone because it just clones the reference, not the value behind it. Reading the documentation however, Rc is meant to be used if data has more than one owner, which is not really the case - this confuses me a little.
Moreover, now that that the method returns a Rc<i32>, I use Rc::as_ref() to get the &i32, which all subsequent methods expect, as they did before. Is this correct? Seems weird to be able to use a reference here, I don't understand why I needed the Rc to begin with (exceptthat my code didn't compile).

And yes, i32 is just a placeholder :slight_smile:

Note that a cache is a classic reason to use interior mutability. You could consider just making your method take &self, and use RefCell/Mutex/whatever internally.

Strictly speaking, if you have a single knowable fn(RequestType) -> ItemType, it is sound to define a fn Cache::get(&self, RequestType) -> &ItemType which both lazily constructs the map and only calls the constructor once per request is possible, but it's nontrivial.

It's not possible with HashMap<K, OnceCell<V>>, though, because when the hashmap grows, all of the map values may be relocated, and this must invalidate any references to the values.

The shape where it becomes possible looks like HashMap<K, Box<V>>, so that the value in the map doesn't move when the map resizes. Doing so is necessarily unsafe, but it's the pattern any interning crate implements. (I don't know of a crate implementing a mapped interner (a "OnceMap", perhaps); perhaps I'll implement one, since implementing interners seems to be a minor hobby of mine[1].)

If your cached values live essentially for the rest of the program's lifetime, though, and don't have a significant Drop handler that you want to run, you can Box::leak your values to get &'static ItemType, and then you can just have HashMap<K, &'static V> take mutable borrows and give out &'static V just fine.

Another safe possibility would be to directly use a map that allows concurrent modification, such as dashmap.

As a side note, if your RequestType is a simple enum and your map is dense, consider using enum-map instead of a hashmap.


  1. I've published one (simple-interner, the one linked above), contributed to two more, and wrote a small research one with a blog post (strena) that isn't published because it was the contribution leading to improving the two others with the technique/benefit demonstrated there. ↩︎

1 Like

Probably won't publish, but might once I've gotten around to fixing meta/config deficiencies in my other crates.

1 Like

As it happens, I've recently been working on a crate (owning-key) for implementing these kinds of things safely. The idea is to separate the addresses from the ownership of the underlying values. First, you create a Key, which lets you turn a Box<T> into a LockedBox<T> that remembers the Key it came from. Then, given a &'a Key and a &LockedBox<T>, you can retrieve a &'a T, and given a &'a mut Key and a &LockedBox<T>, you can retrieve a &'a mut T. This is sound, since no one else can access the T without invalidating the reference to the Key.

With that, you can safely use a HashMap<K, LockedBox<OnceCell<V>>> for this use case.

But I've been procrastinating on finishing and publishing the crate, since there are a few API details I've been struggling to nail down. In particular, if you drop the Key before turning the LockedBox<T> back into a Box<T>, then the value inside it will get leaked. So I wanted to create an advanced Key type that will remember the values it has locked, so that when dropped it will unlock and drop all of them. But this requires additional cooperation from the locked object type (since I'm trying to make this generic for LockedRc, LockedVec, etc.), to ensure that a value is not unlocked without being removed from the Key's list of values. And that requires attaching a persistent identifier to a locked value, which is tricky to make work generically with any user-defined type.

This is a case where the data has more than one owner. Each cached value is owned simultaneously by the cache and the user(s) of the cache that fetched that value. Using Rc or Arc enables either one to hang onto the value as long as it wants, or not, without interfering with the other owner (provided that neither wishes to mutate the value).

As others have noted, it's possible to build a cache that returns & references. Sometimes that's the right choice. However, that sort of cache can never be cleared (flushed), for example, and the Rc approach is simple and straightforward. (It's the same behavior you would implicitly get if you wrote a cache in Python, Ruby, Java, JavaScript, etc. except that it prohibits mutation, which is usually preferable.)

3 Likes

It would also rely on implementation details of the hashmap implementation, and that might be a bad idea. If I understood the code correctly, last time I looked at the hashbrown code, I was surprised to see that apparently upon an insert when re-hashing while growing the map fails, there are cases where the implementation will start to drop remaining not yet rehashed values in the map before/while panicking, which could easily result in soundness issues[1] if unsafe code is used to keep references into the targets of the Boxes in a HashMap<K, Box<V>> alive during an insertion operation.


  1. at least unless other countermeasures are put into place such as aborting on panics during inserts ↩︎

The whole point of using an Rc here is to artificially make the values have multiple owners. Since borrowing them directly from the cache doesn't work, you have to clone the values out of the cache upon access. Now this could potentially be expensive for arbitrary cloneable types, and also not all types are cloneable at all. The solution to both problems (expensive or impossible cloning) is to use Rc which is always cheaply cloneable.

There's nothing weird about this, it's perfectly logical. If you know you have an existing Rc, then that must have kept its referent alive. Furthermore, it heap-allocates, so moving an Rc doesn't move its referent. In fact, it would be impossible to implement an Rc if this were not true, and this is one of the reasons why it has to heap allocate. (The other main reason being shared ownership – one Rc couldn't possibly keep alive another, arbitrarily-scoped Rc's referent if that lived on the stack.)

The reason why you weren't able to directly return references from the cache is that they would have been invalidated had the cache been mutated in any way (insertion or deletion):

  1. In your first attempt, you took a &mut self in your insertion function and returned a &Value from it, which – correctly – tied the lifetime of the value to that of the cache, preventing a use-after-free bug, whereby reallocating the internal map could have moved its values, invalidating the outstanding reference.
  2. In your second attempt with RefCell, you couldn't return a reference directly, because a RefCell has to track all of its borrows for ensuring soundness (exclusive mutability), which is done via the Ref and RefMut RAII guard types. If you could obtain a reference directly to the contents of a RefCell, that would be unsound, because then there would be a reference that the cell can't keep track of. (You could then borrow it mutably and immutably at the same time, for example.)

Once you switch to ownership by cloning the value out of the cache, neither problem applies any more. But as far as the compiler is concerned, you still couldn't return a reference to the inside of the Rc directly. The compiler doesn't understand that an Rc heap-allocates its inner value and provides shared ownership; it only knows about the signature of its Deref or AsRef impl, which ties the borrow's lifetime to that of the specific Rc instance.

However, the trick is that by introducing the shared ownership, you can get arbitrarily many clones of the same Rc, return that clone by value, and only get out temporary references for the short time frames you actually need to do something with the pointed object. The compiler is now happy, because you never apparently borrow from the cache directly, so re-allocating it is safe.

Of course, this requires that Rc heap-allocates and doesn't invalidate the inner pointer when moved – accordingly, Rc's Deref/AsRef/etc. implementations are unsafe, they can't be verified by the compiler, because, as I mentioned, the compiler doesn't directly understand heap allocation.

3 Likes

I see that elsa does use a standard hash map. I will need to check whether this can be exploited unsoundly when I'm back at a PC next week.

Edit: The same thing applies to the simple_interner crate mentioned above in this thread.

Good to know, seems like interior mutability is more common than I anticipated.

That's what I was missing, thanks a lot. I use the Rc to have heap-allocated memory, which (unsafely) manages its contents. Getting values out of the hashmap now isn't tied to any lifetime reference, as I just clone the pointer to my heap memory (the Rc), and any local refs I use after (Rc::as_ref) are just bound to the clone of the Rc itself, which is perfectly fine. Lastly, I need the RefCell to avoid having a mutable ref to self when accessing the HashMap.

As of now I'm still unfamiliar with most of these traits (Deref, AsRef, Ref, RefMut, Drop), I'll do some more reading and experimenting with that in the future. My problem has been solved, mostly thanks to @H2CO3 and @kpreid , thanks again for your help. I'll mark a solution and tag this thread as resolved (if there is a setting for that).

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.