Is it sound to push to a Vec (within UnsafeCell) when elements are borrowed?

I'm trying to create a data structure in which to store cached strings. My inclination is to store the strings in a Vec<String> and push new allocations to the end of the vector, returning an immutable &str reference as follows:

#[derive(Default)]
pub struct StringArena {
    allocs: UnsafeCell<Vec<String>>,
}

impl StringArena {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn push(&self, s: String) -> &str {
        let allocs = self.allocs.get();
        unsafe {
            (&mut *allocs).push(s);
            let len = (&*allocs).len();
            (&*allocs).get_unchecked(len).deref()
        }
    }
}

However, I'm not sure if this is a sound use of UnsafeCell since I am creating a mutable reference to the cell's interior while outstanding immutable &str borrows of the individual strings exist.

For instance, in the following code, the second call to arena.push(...) mutates the vector while str1 exists:

let arena = StringArena::new();
let str1 = arena.push("Hello".to_string());
let str2 = arena.push("World".to_string());

I am inclined to think my code is sound since calls push(...) don't directly affect existing &str references, as long as I only access the vector through this API. However, I don't know the specifics of Rust's aliasing rules, so I would appreciate clarification on whether my code invokes undefined behavior. Thank you!

1 Like

For what it's worth, Miri does not identify any UB in this test program:

fn main() {
    let arena = StringArena::new();
    let s1 = arena.push("a".into());
    let s2 = arena.push("b".into());
    println!("{} {}", s1, s2);
}

(This doesn't provide a definite answer, but it's a useful first step.)

There’s definite UB as written but not because of aliasing; len needs to be read before pushing to the vec (or else indexing at the end should be len - 1).

In terms of aliasing violations, it’s an interesting question: the code does grab a &mut to the vec while there are outstanding shared refs to data (or “paths”) reachable from the vec. As far as I know, that’s UB from an aliasing perspective. But I might be wrong and/or there’s some subtle difference in this example. The &mut to the vec doesn’t escape anywhere though so that may mitigate things a bit.

1 Like

I'm pretty sure you meant .get_unchecked(len - 1) here. As-is it currently doesn't output anything in playground: Rust Playground

And MIRI catches it:

error: Undefined Behavior: type validation failed: encountered uninitialized raw pointer at .pointer
   --> /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:223:9
    |
223 |         self.ptr.as_ptr()
    |         ^^^^^^^^ type validation failed: encountered uninitialized raw pointer at .pointer
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
            
    = note: inside `alloc::raw_vec::RawVec::<u8>::ptr` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:223:9
    = note: inside `std::vec::Vec::<u8>::as_ptr` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:1070:19
    = note: inside `<std::vec::Vec<u8> as std::ops::Deref>::deref` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:2002:40
    = note: inside `<std::string::String as std::ops::Deref>::deref` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/string.rs:2132:43
note: inside `StringArena::push` at src/main.rs:18:13
   --> src/main.rs:18:13
    |
18  |             (&*allocs).get_unchecked(len).deref()
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `main` at src/main.rs:26:14
   --> src/main.rs:26:14
    |
26  |     let s1 = arena.push("a".into());
    |              ^^^^^^^^^^^^^^^^^^^^^^
    = note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
    = note: inside `std::sys_common::backtrace::__rust_begin_short_backtrace::<fn(), ()>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:125:18
    = note: inside closure at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/rt.rs:66:18
    = note: inside `std::ops::function::impls::<impl std::ops::FnOnce<()> for &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>::call_once` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:259:13
    = note: inside `std::panicking::r#try::do_call::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:379:40
    = note: inside `std::panicking::r#try::<i32, &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:343:19
    = note: inside `std::panic::catch_unwind::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:396:14
    = note: inside `std::rt::lang_start_internal` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/rt.rs:51:25
    = note: inside `std::rt::lang_start::<()>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/rt.rs:65:5

So maybe provide the demo as a playground link, so it's self-contained?

1 Like

This does smell like UB, but what I find even more concerning is the push() method taking &self. Why is that?

Furthermore, if you do indeed need interior mutability over nontrivial types, you should probably try RefCell before reaching for unsafe code.

Well, because the lifetimes of the returned &str is coupled to the lifetime of the StringArena by shared borrow, you cannot require &mut Self without requiring the user to give up their &str from previous allocations. I think this interior mutability approach is not too bad.

Similarly, RefCell won't work because.. well.. multiple reasons, basically: if RefCell worked, OP probably wouldnĘĽt have asked their question in the first place.

Judging by the fact that a String contains a pointer to its data and the call to push on the Vec should never actually be able to dereference the String, I personally donĘĽt think that we have UB here (as long as the len vs len-1 indexing error is fixed, as pointed out by others).

It does feel a bit questionable to be relying on the implementation details of String here. So Iʼm not 100% sure that this will never become UB in the future in case that String changes in a way such that the existence of a &mut String automatically asserts unique access to the contained str without anyone dereferencing it. (Admitted, currently push on a Vec<T> doesnʼt even create a &mut T either as far as Iʼm aware, instead directly using some kind of memcopy in case of resizing, but thatʼs feeling like an implementation detail that could—not realistically but at least possibly—change, too.)

OK so if we are climbing levels of abstraction here, then I feel entitled to say this: what also bothers me is the fact that yet another person is trying to write their own unsafe arena, whereas there are several ready-made solutions for this problem:

If neither of these, nor the myriad of other available allocator/memory manager crates satisfies OP's needs, then they should instead describe the problem they are trying to solve, without immediately jumping onto the unsafe train.

There is a reason this kind of code is not welcome by safe Rust, and unless a highly specialized solution is needed, one should probably rely on (battle-)tested implementations, instead of trying to reinvent the wheel.

4 Likes

I think that this is already the case, since String internally contains a core::ptr::Unique which means it acts like it stores a str inline. To be on the safe side one could store an AliasableString which definitely allows aliasing of the inner str.

1 Like

I was mostly just avoiding to go into details about how you cannot implement the API that OP proposes with RefCell, among other things because there's no place to store the guards.

Anyways, I do stand with your opinion that implementing an arena is not really necessary. And that writing unsafe should be avoided if possible.

And even if reinventing the wheel, I also think that there are also ways to implement even this data structure without doing quite as sketchy stuff: One could, for example, just store the strings in raw form as (*mut u8, usize, usize) triples.

Good point, I forgot that Vec (and thus String) does use Unique. In this case, I think that no &mut String is even created during push/resizing, so the point that the code might not contain any UB still stands.

An &mut Vec<String> is created and since Vec also uses Unique that would imply &mut [String] which in turn would imply &mut String which is then equivalent to &mut str, which is UB because there is already a shared reference to it. At least that's how I think Unique works.

Even an empty Vec<T> contains a Unique<T>, and the Unique<T> in general doesn't really know its own “length”, so I'm not quite sure how exactly all of this works out in detail. It may very well be the case that a &mut Vec<String> invalidates all references obtained from its contents, and in case the contents are Unique pointers themselves, maybe even transitively. But I don't know. Possibly nobody really knows, for there's no (accurate) formal specification of Rust, or LLVM.

1 Like

A few random comments after reading some of the replies:

  • while I agree that one should do some due diligence to see if their problem is solved by an existing battle tested crate, there’s an educational benefit to going through this exercise. And given the ambiguity, subtleness and lack of formalism around issues such as in this thread, it’s quite possible those same battle tested crates are equally broken, they just don’t know it :slight_smile:
  • RefCell could be used here but it won’t obviate the need for unsafe code to return the &str. What it would buy is something like protection against mistakenly mutably borrowing the vec in parallel.
  • I think more canonical solution for interning is returning a handle (eg usize, wrapped via newtype) rather than a reference. Then access the value via the handle.
  • Along similar lines, returning a raw ptr to the str (wrapped in newtype) rather than &str may be less controversial in terms of violating aliasing. The raw ptr would still need to be turned into a reference for use but presumably the scope of those references is much narrower than returning &str outright

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.