Managing strings with different lifetimes in the same collections? Throw it all in an Arena...or..?

Howdy all. I posted this on r/rust but figured I'd post here, too. I hope that's alright.

I'm working through some lifetime woes.

The project is a compiler. The problem is managing references to original source code text. I'll do my best to summarize the issue, and then I have a minimal repro playground link.

The Issue
The compiler is invoked upon a file. That file has a lifetime 'a. That's not negotiable as we have a lot of tooling that calls fn compile() and needs that behavior.

Now, if the user declares an additional module within their code, then the compiler, during parsing, has to go fetch that file. This is code that is loaded during the invocation of the compiler. Let's call that lifetime 'b.

I have an enum that describes these two files. However, I am having difficulty reconciling the asymmetry of the lifetimes into one clean interface. I thought about using an Arena, which I saw in a previous project, to clean up all the lifetimes and unify them. I also thought about writing a custom data structure which holds owned copies of the source and hands out temporary references, where that owning data structure is either static or global. Neither of these is appealing, however, as they would require a significant rewrite of existing infrastructure and they feel...hacky to me.

Is there another recommended pattern to handle situations like this? To be clear, I understand why the current implementation doesn't work, but I don't know what the cleanest situation for my particular case is.

The repro
Here's a rough minimal repro: Rust Playground

The project is open source. If you'd like to just see my actual branch and assess the situation, here's the line and branch with the issue:

You need to have some lower bound on all lifetimes in a collection. An arena doesn't change that. Another way to say this is that you aren't allowed to use dangling pointers. If you use an arena as an alternative heap, it can work. But this is a lot different from managing "strings with different lifetimes", which I assume is &'a str, &'b str, etc. where neither 'a nor 'b is a subtype of the other.

This is basically what I meant about using an arena as an alternative heap.

The usual recommendation is to borrow only temporarily. But...

impl <'a> Scanner<'a> {
    fn advance(&mut self) {
        self.lexer = Lexer::new(self.input[self.current_file].as_str())

self.input[self.current_file].as_str() is a temporary loan that is dropped when the stack frame ends. The way to make this work is "tying the lifetimes together", specifically the advance function would have to take &'a self instead of &mut self. (The reason for this is invariance in &mut T.) And that implies interior mutability... Which you can make work if you want to go that route.

On the other hand, this does kind of resemble an arena already. I mean if you strip away all the fancy stuff, an arena is basically just a Vec<T> that hands out loans. Replacing the vector with an arena doesn't really seem that farfetched to me.

edit: I just noticed the signature on the as_str method:

fn as_str<'b>(&'b self) -> &'a str where 'b: 'a {}

This is the same as [1]:

fn as_str(&'a self) -> &'a str {}

  1. I feel like I'm going to regret saying this when someone comes along to correct me. (quinedot lol :heart:) But I would like to know if my intuition is incorrect. ↩︎

1 Like

When SourceFile is the String variant, you're trying to create a self-referential struct. You can create one in safe Rust, but it pins it to the stack and is thus usually too inflexible. ("Usually" to the extent that I don't think I've seen a serious use of it. You'd probably need something like a visitor pattern.)

You beat me to it actually, heh. The signatures are effectively the same since the returned &str is covariant and you have to borrow self for the longer lifetime no matter what (in practice the lifetimes will be the same unless you go ham with unnecessary lifetime annotations that make things worse at your call site).

1 Like

I thought so, too. But that doesn't appear to be the case (in the playground example). I'm able to push owned and borrowed variants and advance the lexer without problems. And it makes sense. The 'a lifetime is bound to the owner of Scanner, not to Scanner itself.

What do you know that I don't know? :grin:

Oh, of course. Because I've already changed the signature for advance to borrow self for 'a. You're absolutely right. Without that change, this would attempt to create a self-referential struct.

Please demonstrate. (My guess is that you did your pushing before any advance, and none after.)

With the change to make the lifetimes the same so that things compile, you do create a self-referential struct.[1] In order for that to be sound, the lifetimes have to be the same, so that it gets borrowed forever, so that it gets pinned in place.

If it didn't get pinned in place, you could move it when the outer lifetime expires. If you can move it, you can mutate it. If you can mutate it, you can clear the Vec for example; if you can do that, you can definitely trigger UB.

  1. Literally when the variant is a String, and always as far as the compiler can tell, not that it sees it that way exactly. ↩︎

Welp, always a pleasure to learn from you! Indeed, that was my mistake. Borrowing for 'a in advance just cannot work.

But it hadn't occurred to me that this was equivalent to a self-reference. I think it makes sense.


I know you said it's non-negotiable, but it feels like a lot of your pain is coming from this design decision.

Tools like rust-analyzer tend to use a reference-counted string with structural sharing[1] because it avoids the viral nature of lifetimes and the self-referential structs you get at higher levels when you have more complex use cases.

Maybe if you've already got to add loads of hacks and refactor things to introduce an arena, it might be feasible to change the type you use for storing source code/tokens and side-step these lifetime issues altogether.

  1. think Arc<str>, but where slicing gives you a Arc<str> that points into the first one. ↩︎

And it's solved by not managing references to the original source text.

If you are writing a compiler and need to keep multiple source files around, and then get regions of them (I guess for quoting the source in spanned error messages), then do the following:

  • Create a "source cache" type that the compiler can access to lazy-load and then store source files
  • The cache should identify source files by some unique ID that's stored and returned along the source text
  • When lexing/parsing, only store integer byte indices or ranges pointing into the source, not real references
  • Additionally, these ranges should store the source id
  • When you need a substring from a file, use the source ID to look it up in the cache, and use the integer byte range to create a temporary rererence to the substring.

Incidentally, this anti-pattern usually comes up in the form of &'a mut Thing<'a>. This thread with the interior mutability version demonstrates that &'a Thing<'a> can be somewhat more flexible (you can still use Thing<'a> in a non-exclusive manner).

However, it's still a form of borrowing something forever,[1] because Thing<'a> is invariant in 'a now. Why? Because RefCell<T> is invariant in T. Why? Because it can grant exclusive access, and if it wasn't invariant, that would be unsound! You could write my segfaulting version without unsafe if it was covariant, for example.

Something else to add to that guide :slight_smile:

  1. and thus not really useful ↩︎


While people usually suggest saying/writing "exclusive" instead of "mutable", the actual problem here is not exclusivity, it is mutability. The reason why mutable references (either shared or exclusive) must be invariant is because they allow for bidirectional data flow. That is due to mutability, not exclusivity. In fact the whole point of (Ref)Cell is to allow shared mutability, not "shared exclusivity" (which looks like an oxymoron to me).