Lifetimes, Strings, slices

I'm trying to work out how to create a struct that:

  • Takes ownership of a String
  • Parses the string
  • Contains a data structure that contains tokens of the String as slices of the owned string.
  • Can return immutable borrows of the String and the tokens with the same lifetime as the struct.

As such, I'm coming up against the fact that I don't really grok lifetimes.

Here's what I've got (this is a toy problem):

struct ParsedString<'input> {
    input: String,
    tokens: Vec<&'input str>,
}

impl <'input> ParsedString<'input> {
    pub fn new(input: &str) -> Self {
        let mut ps = ParsedString {
            input: input.into(),
            tokens: Vec::new(),
        };

        let len = input.len();

        // Pretend to parse the string; really, just save a token.
        if len > 4 {
            ps.tokens.push(&ps.input[0..3]);
        }

        // Compilation error here: 
        // cannot return value referencing local data `ps.input`
        ps
    }

    pub fn input(&self) -> &str {
        &self.input
    }

    pub fn tokens(&self) -> &[&str] {
        &self.tokens
    }
}

Clearly there's a way to do this.

Ultimately I want to use Cow so that the tokens can be taken directly from the input or supplied at parse time.

You're trying to make a self referential struct which is not possible in pure safe rust. There are however crates like ouroboros that wraps the necessary unsafe code for you.

Oh, ugh. This seems like such an obvious use-case, there ought to be a dirt simple way to do it.

Thanks for the mention of Ouroboros.

Alternatively, for your specific case, you could keep tokens as spans of text:

struct ParsedString {
    input: String,
    tokens: Vec<(usize, usize)>,
}

impl ParsedString {
    // there may be better ways to retrieve the tokens 😅
    pub fn tokens(&self) -> Vec<&str> {
        self.tokens
            .iter()
            .map(|(start, end)| &self.input[*start..*end])
            .collect()
    }
}

You could change your struct a bit, and keep the input stored outside

struct ParsedString<'input> {
    input: &'input str,
    tokens: Vec<&'input str>,
}

impl<'input> ParsedString<'input> {
    pub fn new(input: &'input str) -> Self {
        let mut ps = ParsedString {
            input,
            tokens: Vec::new(),
        };

        let len = input.len();

        // Pretend to parse the string; really, just save a token.
        if len > 4 {
            ps.tokens.push(&ps.input[0..3]);
        }

        ps
    }

    pub fn input(&self) -> &'input str {
        self.input
    }

    pub fn tokens(&self) -> &[&'input str] {
        &self.tokens
    }
}
2 Likes

As an alternative to attempting to create a self-referential struct, you could

  • not have your struct own the string, e.g. like this
  • use indices instead of references for the tokens, e.g. like this

Dang it... now that I’m done implementing it, both approaches were already posted xDDD

1 Like

"You could change your struct a bit, and keep the input stored outside"

That just moves the problem up a level.

I've written a TCL interpreter in Rust, called Molt; I'm trying to improve its memory usage. Molt parses a script, yielding a parse tree defined using an enum; and many of the elements in the tree need to reference text from the script. Plus, I need to keep the entire script. At present, the parse tree keeps the tokens as Strings, which (by the nature of TCL but for reasons I won't go into) causes a lot more duplication than you might think. I want to set things up so I have one copy of the script in memory, and the parse tree is built using slices. So no matter how you look at it, I'm going to have a struct that (utlimately) owns both the String and the slices.

Maybe I need to use an interned string solution?

That would work, but it seems ugly, like I'm re-implementing the notion of a slice.

Another option is to use Rc, and let every token own a handle to it.

use std::ops::{Range, Deref};
use std::rc::Rc; // or Arc

struct Token {
    input: Rc<str>,
    span: Range<usize>
}

impl Deref for Token {
    type Target = str;
    fn deref(&self)->&str { &self.input[self.span.clone()] }
}

struct ParsedString {
    input: Rc<str>,
    tokens: Vec<Token>,
}

impl ParsedString {
    pub fn new(input: impl Into<String>) -> Self {
        let mut ps = ParsedString {
            input: input.into().into_boxed_str().into(),
            tokens: Vec::new(),
        };

        let len = ps.input.len();

        // Pretend to parse the string; really, just save a token.
        if len > 4 {
            ps.tokens.push(Token {
                input: ps.input.clone(),
                span: 0..3
            });
        }

        // Compilation error here: 
        // cannot return value referencing local data `ps.input`
        ps
    }

    pub fn input(&self) -> &str {
        &self.input
    }

    pub fn tokens(&self) -> impl Iterator<Item = &Token> {
        self.tokens.iter()
    }
    
    pub fn into_tokens(self) -> impl Iterator<Item = Token> {
        self.tokens.into_iter()
    }
}

(Playground -- based on @steffahn's code)

2 Likes

This is sounding like what I'd have to do; it's a reasonable extension of what's already in the interpreter. I have a Value struct; it currently stores Strings as Rc<String> so that I can copy the value without duplicating its content. I could also have a flavor that includes a range. Copy the string, set the range. (Tokens in the parse tree are often stored as Values.

Also note that a range is two usizes, or 16 bytes. A reference will be another 8. How big are these tokens? Would something like SmallString be a viable alternative?

The problem is that this is a pretty complex topic that doesn't play well with the borrow checker. The fact that an instance of a struct may be moved at any time means that you can't store a reference to a field because that could be invalidated at any time. You need an ad hoc language feature for this, but this requires considering may things:

  • Does it need to work only when the borrowed type Derefs to a stable location or can it be just any type?
    • If only those that Derefs to a stable location do you require an unsafe trait like StableDeref or can you recognize them automatically?
    • If any type, how do you ensure the struct can't be moved when it borrows non-stable Deref fields? How does it integrate with the existing Pin?
  • How do you recognize which fields can be borrowed immutably/mutably or can't be borrowed at all (because already borrowed)?
  • Can you prove it is 100% safe? For example the existing self-referential structs don't play well with stacked borrows and as such they're technically unsound.

Overall it is a feature that may come in handy but nobody has yet proposed a safe abstraction for it that was accepted.

2 Likes

I would argue that it restores the problem to the level where it actually appears: at the point where you create something and have to decide whether to pass ownership into it or have it borrow. This is the only point where you can meaningfully attach a lifetime to something; once you've made the decision to own, you can't change your mind and borrow it deeper in the stack, because lifetime analysis is static, not dynamic.

Arc is a way of making lifetime analysis partially dynamic, at the cost of moving it to runtime -- you have to do some manual cloning. Cow is a different take on that; I'm not sure if it would work in your situation. Changing the struct to always borrow, on the other hand, is a different kind of concession: keep the static lifetime analysis, and accept that you need to change the structure of your code to prove to the compiler it's correct.

(I will break from the party line long enough to point out that it is possible to create and use self-referential structs in safe Rust, but you have to accept constraints, like that they can't be borrowed mutably or moved, and you can't create one and initialize it in a single line.)

None of these things is wrong or right in the general case. You have to decide what the correct tradeoff is for your application. Bear in mind in another language you'd just be building on top of that language's string type and garbage collector (in e.g. Java) or writing unsafe code without any compiler guidance (e.g. C++).

4 Likes

In case anyone is interested, I’ve been playing around with potential API for such a Token type for the last hour. (I renamed it RcStringSlice, which isn’t a perfect name either.)

1 Like

I managed to turn your code into RcSlice<str>, and added support for both RcSlice<[T]> and RcSlice<[T;N]>.

1 Like

I have a crate implementing something similar (but only for "true" slices, not str):

https://crates.io/crates/rc_slice

(If anyone thinks this is potentially useful I'd love to hear your feedback, so far I've just been developing it to use in a personal project.)

1 Like