Difficulty managing lifetimes with a struct

Hello, I'm a new Rust user (with some C++ experience) who has been working through the scanning chapter of Crafting Interpreters, and I've been struggling to understand and manage the lifetimes in this code. Conceptually, I'm trying to have my Scanner take in a source code string (as a Vec<u8> for simplicity) and scan it into a list of tokens. Each token contains a lexeme, a slice pointing into the source vec (since the token doesn't need to own the source).

As a result, Token has a lifetime, which I named 'source to express that each token borrows from, and so cannot outlive, the source code. In one of my Scanner methods, scan_to_token, I'm trying to create a token with the lifetime 'source, but I'm getting an error. A simplified version of the code is below (playground link):

#[derive(Debug)]
pub enum TokenType {
    LeftParen,
    RightParen,
    Eof,
    // rest omitted for simplicity
}

#[derive(Debug)]
pub struct Token<'source> {
    pub token_type: TokenType,
    pub lexeme: &'source [u8],
    pub line: usize,
}

pub struct Scanner<'source> {
    source: Vec<u8>,
    tokens: Vec<Token<'source>>,
    start: usize,
    current: usize,
    line: usize,
}

impl<'source> Scanner<'source> {
    pub fn new(source: &'source str) -> Self {
        Self {
            source: source.as_bytes().to_owned(),
            tokens: Vec::new(),
            start: 0,
            current: 0,
            line: 1,
        }
    }

    pub fn scan_tokens(&mut self) {
        while !self.is_at_end() {
            self.start = self.current;

            let next_token = self.scan_to_token();
            self.tokens.push(next_token);
        }

        self.tokens.push(Token {
            token_type: TokenType::Eof,
            lexeme: b"",
            line: self.line,
        })
    }

    fn scan_to_token(&mut self) -> Token<'source> {
        let next_character = self.advance();

        match next_character {
            b'(' => self.create_token_object(TokenType::LeftParen),
            b')' => self.create_token_object(TokenType::RightParen),
            // rest omitted for simplicity
            _ => panic!("Not handled for now"),
        }
    }

    fn is_at_end(&self) -> bool {
        self.current >= self.source.len()
    }

    fn advance(&mut self) -> u8 {
        let result = *self
            .source
            .get(self.current)
            .expect("out of range scanner index (this should never happen)");

        self.current += 1;

        result
    }

    fn create_token_object(&self, token_type: TokenType) -> Token {
        let text = &self.source[self.start..self.current];

        Token {
            token_type,
            lexeme: text,
            line: self.line,
        }
    }
}

The error is as follows:

   Compiling playground v0.0.1 (/playground)
error: lifetime may not live long enough
  --> src/lib.rs:55:21
   |
24 | impl<'source> Scanner<'source> {
   |      ------- lifetime `'source` defined here
...
50 |     fn scan_to_token(&mut self) -> Token<'source> {
   |                      - let's call the lifetime of this reference `'1`
...
55 |             b')' => self.create_token_object(TokenType::RightParen),
   |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ method was supposed to return data with lifetime `'source` but it is returning data with lifetime `'1`

What can I do to resolve this? Or better yet, how can I model this more idiomatically? Marking the &mut self in the erroring method as &'source mut self creates issues with the loop in scan_tokens, so I assume that isn't the right way to go about it.

You're trying to create a self-referential struct. Those aren't safely supported.[1]

Use some sort of span that's not a reference (&_), like a range, or split up your structs so that the borrowing struct is separate from the owning struct.[2] Or there may be other approaches in crates, like reference-counted substrings; a rough sketch of that idea follows the footnotes.


  1. in any practically usable way ↩︎

  2. in the latter case, the borrowing struct will need to go away before the owning struct is moved or destroyed ↩︎
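
For concreteness, here is a loose sketch of the reference-counted option. No specific crate is named above, so this just uses std's Rc<str> plus a range; the text method is a hypothetical helper:

use std::ops::Range;
use std::rc::Rc;

// `TokenType` as defined in the original post.
#[derive(Debug)]
pub enum TokenType { LeftParen, RightParen, Eof }

// Sketch: each token shares ownership of the source, so no lifetime
// parameter is needed and no self-referential borrow arises.
#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub source: Rc<str>,      // cheap to clone, shared with the scanner
    pub lexeme: Range<usize>, // byte span of this token within `source`
    pub line: usize,
}

impl Token {
    // Hypothetical helper: resolve the span back into text on demand.
    pub fn text(&self) -> &str {
        &self.source[self.lexeme.clone()]
    }
}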


This is a self-referential struct, because tokens contains slices of (that is, references into) the source field. Self-referential structs are not supported in safe Rust.

The usual approach is to use integer offsets or offset-length pairs instead of a slice; std::ops::Range<usize> can often be used for this. So something like this:

use std::ops::Range;

pub struct Token {
    pub token_type: TokenType,
    pub lexeme: Range<usize>,
    pub line: usize,
}

playground

Note that the 'source lifetime is no longer needed, which is a good thing.
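
For illustration (these exact helpers aren't quoted from the playground): create_token_object then builds an index range instead of borrowing, and the text is resolved against the source only when it's needed:

fn create_token_object(&self, token_type: TokenType) -> Token {
    Token {
        token_type,
        lexeme: self.start..self.current, // indices instead of a slice
        line: self.line,
    }
}

// Hypothetical helper: look up a token's text in the scanner's source.
fn lexeme_text(&self, token: &Token) -> &[u8] {
    &self.source[token.lexeme.clone()]
}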


This has a few problems.

The first is that you are creating a self-referential type. Scanner cannot have fields that refer to other fields of itself without some gnarly workarounds. The lifetime annotation tells the holder of a Scanner that the type refers to something which is owned elsewhere.

To fix this, Scanner can hold an &'source [u8] rather than an owned copy like Vec<u8>. That will allow the token vector to reference the same backing buffer.
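
Roughly like this (a sketch of the change, not the exact playground diff):

pub struct Scanner<'source> {
    source: &'source [u8], // changed: borrow the caller's buffer
    tokens: Vec<Token<'source>>,
    start: usize,
    current: usize,
    line: usize,
}

impl<'source> Scanner<'source> {
    pub fn new(source: &'source str) -> Self {
        Self {
            source: source.as_bytes(), // changed: no owned copy
            tokens: Vec::new(),
            start: 0,
            current: 0,
            line: 1,
        }
    }
}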

The second is that you are possibly not handling UTF-8 appropriately.[1] I understand the point of Crafting Interpreters is to teach how things like a lexer work -- I'm reading the book myself. But it's disappointing that the Lox language spec only supports ASCII. This is an improvement that you could consider in your implementation.

Third, you have a very common mistake with lifetime elision:

fn create_token_object(&self, token_type: TokenType) -> Token

This method signature elides a lifetime on Token<'_>, which the compiler desugars to something like:

fn create_token_object<'a>(&'a self, token_type: TokenType) -> Token<'a>

Definitely not what you intended! You can enable the #![deny(elided_lifetimes_in_paths)] lint to catch this kind of error.
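
With the borrowed-source Scanner sketched above, the fix is to name the struct's lifetime in the return type instead of letting it elide to the &self borrow (again a sketch, not the exact playground code):

fn create_token_object(&self, token_type: TokenType) -> Token<'source> {
    // `self.source` is itself an `&'source [u8]`, so reborrowing it
    // yields a slice with the full 'source lifetime, independent of
    // the shorter borrow of `self`.
    let text = &self.source[self.start..self.current];

    Token {
        token_type,
        lexeme: text,
        line: self.line,
    }
}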

Altogether, I made four changes to the playground example to get it to compile. I added comments where appropriate, mentioning whether something has been changed or added.


  1. You may want to use unicode-segmentation. ↩︎


Thanks for clarifying the multiple issues! Just a follow-up question regarding UTF-8, since I did (eventually) plan on adding that support. I chose to go with Vec<u8> for now because I wasn't sure how to deal with Unicode issues: character boundaries, indexing within the slices in O(1) time, and so on. Would I have to use the external crate you mentioned for this purpose? What sort of thing did you end up doing?

unicode-segmentation is the best crate I know of for breaking strings into grapheme clusters, which are the smallest units of text that you typically consider a "character". Unicode is a very complex specification (reflecting the inherent complexity of written human languages), but it's nice that you can ignore almost all of the details just by treating UTF-8 text as sequences of grapheme clusters.

I'm by no means an expert in human languages, and there are probably counterexamples that show it doesn't work 100% correctly in all cases. But I haven't seen any situations in the wild where a grapheme cluster objectively wasn't what you would intuitively define as a character.

They are substrings, so they map directly to &str. That means they can trivially reference an existing string without copying, and they are always guaranteed to fall on character boundaries. And so forth.

Constant-time indexing is a non-goal, IMHO. Grapheme clusters are not fixed-size, so you will need an iterator-based approach to support Unicode regardless of the encoding. If indexing is critical for performance (maybe you really, truly need random access?), then it might be worth collecting the iterator into a Vec<&str>. That has constant-time indexing, and each element is a complete grapheme cluster; something along these lines:
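
// Sketch (not quoted from the original post): collecting grapheme
// clusters with the unicode-segmentation crate.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let source = "héllo"; // multi-byte text
    // `graphemes(true)` yields extended grapheme clusters as `&str`
    // slices borrowing from `source`, without copying.
    let clusters: Vec<&str> = source.graphemes(true).collect();

    assert_eq!(clusters[1], "é"); // O(1) indexing after collecting
}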

I ... haven't got that far in the book, yet! :sob: Chapter 4 sent me down a rabbit hole trying to fix the positions of some of the aside notes that get clumped together at the top of some of the pages. (I'm using Stylish to do local CSS fixes. I don't know the book's build system well enough to submit any patches.) But I've fixed enough chapters by now that I can finally start reading chapter 4 without being stressed out about the side notes.

Personally, I might start with unicode-segmentation instead of a byte-oriented parser. But ultimately, I want to replace the hand-coded parser with something like pest. I think the EBNF-like grammatical rules are much better for documenting the language grammar than a bunch of code.
