Rust lifetimes explaination needed for my specific usecase

I store mutable references in a struct but to that you need lifetimes for those references inside your struct but this creates many problems.

here's the minimal reproducible example of my code

use std::{collections::HashSet, str::Chars};

pub struct Token<'a> {
    pub lexeme: &'a str,
    pub token_type: &'static TokenType,
    pub start: usize,
    pub end: usize,
}

impl<'a> Token<'a> {
    pub fn new(
        token_type: &'static TokenType,
        lexeme: Option<&'a str>,
        start: usize,
        end: usize,
    ) -> Self {
        Self {
            lexeme: lexeme.unwrap_or(token_type.lexeme),
            token_type,
            start,
            end,
        }
    }
}

#[derive(Debug)]
pub struct TokenType {
    pub lexeme: &'static str,
    pub name: &'static str,
    pub precedence: u32,
}

impl TokenType {
    const fn new(lexeme: &'static str, name: &'static str, precedence: u32) -> Self {
        Self {
            lexeme,
            name,
            precedence,
            // id,
        }
    }

    // special types
    pub const EOF: TokenType = TokenType::new("", "EOF", 0);
    pub const ERROR: TokenType = TokenType::new("", "ERROR", 0);
}
struct Source<'a> {
    src: Chars<'a>,
    size: usize,
    offset: usize,
    start: usize,
}

impl<'a> Source<'a> {
    fn new(src: &'a str, size: usize) -> Self {
        Self {
            src: src.chars(),
            size,
            offset: 0,
            start: 0,
        }
    }

    fn advance_start(&mut self) {
        self.start = self.offset;
    }
}

pub struct ScannerResult<'a> {
    line_starts: Vec<usize>,
    tokens: Vec<Token<'a>>,
}

impl<'a> ScannerResult<'a> {
    pub fn new() -> Self {
        Self {
            line_starts: Vec::new(),
            tokens: Vec::new(),
        }
    }

    pub fn push_token(&mut self, token: Token<'a>) -> Option<&Token<'a>> {
        self.tokens.push(token);
        return self.tokens.last();
    }

    pub fn push_line_start(&mut self, line_start: usize) {
        self.line_starts.push(line_start);
    }
}

pub struct Scanner<'a, 'b>
where
    'b: 'a,
{
    src: Source<'a>,
    string_table: &'b mut HashSet<String>,
    result: ScannerResult<'b>,
    has_more_tokens: bool,
}

impl<'a, 'b> Scanner<'a, 'b>
where
    'b: 'a,
{
    pub fn new(src: &'a str, size: usize, string_table: &'b mut HashSet<String>) -> Self {
        Self {
            src: Source::new(src, size),
            string_table,
            result: ScannerResult::new(),
            has_more_tokens: true,
        }
    }
}

impl Scanner<'_, '_> {
    fn error_token(&mut self, message: &'static str, c: char) -> Option<&Token> {
        let msg = format!("{}: '{}'", message, c);
        self.tokenize(&TokenType::ERROR, Some(msg))
    }

    fn intern_str(string_table: &mut HashSet<String>, string: String) -> Option<&str> {
        if !string_table.contains(&string) {
            string_table.insert(string.to_owned());
        }
        string_table.get(&string).map(|s| s.as_str())
    }

    fn tokenize(
        &mut self,
        token_type: &'static TokenType,
        lexeme: Option<String>,
    ) -> Option<&Token> {
        let lexeme = match lexeme {
            Some(lexeme) => Self::intern_str(self.string_table, lexeme),
            None => None,
        };
        let token = Token::new(token_type, lexeme, self.src.start, self.src.offset);
        self.src.advance_start();
        self.result.push_token(token)
    }
}

When compiling this code you'd get this error

error: lifetime may not live long enough
   --> src\lib.rs:135:29
    |
130 |         &mut self,
    |         ---------
    |         |
    |         let's call the lifetime of this reference `'1`
    |         has type `&mut Scanner<'_, '2>`
...
135 |             Some(lexeme) => Self::intern_str(self.string_table, lexeme),
    |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ argument requires that `'1` must outlive `'2`

The thing i don't understand about this error is since i am not explicitly using any lifetime parameters self is &mut Scanner<'_, '_> but the later '_ lives more than the former as i have specified in the Scanner struct, so why can't it auto infer that intern_str would return a string with lifetime of '2 and since token also needs the same lifetime, it would pass and compile.

Why must the reference to self of Scanner live more than the HashSet reference in the Scanner when that should not be case in the first place. The HashSet will always outlive the Scanner in my code.

The code can be fixed by adding explicit lifetime parameters but then code i omitted will not compile, like this


impl<'a, 'b> Scanner<'a, 'b> {
    pub fn scan(&'b mut self) {
        loop {
            let next: Option<&Token<'_>> = self.next();
            match next {
                Some(tok) => {
                    let tok_type = tok.token_type;
                    if tok_type == &TokenType::EOF {
                        break;
                    }
                }
                None => break,
            }
        }
    }

    pub fn next(&'b mut self) -> Option<&Token<'b>> {
        self.tokenize(&TokenType::EOF, None)
    }
}

New error

error[E0499]: cannot borrow `*self` as mutable more than once at a time
   --> src\lib.rs:126:44
    |
123 | impl<'a, 'b> Scanner<'a, 'b> {
    |          -- lifetime `'b` defined here
...
126 |             let next: Option<&Token<'_>> = self.next();
    |                                            ^^^^-------
    |                                            |
    |                                            `*self` was mutably borrowed here in the previous iteration of the loop
    |                                            argument requires that `*self` is borrowed for `'b`

Due to invariance of 'b in &'b mut Scanner<'_, 'b>, you are creating a type that is borrowed forever, which is a pretty useless affair. Can you store string_table in Scanner as an owned HashSet or behind a shared ownership construct like Rc<RefCell<HashSet>>, instead of behind a mutable reference?

it is very tricky to store references in struct fields, it is especially hard to handle mutable references, and it is even harder if you use multiple lifetimes.

some of the lifetimes in your type definition can be eliminated (or rather, erased). for example,the Source type needs the lifetime only because you store a Chars<'_> inside. you can use a generic type instead of a generic lifetime:

struct Source<CharIter: Iterator<Item = char>> {
    src: CharIter,
    //...
}

similarly, you can replace the generic lifetime with a generic type for the ScannerResult type:

pub struct ScannerResult<Token> {
    line_starts: Vec<usize>,
    tokens: Vec<Token>,
}

granted, you still need to propagate the generic parameters whenever you use the type, but a type parameter should be easier to deal with than a lifetime parameter.

as for the Scanner type, you are actually trying to create a self referential type:

with this code, lexeme borrows the string table:

and then this line indicates token also borrows the string table:

and finally, you push the token into self.result, making self.result borrowing self.string_table:

yet, self.string_table is an exclusive reference, you cannot have other references pointing to it.

your design of storing a &str inside the Token cannot be implemented safely with the way how you store the internalized strings. the address of the str slices from the string table is NOT stable, when the string table rehashes, all the previous tokens would be dangling.

the solution? either change the token type (e.g. ditch the pointers all together and only use the indices), or change the string table (with guaranteed stable address for inserted values -- data structures commonly seen in functional languages, such as persistent hash array mapped trie, but not so common in rust), or both.

if using indices is not enough, I suggest to use a container type with stable keys (one such example is slotmap) for the string table, and store the key in your token type.

2 Likes