How to write a fast parser in idiomatic Rust

I am trying to write a fast parser for a toy Lisp language and I feel like it is not clicking. I want to take advantage of Rust's borrow checker and avoid allocating strings by collecting slices of the original string instead. But I have already stepped into unsafe code several times to get what I want. In particular, I don't like that my function parse_symbol returns a reference with a lifetime that is not bound to any input parameter, because the string I am taking a slice of is not visible to the function. I am not sure how Rust is even validating that.

Also, I could not find any way to move an iterator backwards. I ended up creating my own prev iterator function, but I feel like I am missing something obvious about how to do this in idiomatic Rust. I don't want to use a Peekable iterator because that would mean essentially parsing my entire file twice, including rescanning for code point boundaries.

So really, two questions here:

  1. How do I create a non-allocating parser without resorting to unsafe to fake the lifetimes?
  2. How do I structure my parser so that it is idiomatic and also fast (i.e. no Peekable)?
use std::str;
use std::slice;

trait PrevIter {
    fn prev(&mut self);
}

impl PrevIter for str::Chars<'_> {
    fn prev(&mut self) {
        let orig_str = self.as_str();
        let orig_ptr = orig_str.as_ptr();
        let char_len = unsafe {
            const MAX_UTF8_CHARS: usize = 4;
            let tmp_ptr = orig_ptr.sub(MAX_UTF8_CHARS);
            let tmp_slice = slice::from_raw_parts(tmp_ptr, MAX_UTF8_CHARS);
            let mut iter = str::from_utf8_unchecked(tmp_slice).chars();
            iter.next_back();
            MAX_UTF8_CHARS - iter.as_str().len()
        };
        *self = unsafe {
            let new_len = orig_str.len() + char_len;
            let new_slice = slice::from_raw_parts(orig_ptr.sub(char_len), new_len);
            str::from_utf8_unchecked(new_slice).chars()
        };
    }
}

pub fn parse<'a>(sexp: &'a str) -> Vec<&'a str> {
    let mut chars = sexp.chars();
    let mut lexems: Vec<&str> = Vec::new();
    while let Some(chr) = chars.next() {
        match chr {
            '(' => {
                lexems.push("(");
            },
            ')' => {
                lexems.push(")");
            }
            ' ' | '\t' => {
                lexems.push("space");
            }
            '\'' => {
                lexems.push("quote");
            }
            _   => {
                chars.prev();
                let symbol = parse_symbol(&mut chars);
                println!("\"{}\"", symbol);
                lexems.push(symbol);
            }
        }
    }
    return lexems;
}

fn symbol_char(char: char) -> bool {
    match char {
        '\x00'..=' ' |
        '(' | ')' | '[' | ']' |
        '#' | ',' | '.' | '`' |
        ';' | '"' | '\'' | '\\' => false,
        _ => true,
    }
}

unsafe fn string_from_ptrs<'a>(beg: *const u8, end: *const u8) -> &'a str {
    let len = end as usize - beg as usize;
    let slice = std::slice::from_raw_parts(beg, len);
    str::from_utf8_unchecked(slice)
}

fn parse_symbol<'a>(chars: &mut str::Chars) -> &'a str {
    let mut escaped = false;
    let beg_str = chars.as_str();

    while let Some(char) = chars.next() {
        if escaped || char == '\\' {
            escaped = !escaped;
        } else if !symbol_char(char) {
            unsafe {
                chars.prev();
                let end = chars.as_str().as_ptr();
                return string_from_ptrs(beg_str.as_ptr(), end);
            }
        }
    }
    unsafe {
        let slice = std::slice::from_raw_parts(beg_str.as_ptr(), beg_str.len());
        str::from_utf8_unchecked(slice)
    }
}

fn main() {
    let symbols = parse("(foo (bar) baz 'word) bob");
    for s in symbols {
        println!("\"{}\"", s);
    }
}

#[cfg(test)]
mod test {
    macro_rules! vec_of_strings {
        ($($x:expr),*) => (vec![$($x.to_string()),*]);
    }

    #[test]
    fn parse() {
        let symbols = super::parse("(foo (bar) baz 'word) bob");

        let golden = vec_of_strings![
            "(", "foo", "space", "(", "bar", ")", "space",
            "baz", "space", "quote", "word", ")", "space",
            "bob"
        ];

        assert_eq!(golden, symbols);
    }
}


So first off, you don't need unsafe for this. Using raw pointers to try and "extend" a lifetime because the borrow checker doesn't like your code should set off alarm bells.

When I do something like this I'll split it into two stages, lexical analysis and parsing.

The lexer is where the interesting no-copy comes in.

struct Lexer<'a> {
    src: &'a str,
    cursor: usize,
}

impl<'a> Lexer<'a> {
    pub fn new(src: &'a str) -> Self {
        Lexer {
            src, 
            cursor: 0,
        }
    }
    
    pub fn rest(&self) -> &'a str {
        &self.src[self.cursor..]
    }
}

I'll also define a Token for the individual "atoms" of my language (identifiers, string literals, number literals, etc.) and a Span for keeping track of where things are (which helps when reporting errors).

#[derive(Debug)]
struct Span {
    start: usize,
    end: usize,
}

#[derive(Debug)]
enum Token<'a> {
    Identifier(&'a str),
    Integer(i64),
}

And finally, my Lexer is just an iterator which yields either Tokens or a ParseError.

// Assumes the `regex` and `once_cell` crates; ParseError is a unit struct here.
use once_cell::sync::Lazy;
use regex::Regex;

#[derive(Debug)]
struct ParseError;

static WHITESPACE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^\s*").unwrap());
static IDENTIFIER: Lazy<Regex> = Lazy::new(|| Regex::new(r"^[\w_][\w\d_]*").unwrap());
static INTEGER: Lazy<Regex> = Lazy::new(|| Regex::new(r"^\d+").unwrap());

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<(Token<'a>, Span), ParseError>;

    fn next(&mut self) -> Option<Self::Item> {
        // skip leading whitespace
        let whitespace_characters = WHITESPACE
            .find(self.rest())
            .map(|m| m.as_str().len())
            .unwrap_or(0);
        self.cursor += whitespace_characters;

        if let Some(m) = INTEGER.find(self.rest()) {
            // We found an integer!
            let start = self.cursor;
            let text = m.as_str();
            let end = self.cursor + text.len();
            self.cursor = end;

            match m.as_str().parse() {
                Ok(integer) => return Some(Ok((Token::Integer(integer), Span { start, end }))),
                Err(_) => return Some(Err(ParseError)),
            }
        }

        if let Some(m) = IDENTIFIER.find(self.rest()) {
            // We found an identifier
            let start = self.cursor;
            let text = m.as_str();
            let end = self.cursor + text.len();
            self.cursor = end;

            // the text comes from our original string, no copies!
            return Some(Ok((Token::Identifier(text), Span { start, end })));
        }

        if self.rest().is_empty() {
            None
        } else {
            // We've found characters we don't recognise
            Some(Err(ParseError))
        }
    }
}

The code is quick and hacky, but it won't panic and doesn't need any unsafe.

(playground)

From there your Parser can either iterate over the Lexer (which will lazily find the next token when needed) and parse the tokens using a typical parsing algorithm (e.g. recursive descent) or collect them into a Vec<_> so your Parser can jump back and forth through the tokens (e.g. if you need lookahead or backtracking).

Collecting into a Vec may actually be faster because it simplifies your logic (no need to handle lexer errors whenever you reach for the next token) and may allow more/better optimisations. Standard disclaimer applies: don't blindly trust performance advice from random people on the internet; benchmark your code if you care about performance.
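For concreteness, here's what the recursive-descent half might look like over an already-collected token list. This is a minimal sketch, not your exact types: Token and Expr here are simplified stand-ins for the lexer's output above.

struct Sketch; // (ignore; see below)

```rust
// A minimal recursive-descent sketch over a slice of tokens. The parser
// tracks its position with a plain index, so lookahead and backtracking
// are just index arithmetic.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    LParen,
    RParen,
    Symbol(&'a str),
}

#[derive(Debug, PartialEq)]
enum Expr<'a> {
    Atom(&'a str),
    List(Vec<Expr<'a>>),
}

// Parse one expression starting at `pos`; on success, return the
// expression and the position just past it.
fn parse_expr<'a>(tokens: &[Token<'a>], pos: usize) -> Option<(Expr<'a>, usize)> {
    match tokens.get(pos)? {
        Token::Symbol(s) => Some((Expr::Atom(*s), pos + 1)),
        Token::LParen => {
            let mut items = Vec::new();
            let mut pos = pos + 1;
            while tokens.get(pos)? != &Token::RParen {
                let (expr, next) = parse_expr(tokens, pos)?;
                items.push(expr);
                pos = next;
            }
            Some((Expr::List(items), pos + 1)) // consume the ')'
        }
        Token::RParen => None, // unbalanced close paren
    }
}

fn main() {
    // Tokens for: (foo (bar))
    let tokens = [
        Token::LParen,
        Token::Symbol("foo"),
        Token::LParen,
        Token::Symbol("bar"),
        Token::RParen,
        Token::RParen,
    ];
    let (expr, end) = parse_expr(&tokens, 0).unwrap();
    assert_eq!(end, tokens.len());
    println!("{:?}", expr); // prints: List([Atom("foo"), List([Atom("bar")])])
}
```

Note how the atoms still borrow from the original source via &'a str, so the whole pipeline stays allocation-free apart from the Vecs that hold the structure.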

I don't quite know where this comment comes from; Peekable isn't slow and doesn't require any dynamic allocations (it's available to #[no_std] crates). See for yourself.
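A quick way to convince yourself: Peekable caches at most one item, so a peeked char is decoded exactly once and then served from the cache on the following next(). A tiny demonstration using nothing beyond std:

```rust
fn main() {
    // peek() pulls the next char from the underlying Chars iterator
    // (crossing its UTF-8 boundary once) and stashes it; the following
    // next() returns the stashed value without re-scanning the string.
    let mut chars = "héllo".chars().peekable();
    assert_eq!(chars.peek(), Some(&'h'));
    assert_eq!(chars.next(), Some('h')); // served from the cache
    assert_eq!(chars.next(), Some('é')); // decoded fresh, exactly once
    println!("ok");
}
```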


Thanks for this fantastic reply! This is very helpful. It shows that I was thinking in the wrong direction.

I don't quite know where this comment comes from, Peekable isn't slow and doesn't require any dynamic allocations

This comes from the fact that I assumed a peekable iterator requires you to scan your document twice (once for peek and once for next). That would mean scanning for UTF-8 boundaries twice for every character.

No, have a look at the source code I linked. When you peek() it'll pull the next item from the wrapped iterator and stash it away to be yielded when you call next() on the Peekable.


No, have a look at the source code I linked. When you peek() it'll pull the next item from the wrapped iterator and stash it away to be yielded when you call next() on the Peekable.

Touché. I feel foolish that I just assumed how it worked. I will know to look at the source code next time before making assumptions!


This also makes sense when you think about it: what happens if iterating has a side effect, like adding an entry to a database?

Also, here's an experiment:

struct CountDown {
    number: usize,
}

impl Iterator for CountDown {
    type Item = usize;
    
    fn next(&mut self) -> Option<Self::Item> {
        if self.number == 0 {
            return None;
        }
        self.number -= 1;

        let number = self.number;
        println!("Generating {}", number);

        Some(number)
    }
}

fn main() {
    let mut counter = CountDown { number: 5 }.peekable();
    
    while let Some(number) = counter.next() {
        println!("Got {}", number);
        
        if number % 2 == 0 {
            println!("Peeking {:?}", counter.peek());
        }
    }
}

Which outputs

Generating 4
Got 4
Generating 3
Peeking Some(3)
Got 3
Generating 2
Got 2
Generating 1
Peeking Some(1)
Got 1
Generating 0
Got 0
Peeking None

(playground)
