I am trying to write a fast parser for a toy lisp language and I feel like it is not clicking. I want to take advantage of rusts borrow checker and not allocate strings but instead collect slices to the original string. But I have already stepped into unsafe code several times to get what I want. In particular I don't like that my function parse_symbol
is returning a lifetime that is not bound to an input parameter. Because the string I am taking a slice of is not visible to the function. I am not sure how rust is even validating that.
Also I could not find any way to move an iterator backwards. I ended up creating my own prev
iterator function, but I feel like I am missing something obvious here about how to do this with idiomatic rust. I don't want to use a Peekable
iterator because that means that I am essentially parsing my entire file twice, including rescanning for code point boundaries.
So really two questions here
- How do I create a non-allocating parser without resorting to
unsafe
to fake the lifetimes? - How to structure my parser so that it is idiomatic and also fast (i.e. No
Peekable
)?
use std::str;
use std::slice;
trait PrevIter {
fn prev(&mut self);
}
impl PrevIter for str::Chars<'_> {
fn prev(&mut self) {
let orig_str = self.as_str();
let orig_ptr = orig_str.as_ptr();
let char_len = unsafe {
const MAX_UTF8_CHARS: usize = 4;
let tmp_ptr = orig_ptr.sub(MAX_UTF8_CHARS);
let tmp_slice = slice::from_raw_parts(tmp_ptr, MAX_UTF8_CHARS);
let mut iter = str::from_utf8_unchecked(tmp_slice).chars();
iter.next_back();
MAX_UTF8_CHARS - iter.as_str().len()
};
*self = unsafe {
let new_len = orig_str.len() + char_len;
let new_slice = slice::from_raw_parts(orig_ptr.sub(char_len), new_len);
str::from_utf8_unchecked(new_slice).chars()
};
}
}
pub fn parse<'a>(sexp: &'a str) -> Vec<&'a str> {
let mut chars = sexp.chars();
let mut lexems: Vec<&str> = Vec::new();
while let Some(chr) = chars.next() {
match chr {
'(' => {
lexems.push("(");
},
')' => {
lexems.push(")");
}
' ' | '\t' => {
lexems.push("space");
}
'\'' => {
lexems.push("quote");
}
_ => {
chars.prev();
let symbol = parse_symbol(&mut chars);
println!("\"{}\"", symbol);
lexems.push(symbol);
}
}
}
return lexems;
}
fn symbol_char(char: char) -> bool {
match char {
'\x00'..=' ' |
'(' | ')' | '[' | ']' |
'#' | ',' | '.' | '`' |
';' | '"' | '\'' | '\\' => false,
_ => true,
}
}
unsafe fn string_from_ptrs<'a>(beg: *const u8, end: *const u8) -> &'a str {
let len = end as usize - beg as usize;
let slice = std::slice::from_raw_parts(beg, len);
str::from_utf8_unchecked(slice)
}
fn parse_symbol<'a>(chars: &mut str::Chars) -> &'a str {
let mut escaped = false;
let beg_str = chars.as_str();
while let Some(char) = chars.next() {
if escaped || char == '\\' {
escaped = !escaped;
} else if !symbol_char(char) {
unsafe {
chars.prev();
let end = chars.as_str().as_ptr();
return string_from_ptrs(beg_str.as_ptr(), end);
}
}
}
unsafe {
let slice = std::slice::from_raw_parts(beg_str.as_ptr(), beg_str.len());
str::from_utf8_unchecked(slice)
}
}
fn main() {
let symbols = parse("(foo (bar) baz 'word) bob");
for s in symbols {
println!("\"{}\"", s);
}
}
#[cfg(test)]
mod test {
macro_rules! vec_of_strings {
($($x:expr),*) => (vec![$($x.to_string()),*]);
}
#[test]
fn parse() {
let symbols = super::parse("(foo (bar) baz 'word) bob");
let golden = vec_of_strings![
"(", "foo", "space", "(", "bar", ")", "space",
"baz", "space", "quote", "word", ")", "space",
"bob"
];
assert_eq!(golden, symbols);
}
}