Strtol equivalent in Rust?

I’m trying to rewrite chibicc (a minimalist C compiler) in Rust. I’m already stuck at the second commit, which uses strtol.

char *p = argv[1];

while (*p) {
    if (*p == '+') {
        p++;
        // Note: a test is missing here to check that *p isn’t \0
        // in case the input was invalid and the next token is missing
        printf("  add $%ld, %%rax\n", strtol(p, &p, 10));
        // another test is missing here to check that strtol parsed a number
        continue;
    }
    // more parsing...
}

The input is assumed to be a valid 0-terminated UTF-8 string (I’m ok if my code crashes when the input is invalid).

p is a pointer to the next byte to read. That’s the first issue, since a byte may not align with the start of a valid UTF-8 grapheme. Once again, I’m ok with assuming that my implementation of the parser is valid, and if it isn’t, the program can crash.

p is updated each time a token is consumed. When the length of the token is statically known (like +, which is 1 byte), it’s easy. However, when it isn’t statically known, as with strtol, I don’t know how to convert the C code without a lot of verbosity, or without doing the same work multiple times in Rust.

  • I’d like to not use libc::strtol.
  • I don’t want my code to be any slower than the C implementation (no extra copies, allocations, or re-reading of the input) for anything that isn’t error reporting or undefined behavior (the C code has missing checks, as seen in the snippet).

So far I have:

let input: &str = ...; // assumed to be a valid null terminated utf8 string

let mut index = 0;
while index < input.len() {
    // this line is way too verbose
    if '+' == input.bytes().nth(index).unwrap().into() {
        index += 1;
        assert!(index < input.len(), "unexpected end of input after '+'");
        println!(
            "  add ${}, %rax",
            input[index..]
                .parse::<isize>() // parse() doesn’t return the number of bytes read
                .expect(&format!("expecting a number after '+' at index {}", index))
        );
        // index isn’t updated
        continue;
    }

    // ...
}
  • I don’t know how to update index after parsing the number (since parse() doesn’t return the number of bytes read), unless I manually count the number of digits (or use the regex crate), which would duplicate the work done by parse().
  • the syntax for accessing bytes is ultra-verbose, and it’s not clear what is going on, while the C version is much clearer (albeit error-prone).
  • is &str::len() going to call the equivalent of strlen()? If so, I think I should replace the tests with input.bytes().nth(index).unwrap() != 0, which once again is ultra-verbose.
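For what it’s worth, the closest I’ve found to strtol’s "parse and tell me where you stopped" behavior is to locate the end of the digit run with find() before calling parse(), so the length is known up front (a sketch; take_int is just a name I made up):

```rust
// Sketch: parse a run of ASCII digits starting at `index` and return the
// value together with the updated index. `find` locates the first non-digit
// byte, so parse() only ever sees the digit slice and cannot fail on
// trailing input. Total work is still one linear scan plus parse().
fn take_int(input: &str, index: usize) -> Option<(i64, usize)> {
    let rest = &input[index..];
    // Byte offset of the first non-digit character, or the end of the string.
    let len = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    if len == 0 {
        return None; // no digits at this position
    }
    let value = rest[..len].parse().ok()?;
    Some((value, index + len))
}
```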

Normally when I'm doing lexical analysis I'll pull the lexer state into its own type.

#[derive(Debug, Clone, PartialEq)]
struct Tokens<'a> {
    src: &'a str,
    cursor: usize,
}

impl<'a> Tokens<'a> {
    fn new(src: &'a str) -> Self { Tokens { src, cursor: 0 } }

    fn rest(&self) -> &'a str { &self.src[self.cursor..] }

    fn peek(&self) -> Option<char> { self.rest().chars().next() }

    // Move past the character under the cursor (used by take_while() below).
    fn advance(&mut self) {
        if let Some(c) = self.peek() {
            self.cursor += c.len_utf8();
        }
    }
}

Then I give it some sort of take_while() method which will consume characters as long as a predicate is satisfied, returning the bit of text and its span.

use std::ops::Range;

impl<'a> Tokens<'a> {
    fn take_while<P>(
        &mut self,
        mut predicate: P,
    ) -> Option<(&'a str, Range<usize>)>
    where
        P: FnMut(char) -> bool,
    {
        let start = self.cursor;

        while let Some(c) = self.peek() {
            if !predicate(c) { break; }

            self.advance();
        }

        let end = self.cursor;

        if start != end {
            let text = &self.src[start..end];
            Some((text, start..end))
        } else {
            None
        }
    }
}

You can then identify numbers by consuming while c.is_ascii_digit(), then pass that string to parse().
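As a self-contained illustration of that last step (take_number is a hypothetical name; this is the same consume-then-parse idea written as a free function over a cursor):

```rust
use std::ops::Range;

// Consume ASCII digits from `src` starting at `*cursor`, then hand the
// digit slice to parse(). Returns the parsed value and the span it came
// from; the cursor ends up pointing at the first byte after the number.
fn take_number(src: &str, cursor: &mut usize) -> Option<(i64, Range<usize>)> {
    let start = *cursor;
    while let Some(c) = src[*cursor..].chars().next() {
        if !c.is_ascii_digit() {
            break;
        }
        *cursor += c.len_utf8(); // 1 for ASCII digits
    }
    if start == *cursor {
        return None; // no digits consumed
    }
    let text = &src[start..*cursor];
    // parse() only ever sees digits here, so it cannot fail on trailing input.
    Some((text.parse().ok()?, start..*cursor))
}
```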

See also the Tokenizing section here:
