I’m trying to rewrite chibicc (a minimalist C compiler) in Rust. I’m already stuck at the second commit, which uses strtol.
char *p = argv[1];
while (*p) {
  if (*p == '+') {
    p++;
    // Note: a test is missing here to check that *p isn’t \0
    // in case the input was invalid and the next token is missing
    printf(" add $%ld, %%rax\n", strtol(p, &p, 10));
    // another test is missing here to check that strtol parsed a number
    continue;
  }
  // more parsing...
}
The input is assumed to be a valid, 0-terminated UTF-8 string (I’m OK if my code crashes when the input is invalid).
p is a pointer to the next byte to read. That’s the first issue, since a byte may not align with the start of a valid UTF-8 character. Once again, I’m OK with assuming that my implementation of the parser is valid, and if it isn’t, the program can crash.
p is updated each time a token is consumed. When the length of the token is statically known (like + which is 1 byte), it’s easy. However, when it isn’t statically known, like when using strtol, I don’t know how to convert the C code to Rust without high verbosity and without doing the same work multiple times.
- I’d like to not use libc::strtol.
- I don’t want my code to be any slower than what was implemented in C (no extra copy, allocation or reading of the input) for anything that isn’t error reporting or undefined behavior (the C code has missing checks, as seen in the snippet).
So far I have:
let input: &str = ...; // assumed to be a valid null terminated utf8 string
let mut index = 0;
while index < input.len() {
    // this line is way too verbose
    if '+' == input.bytes().nth(index).unwrap().into() {
        index += 1;
        assert!(index < input.len(), "unexpected end of input after '+'");
        println!(
            " add ${}, %rax",
            input[index..]
                .parse::<isize>() // parse() doesn’t return the number of bytes read
                .expect(&format!("expecting a number after '+' at index {}", index))
        );
        // index isn’t updated
        continue;
    }
    // ...
}
- I don’t know how to update index after parsing the number (since parse() doesn’t return the number of bytes read), unless I manually count the number of digits (or use the regex crate), which would duplicate the work done by parse().
- The syntax for accessing bytes is ultra-verbose, and it’s not clear what is going on, while the C version is much clearer (albeit error-prone).
- Is &str::len() going to call the equivalent of strlen()? If yes, I think I should replace the tests with input.bytes().nth(index).unwrap() != 0, which once again is ultra-verbose.
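
For illustration, here is a minimal sketch of the kind of helper the questions above are asking for: it parses the leading digits of a byte slice and returns both the value and the number of bytes consumed, so the index can be advanced without scanning the digits twice. The names parse_leading_int and emit_adds are hypothetical and not part of chibicc or the standard library. The sketch assumes unsigned decimal digits only (no sign or leading whitespace, unlike strtol) and skips any byte that is not '+', as a placeholder for the remaining tokens. It indexes input.as_bytes() directly, and relies on the fact that len() on a &str or a slice reads a stored length rather than scanning for a terminator.

```rust
/// Hypothetical helper: parse the leading ASCII digits of `bytes`.
/// Returns the value and the number of bytes consumed, or None when
/// there is no leading digit or the value overflows i64.
fn parse_leading_int(bytes: &[u8]) -> Option<(i64, usize)> {
    let mut value: i64 = 0;
    let mut len = 0;
    while let Some(&b) = bytes.get(len) {
        if !b.is_ascii_digit() {
            break;
        }
        // checked_* returns None on overflow, which `?` propagates.
        value = value.checked_mul(10)?.checked_add(i64::from(b - b'0'))?;
        len += 1;
    }
    if len == 0 {
        None
    } else {
        Some((value, len))
    }
}

/// Hypothetical driver: collects one "add" line per `+<number>` token.
fn emit_adds(input: &str) -> Vec<String> {
    let bytes = input.as_bytes(); // no copy, just a view of the same memory
    let mut out = Vec::new();
    let mut index = 0;
    // bytes.len() is a stored length, not a scan for a terminator.
    while index < bytes.len() {
        if bytes[index] == b'+' {
            index += 1;
            let (n, used) = parse_leading_int(&bytes[index..])
                .unwrap_or_else(|| panic!("expecting a number after '+' at index {}", index));
            out.push(format!(" add ${}, %rax", n));
            index += used; // advance past the digits just parsed
            continue;
        }
        index += 1; // placeholder: other tokens would be handled here
    }
    out
}
```

Calling emit_adds("+12+3") yields the two lines " add $12, %rax" and " add $3, %rax". Working on bytes avoids the char-boundary concern as long as every token that is matched is ASCII.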