I'm going through The Elements of Computing Systems right now. In chapter six, the task is to write an assembler in the language of your choice. I'm doing it in Rust.
I already have a (horribly hacky) parser for the specified assembly language, but I'd like to do it a little more properly. To that end, I thought I'd use the nom crate to build the parser (I realize that this is probably massive overkill for writing an assembler, but I want to learn to do things properly in Rust while I'm at it). Here's where I'm at: Every command instruction in the assembly language contains one of eight jump strings, specifying whether to perform a jump if the output of the command is less than, equal to, or greater than zero. For instance, "JNE" means "jump if the output is not equal to zero (i.e. either less than or greater than)".
I would like to parse the eight different strings into a struct
Jump { ng: bool, zr: bool, ps: bool }
and I can basically see two ways to accomplish this using nom.
1. Write a parsing function, i.e. a function jump(input: &str) -> IResult<&str, Jump>, by hand. This would mean that I have to implement the error handling myself.
2. Write a separate parser for each of the eight strings and combine them, all using combinators.
Which of these options should I choose? Are there alternatives that I'm missing altogether? Am I completely on the wrong track with this?
Does each jump instruction in your assembly language start with J? If no other instructions do, you could start by recognizing "J" with the tag!() macro and then use alt!() for the other two characters (assuming three-letter codes).
Could you post some more info about the jump instructions in your language?
Sure. They are JLT, JGT, JEQ, JLE, JGE, JNE, JMP, and the empty string. Each of them stands for some combination of "jump if previous computation was < 0, = 0, > 0".
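Since the set of mnemonics is this small, the mapping to the Jump struct can be written as a plain lookup. The following is a minimal sketch in plain Rust without nom, only to illustrate the table; the function name parse_jump and the derived traits are assumptions, and the flag assignments follow the "< 0, = 0, > 0" reading described above.

```rust
/// Which conditions trigger a jump: output < 0 (ng), == 0 (zr), > 0 (ps).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Jump {
    pub ng: bool,
    pub zr: bool,
    pub ps: bool,
}

/// Map a jump mnemonic to its flags (hypothetical helper, not part of nom).
/// Returns `None` for anything that is not one of the eight known strings.
pub fn parse_jump(mnemonic: &str) -> Option<Jump> {
    let (ng, zr, ps) = match mnemonic {
        "" => (false, false, false), // no jump
        "JGT" => (false, false, true),
        "JEQ" => (false, true, false),
        "JGE" => (false, true, true),
        "JLT" => (true, false, false),
        "JNE" => (true, false, true),
        "JLE" => (true, true, false),
        "JMP" => (true, true, true), // unconditional jump
        _ => return None,
    };
    Some(Jump { ng, zr, ps })
}
```

A function like this could also serve as the body of a hand-written nom parser, with the None case translated into whatever error type nom expects.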
I actually have another question. But first, I need to explain the assembly syntax a little more. Every instruction is of the form <dest>=<command>;<jump>, where each of <dest>, <command>, <jump> has one of finitely many possible values. The ones for <jump> are the ones I was talking about above. <command> is what computation to perform, <dest> is where to save the result of <command>, and <jump> is whether we should jump, depending on the result of <command>.
So I can create parsers for each of these parts separately, and I expect that combining them won't be much of a problem. There's just one difficulty: both <dest>= and ;<jump> are actually optional, but <command> is mandatory. Missing <dest> and <jump> should be treated as "don't save anywhere" and "don't jump", respectively.
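To make the optional parts concrete, here is a rough sketch in plain Rust (not nom) that only splits a line into its three raw fields. The names RawInstruction and split_instruction are made up for illustration, and the sketch assumes the line has already been stripped of whitespace and comments; it does not validate the individual fields.

```rust
/// The three raw fields of an instruction; `dest` and `jump` may be absent.
#[derive(Debug, PartialEq, Eq)]
pub struct RawInstruction<'a> {
    pub dest: Option<&'a str>,
    pub command: &'a str,
    pub jump: Option<&'a str>,
}

/// Split `<dest>=<command>;<jump>` where `<dest>=` and `;<jump>` are optional.
/// Returns `None` if the mandatory `<command>` part is empty.
pub fn split_instruction(line: &str) -> Option<RawInstruction<'_>> {
    // Peel off the optional `<dest>=` prefix.
    let (dest, rest) = match line.split_once('=') {
        Some((d, r)) => (Some(d), r),
        None => (None, line),
    };
    // Peel off the optional `;<jump>` suffix.
    let (command, jump) = match rest.split_once(';') {
        Some((c, j)) => (c, Some(j)),
        None => (rest, None),
    };
    if command.is_empty() {
        return None;
    }
    Some(RawInstruction { dest, command, jump })
}
```

With this shape, a missing dest or jump shows up as None, which can then be mapped to "don't save anywhere" and "don't jump".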
So my question is, assuming I know how to parse <jump>, how do I parse ;<jump> optionally?
In general, nom contains a lot of useful macros for combining small, simple parsers into more powerful ones, even if their documentation can be a bit cryptic sometimes.
But this parser would return None if there's no semicolon or if the jump mnemonic can't be parsed, right? If so, then it can't distinguish between the legitimate case of there being no semicolon + jump mnemonic at all and the error case of there being a semicolon and an invalid jump mnemonic.
Yes, kind of. If there is a semicolon which is not followed by a correct jump mnemonic, the opt_jump_parser will return None and leave the semicolon to be parsed by whatever comes next.
You can either consider this OK, since the next level out will look for a dest, a command, and an opt_jump_parser followed by some kind of end marker (end-of-line? Or can an end-of-line appear anywhere?), so a leftover semicolon will make that outer parser fail.
Or you can create a parser that checks for a semicolon, and if it finds one requires a jump_parser. I think the easiest way to make that is probably with the switch!(...) macro.
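The logic of that second approach, independent of nom, might look roughly like the sketch below. It is written in plain Rust rather than with switch!(...), the name parse_jump_suffix and the String error type are assumptions, and it assumes the input is whatever remains of the line after the command.

```rust
/// Which conditions trigger a jump: output < 0 (ng), == 0 (zr), > 0 (ps).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Jump {
    pub ng: bool,
    pub zr: bool,
    pub ps: bool,
}

/// Parse the rest of the line after `<command>` (hypothetical helper).
/// - empty tail            -> Ok, "don't jump"
/// - `;` + valid mnemonic  -> Ok, the corresponding flags
/// - anything else         -> Err, so a bad mnemonic is not silently ignored
pub fn parse_jump_suffix(tail: &str) -> Result<Jump, String> {
    let mnemonic = match tail.strip_prefix(';') {
        Some(m) => m,
        // No semicolon and nothing left: the jump part was legitimately omitted.
        None if tail.is_empty() => return Ok(Jump { ng: false, zr: false, ps: false }),
        None => return Err(format!("unexpected input after command: {:?}", tail)),
    };
    // A semicolon was seen, so a valid mnemonic is now mandatory.
    let (ng, zr, ps) = match mnemonic {
        "JGT" => (false, false, true),
        "JEQ" => (false, true, false),
        "JGE" => (false, true, true),
        "JLT" => (true, false, false),
        "JNE" => (true, false, true),
        "JLE" => (true, true, false),
        "JMP" => (true, true, true),
        _ => return Err(format!("invalid jump mnemonic: {:?}", mnemonic)),
    };
    Ok(Jump { ng, zr, ps })
}
```

The point is that "no semicolon" and "semicolon plus garbage" end up in different branches, so the caller can tell the legitimate case from the error case.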
Maybe I'm not understanding something, but the behavior of parse_to! doesn't seem to agree with its documentation (and also doesn't really make sense).
The documentation says that it consumes its input. This is also what I would have expected. But running
let s = "1234";
let r: nom::IResult<&str, u16> = parse_to!(s, u16);
println!("{:?}", r);
outputs
Done("1234", 1234)
i.e., the input hasn't actually been consumed. What am I missing?