Writing a parser with nom


#1

I’m going through The Elements of Computing Systems right now. In chapter six, the task is to write an assembler in the language of your choice. I’m doing it in Rust.

I already have a (horribly hacky) parser for the specified assembly language, but I’d like to do it a little more properly. To that end, I thought I’d use the nom crate to build the parser. (I realize that this is probably massive overkill for writing an assembler, but I want to learn to do things properly in Rust while I’m at it.)

Here’s where I’m at: every command instruction in the assembly language contains one of eight jump strings, specifying whether to perform a jump if the output of the command is less than, equal to, or greater than zero. For instance, “JNE” means “jump if the output is not equal to zero (i.e. either less than or greater than)”.

I would like to parse the eight different strings into a struct:

Jump { ng: bool, zr: bool, ps: bool }

and I can basically see two ways to accomplish this using nom.

  1. Write a parsing function, i.e. a function jump(input: &str) -> IResult<&str, Jump>, by hand. This would mean that I have to implement the error handling myself.

  2. Write a separate parser for each of the eight strings and combine them, all using combinators.

Which of these options should I choose? Are there alternatives that I’m missing altogether? Am I completely on the wrong track with this?
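
For reference, option 1 might look roughly like the sketch below. It is written without nom so that it compiles on its own: ParseResult and ParseError are hypothetical stand-ins for nom's IResult and its error type, not nom's real definitions, and the bit patterns assume the usual meaning of each mnemonic (ng = "less than zero", zr = "equal to zero", ps = "greater than zero").

```rust
// Hypothetical stand-in for nom's IResult, so the sketch needs no crate.
type ParseResult<'a, T> = Result<(&'a str, T), ParseError<'a>>;

#[derive(Debug, PartialEq)]
struct ParseError<'a> {
    // The input at the point where parsing failed.
    rest: &'a str,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Jump {
    ng: bool,
    zr: bool,
    ps: bool,
}

// Option 1: recognize the mnemonic by hand and report failure ourselves.
fn jump(input: &str) -> ParseResult<'_, Jump> {
    // Every non-empty mnemonic is exactly three ASCII letters.
    let head = match input.get(..3) {
        Some(head) => head,
        None => return Err(ParseError { rest: input }),
    };
    let (ng, zr, ps) = match head {
        "JGT" => (false, false, true),
        "JEQ" => (false, true, false),
        "JGE" => (false, true, true),
        "JLT" => (true, false, false),
        "JNE" => (true, false, true),
        "JLE" => (true, true, false),
        "JMP" => (true, true, true),
        _ => return Err(ParseError { rest: input }),
    };
    // Hand back the unconsumed remainder together with the parsed value.
    Ok((&input[3..], Jump { ng, zr, ps }))
}
```

The error handling here is deliberately minimal: it only records where parsing stopped, which is the part that would have to be written by hand under option 1.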


#2

Does each jump instruction in your assembly language start with J? If no other instructions do, you could start by recognizing “J” with the tag!() macro and then use alt!() for the other two characters (assuming three-letter codes).

Could you post some more info about the jump instructions in your language?


#3

Sure. They are JLT, JGT, JEQ, JLE, JGE, JNE, JMP, and the empty string. Each of them stands for some combination of “jump if previous computation was < 0, = 0, > 0”.

My best attempt at this so far looks like this:

named! {
    jump_parser<&str, Jump>,
    map!(
        alt!(
            tag!("JGT") | tag!("JEQ") | tag!("JGE") | tag!("JLT") | tag!("JNE") | tag!("JLE") | tag!("JMP")
        ),
        |input| match input {
            "JGT" => Jump::new(false, false, true),
            "JEQ" => Jump::new(false, true, false),
            "JGE" => Jump::new(false, true, true),
            "JLT" => Jump::new(true, false, false),
            "JNE" => Jump::new(true, false, true),
            "JLE" => Jump::new(true, true, false),
            "JMP" => Jump::new(true, true, true),
            _ => panic!()
        }
    )
}

This works, but I do wonder if there is a more elegant way to do it.


#4

The value!(..) macro allows you to invert that parser into something like:

named! {
    jump_parser<&str, Jump>,
    alt!(
        value!(Jump::new(false, false, true), tag!("JGT"))
      | value!(Jump::new(false, true, false), tag!("JEQ"))
      | ...
    )
}

That way, you at least only have to mention each mnemonic once, and you don’t have to have a (hopefully unreachable) panic! in the code.
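
The same "mention each mnemonic once" idea can be sketched without nom as a lookup table. This is only an illustration of the design, not what the macros expand to; the JUMPS table and the Option-returning jump function are made up for the example, and Jump::new is assumed to take (ng, zr, ps) as in the snippets above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Jump {
    ng: bool,
    zr: bool,
    ps: bool,
}

impl Jump {
    // Assumed argument order: (ng, zr, ps).
    const fn new(ng: bool, zr: bool, ps: bool) -> Self {
        Jump { ng, zr, ps }
    }
}

// Each mnemonic appears exactly once, next to the value it produces.
const JUMPS: [(&str, Jump); 7] = [
    ("JGT", Jump::new(false, false, true)),
    ("JEQ", Jump::new(false, true, false)),
    ("JGE", Jump::new(false, true, true)),
    ("JLT", Jump::new(true, false, false)),
    ("JNE", Jump::new(true, false, true)),
    ("JLE", Jump::new(true, true, false)),
    ("JMP", Jump::new(true, true, true)),
];

// Tries each alternative in turn; returns the remaining input and the value
// of the first mnemonic that matches, or None if none does.
fn jump(input: &str) -> Option<(&str, Jump)> {
    JUMPS.iter().find_map(|&(mnemonic, value)| {
        input.strip_prefix(mnemonic).map(|rest| (rest, value))
    })
}
```

Because no mnemonic is a prefix of another, the order of the table entries does not matter here, and there is no fallback arm that could panic.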


#5

Wow, that’s exactly what I was looking for! Thanks so much :slight_smile:


#6

I actually have another question. But first, I need to explain the assembly syntax a little more. Every instruction is of the form <dest>=<command>;<jump>, where each of <dest>, <command>, <jump> has one of finitely many possible values. The ones for <jump> are the ones I was talking about above. <command> is what computation to perform, <dest> is where to save the result of <command>, and <jump> is whether we should jump, depending on the result of <command>.

So I can create parsers for each of these parts separately, and I expect that combining them won’t be much of a problem. There’s just one difficulty: both <dest>= and ;<jump> are actually optional, but <command> is mandatory. Missing <dest> and <jump> should be treated as “don’t save anywhere” and “don’t jump”, respectively.

So my question is, assuming I know how to parse <jump>, how do I parse ;<jump> optionally?
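
To make the shape of an instruction concrete, here is a small nom-free sketch that splits a line into its three parts. The function name split_instruction is made up for the example, the sample inputs are only illustrative, and the sketch assumes that neither "=" nor ";" can occur inside <dest>, <command>, or <jump>. It does not validate the parts.

```rust
// Splits "<dest>=<command>;<jump>" into its parts, where "<dest>=" and
// ";<jump>" may each be absent. The parts themselves are not checked.
fn split_instruction(line: &str) -> (Option<&str>, &str, Option<&str>) {
    // Everything before the first '=' is the destination, if there is one.
    let (dest, rest) = match line.split_once('=') {
        Some((dest, rest)) => (Some(dest), rest),
        None => (None, line),
    };
    // Everything after the first ';' is the jump, if there is one.
    let (command, jump) = match rest.split_once(';') {
        Some((command, jump)) => (command, Some(jump)),
        None => (rest, None),
    };
    (dest, command, jump)
}
```

A missing part comes back as None, which the caller can then map to “don’t save anywhere” or “don’t jump”.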


#7

The semicolon is present only if followed by a jump instruction? Something like this should work:

named! {
    opt_jump_parser<&str, Option<Jump>>,
    opt!(preceded!(tag!(";"), jump_parser))
}

In general, nom contains a lot of useful macros for combining small, simple parsers into more powerful ones, even if their documentation can be a bit cryptic sometimes.
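
As a rough illustration of what the opt!/preceded! combination does, here is a nom-free sketch. The helper names and the Option-based signatures are assumptions made for the example, and nom's Incomplete case is ignored entirely.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Jump {
    ng: bool,
    zr: bool,
    ps: bool,
}

// Plain-Rust stand-in for jump_parser: Some((rest, value)) on success.
fn jump(input: &str) -> Option<(&str, Jump)> {
    let (ng, zr, ps) = match input.get(..3)? {
        "JGT" => (false, false, true),
        "JEQ" => (false, true, false),
        "JGE" => (false, true, true),
        "JLT" => (true, false, false),
        "JNE" => (true, false, true),
        "JLE" => (true, true, false),
        "JMP" => (true, true, true),
        _ => return None,
    };
    Some((&input[3..], Jump { ng, zr, ps }))
}

// Mimics opt!(preceded!(tag!(";"), jump_parser)): it never fails, and when
// the inner parser does not match it returns the original input untouched.
fn opt_jump(input: &str) -> (&str, Option<Jump>) {
    match input.strip_prefix(';').and_then(jump) {
        Some((rest, value)) => (rest, Some(value)),
        None => (input, None),
    }
}
```

With this sketch, opt_jump(";JGT") consumes the whole input, while both opt_jump("") and opt_jump(";JXX") yield None and leave the input as it was.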


#8

But this parser would return None if there’s no semicolon or if the jump mnemonic can’t be parsed, right? If so, then it can’t distinguish between the legitimate case of there being no semicolon + jump mnemonic at all and the error case of there being a semicolon and an invalid jump mnemonic.


#9

Yes, kind of. If there is a semicolon which is not followed by a correct jump mnemonic, the opt_jump_parser will return None and leave the semicolon to be parsed by whatever comes next.

You can either consider this OK, since the next level out will look for a dest, a command, and an opt_jump_parser, followed by some kind of end marker (end-of-line? Or can an end-of-line appear anywhere?).

Or you can create a parser that checks for a semicolon, and if it finds one requires a jump_parser. I think the easiest way to make that is probably with the switch!(...) macro.


#10

What I have now is this:

named! {
    jump_opt<&str, Jump>,
    alt!(
        preceded!(tag!(";"), jump)
      | value!(Jump::new(false, false, false), not!(tag!(";")))
    )
}

I.e. if there’s a ;, return the result of the jump parser, or return the trivial Jump struct if there is no ;. I suppose that dest will work similarly.
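
The intended behavior can be spelled out in a nom-free sketch. The Result type, with the offending input as the error, is an assumption made for the example; nom reports errors differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Jump {
    ng: bool,
    zr: bool,
    ps: bool,
}

// Plain-Rust stand-in for the jump parser: Some((rest, value)) on success.
fn jump(input: &str) -> Option<(&str, Jump)> {
    let (ng, zr, ps) = match input.get(..3)? {
        "JGT" => (false, false, true),
        "JEQ" => (false, true, false),
        "JGE" => (false, true, true),
        "JLT" => (true, false, false),
        "JNE" => (true, false, true),
        "JLE" => (true, true, false),
        "JMP" => (true, true, true),
        _ => return None,
    };
    Some((&input[3..], Jump { ng, zr, ps }))
}

// If there is a ';', a valid mnemonic must follow; otherwise this is an
// error. Without a ';', nothing is consumed and "never jump" is returned.
fn jump_opt(input: &str) -> Result<(&str, Jump), &str> {
    match input.strip_prefix(';') {
        Some(after) => jump(after).ok_or(input),
        None => Ok((input, Jump { ng: false, zr: false, ps: false })),
    }
}
```

Unlike a plain optional parser, this version rejects ";JXX" instead of silently treating it as “no jump”.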


#11

Maybe I’m not understanding something, but the behavior of parse_to! doesn’t seem to agree with its documentation (and also doesn’t really make sense).

The documentation says that it consumes its input. This is also what I would have expected. But running

    let s = "1234";
    let r: nom::IResult<&str, u16> = parse_to!(s, u16);
    println!("{:?}", r);

outputs

Done("1234", 1234)

i.e., the input hasn’t actually been consumed. What am I missing?
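
To spell out what I expected, here is a plain-Rust sketch of a "consuming" version. It is not nom's implementation, and parse_u16 is a made-up name; it only shows the result shape I had in mind, with an empty remainder after a successful parse.

```rust
// Parses the whole input as a u16 and, on success, returns an empty
// remainder, i.e. the input counts as fully consumed.
fn parse_u16(input: &str) -> Option<(&str, u16)> {
    let value = input.parse::<u16>().ok()?;
    Some((&input[input.len()..], value))
}
```

With this, parse_u16("1234") gives Some(("", 1234)) rather than a result that still carries "1234" as remaining input.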