Writing an iterator for a parser

The title does not really capture the essence of my question, unfortunately.

I'll try to give some context and how I arrived at the current problem.

I have a pretty simple API that takes some raw bytes, parses them, and returns some information. For simplicity, I'll use this as an example:

#[derive(Debug, Copy, Clone)]
struct ParsedData {
    value: u32,
    size: usize,
}

#[derive(Debug, Copy, Clone, PartialEq)]
enum ParseError {
    InvalidSize,
    InvalidByte,
}

impl ParsedData {
    // If data.len() < 2, return Err(InvalidSize)
    // If data[0] is 0, return Err(InvalidByte)
    // If data[0] is even, return Ok(value: data[0] + data[1], size: 2)
    // If data[0] is odd, return Ok(value: data[0], size: 1)
    fn parse(data: &[u8]) -> Result<Self, ParseError> {
        let first = *data.get(0).ok_or(ParseError::InvalidSize)? as u32;
        let second = *data.get(1).ok_or(ParseError::InvalidSize)? as u32;

        if first == 0 {
            Err(ParseError::InvalidByte)
        } else {
            if first % 2 == 0 {
                Ok(Self { value: first + second, size: 2, })
            } else {
                Ok(Self { value: first, size: 1, })
            }
        }
    }
}

Since a single array of bytes could contain multiple valid data points of different sizes, with invalid bytes mixed and match between them, the following code pattern is repeated by multiple callers:

    let data = vec![0, 1, 2, 3, 4];
    let mut offset = 0;

    while offset < data.len() {
        match ParsedData::parse(&data[offset..]) {
            Ok(parsed_data) => {
                println!("offset: {:#x} data: {:?}", offset, parsed_data);
                offset += parsed_data.size;
            },
            Err(e) => {
                println!("offset: {:#x} error: {:?}", offset, e);
                offset += 1;
            }
        }
    }

This is quite error prone and I'd like to avoid that. My first idea was to make a Parser structure:

struct Parser<'a> {
    data: &'a [u8],
    offset: usize,
}

impl<'a> Parser<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, offset: 0}
    }

    fn parse_next(&mut self) -> Result<ParsedData, ParseError> {
        let result = ParsedData::parse(&self.data[self.offset..]);
        match result {
            Ok(parsed_data) => self.offset += parsed_data.size,
            Err(_) => self.offset += 1,
        };

        result
    }
}

This makes things a little easier, but not by much:

    let data = vec![0, 1, 2, 3, 4];
    let mut parser = Parser::new(&data);
    loop {
        match parser.parse_next() {
            Ok(parsed_data) => println!("data: {:?}", parsed_data),
            Err(ParseError::InvalidSize) => break,
            Err(ParseError::InvalidByte) => (),
        }
    }

This is when I realized that what I want is an Iterator, so my next step was this:

impl<'a> Iterator for Parser<'a> {
    type Item = Result<ParsedData, ParseError>;
    
    fn next(&mut self) -> Option<Self::Item> {
        let result = self.parse_next();
        if result.is_err() && result.err().unwrap() == ParseError::InvalidSize {
            None
        } else {
            Some(result)
        }
    }
}

This means that users can now greatly simplify their code:

    let data = vec![0, 1, 2, 3, 4];
    let parser = Parser::new(&data);
    for parsed_data in parser {
        println!("{:?}", parsed_data);
    }

However, there is one problem here: I no longer know what offset a parsed data structure comes from, and sometimes that information is important. It seems wrong to stuff that information into ParsedData because it has no purpose being there - it tells nothing about the parsed data, but about the origin of parsed data, and the origin is not relevant in all cases.

Basically it looks like a need something like std::iter::enumerate(), that returns an offset instead of an iteration count. I can sort of do that by wrapping ParsedData into another structure that also holds the offset:

struct ParsedDataIter {
    parsed_data: ParsedData,
    offset: usize,
}

Is there a better solution than this? Also, how do I decide if I should implement Iterator or IntoIterator? Is it a bad idea to have an iterator that jumps through the collection at a random pace as the one I have above does? It feels like it can break some assumptions.

Personally, I would probably just add an offset field to ParsedData. The offset is an attribute of the parsed data, and I doubt there's any real cost to including it in the cases where it's unused.

It is not really an attribute of the parsed data because I can have two different byte arrays that will output the same ParsedData from different offsets, so comparing two ParsedData structures for equality becomes a bit harder, but I could implement PartialEq myself in this case. It surely feels better than having a wrapper structure.

You could consider making ParsedData an enum, so that the offset is only stored in variants where it is meaningful.

If you really want to keep offset separate from ParsedData, I'd suggest just returning a pair (usize, ParsedData) instead of a custom struct ParsedDataIter.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.