Interface design advice to avoid extra parsing work

I'm working on a parser for a binary file format that is basically a stream of events. Each time the parser encounters an event, it builds a struct for that event's data and passes it to a callback function.

pub trait EventHandler {
  pub fn event_a(&mut self, a: EventA) {}
  pub fn event_b(&mut self, b: EventB) {}
  //...
}
pub struct MyHandler;
impl EventHandler for MyHandler {
  pub fn event_a(&mut self, a: EventA) { println!("{:?}", a); }
  // ignores event_b
}
fn main() {
  let mut buf = read_file("path/to/file").unwrap();
  let mut handler = MyHandler;
  parse(&mut buf, &mut handler);
}

The problem with this setup is that the parse function in this example wastes time building EventB despite the handler never needing it. If the parser could "know" the EventHandler doesn't need EventB it could easily skip over each occurrence to save time. What's a good way to design the interface to make this possible? Is this a textbook case for lazy parsing?

A couple off-the-cuff thoughts:

  • Add a method to the trait to check whether it cares. For example, you could use
pub trait EventHandler {
  pub fn cares_about_event_a(&self) -> bool { false }
  pub fn event_a(&mut self, a: EventA) { unimplemented!() }
  ...
}

and the consumer of the trait needs to check the "do they care?" function before they call the event function.

  • Change the trait to return the event handler, maybe something like
pub trait EventHandlerProvider {
  pub fn event_a_handler(&mut self) -> Option<impl FnOnce<EventA>> { None }
  ...
}

(Though that specific form with the RPIT-in-trait doesn't actually exist right now, so it'd need a workaround for a while, but you could do something inspired by that.)

  • If getting the details is delayable, then you could try something like you brought up,
pub trait EventHandler {
  pub fn event_a<H: FnOnce() -> EventA>(&mut self, _handler: H) {}
  ...
}
  • You could use a registration system of some sort rather than having them all on one trait. Like let people register Box<dyn FnMut(EventA)>s with the parser and call them all when you see such an event -- and you'd know if there aren't any so you could skip getting the details.

  • You could make EventA be a thin wrapper around a &[u8] into the buffer so it's trivial to construct, but the methods on the struct would pull things out of the serialized stream on demand.

  • You could decide that it's probably so cheap to deserialize a flat event from a binary format that you might not need to worry.

5 Likes

Since this seems like the most general solution, I outlined an (inefficient) prototype here.

1 Like

From this, I'm assuming that the binary format is structured such that after determining the event type, you either know the event payload size or can easily read it from header information, in order to skip over it.

If there's structure to the event payload which you're saving time by skipping over parsing, though, that means there's potentially ill-formed structure that you're skipping over without validating. Generally, it's probably good practice to validate the structure of the data is good, even if you're not processing it any further. Once you've done the validation, actually putting references into an Event structure should be a trivial amount of work, and perhaps even work that the optimizer can cut out if the code is monomorphized (generic not using dyn).

If the binary format is structured such that you can't use a zero-copy view into the buffer to pass the validated payload to the event handler, that's an unfortunately poorly designed format. (Zero-copy doesn't actually mean zero copies, though, to be clear; it means zero large/heap copies/allocations, where variable-length data is used directly from the deserialization buffer rather than a separate heap copy. Things like fixed, constant size structure absolutely gets copied into structured data.)

3 Likes

An approach the wasmparser crate uses is to have a top-level parser which splits the WebAssembly file up by section and gives you an iterator of "payloads", where the wasmparser::Payload is an enum that might contain the data in-line (e.g. if it's something cheap like the version number) or will give you sub-readers for parsing that section.

That way you can defer the extra parsing work until it is needed.

For example,

struct Parser<'buffer> {
  buffer: &'buffer [u8],
}

impl<'buffer> Parser<'buffer> {
  fn parse_all(self) -> impl Iterator<Item = Result<Payload<'buffer>, Error>> {
    todo!();
  }
}

enum Payload<'buffer> {
  Version {
    num: u32,
    encoding: Encoding,
    range: Range<usize>,
  },
  TypeSection(TypeSectionReader<'buffer>),
  ImportSection(ImportSectionReader<'buffer>),
  FunctionSection(FunctionSectionReader<'buffer>),
  ...
}
6 Likes