Make this iteration more idiomatic, and fast?

I've implemented a crate for reading MPEG Transport Streams.

In my benchmarking, the performance hotspot has usually been the packet-reading loop in Demultiplex::push(). So, I 'optimised' it:

https://github.com/dholroyd/mpeg2ts-reader/blob/ca64aedbb01c915df5a98f46a384ddd22de01841/src/demultiplex.rs#L513-L561

That's rather ugly! It's the piece of the codebase I'm most unhappy with at the moment.

I've made more than one attempt to refactor this using to iterators. For example,

    pub fn push(&mut self, ctx: &mut Ctx, buf: &[u8]) {
        for pk in buf.chunks(packet::PACKET_SIZE).map(packet::Packet::new) {
            let this_pid = pk.pid();
            {
                if !self.processor_by_pid.contains(this_pid) {
                    let filter = ctx.filter_constructor().construct(FilterRequest::ByPid(this_pid));
                    self.processor_by_pid.insert(this_pid, filter);
                };
                let this_proc = self.processor_by_pid.get(this_pid).unwrap();
                this_proc.consume(ctx, &pk);
            }
            if !ctx.filter_changeset().is_empty() {
                ctx.filter_changeset().apply(&mut self.processor_by_pid);
            }
            debug_assert!(ctx.filter_changeset().is_empty());
        }
    }

However the microbenchmark in the crate takes a 15-20% performance hit.

FWIW note the above refactored version also,

  • Lacks the error handling for incorrect SYNC_BYTE values
  • Doesn't attempt to optimise for the case were multiple packets in a row have the same PID (and therefore use the same handler)

Any thoughts on how I can make the real code prettier, without a slowdown?

(Persuasive arguments that the benchmark itself is wrong would also be useful!)