How would I abstract over borrowed bytes, files, and arbitrary readers?

I'd like to write a parser which can work on bytes (&'buf [u8]), a file (special-cased because we can do nice things like mmap), and some sort of reader (roughly R: std::io::Read + std::io::Seek) in such a way that parsing can be zero-copy and point directly into the &'buf [u8] buffer when possible.

Otherwise it would fall back to doing buffered IO which uses something like bytes::Bytes to allow reusing parts of a previous read (imagine we're accessing parts of a large file from S3, so while we can easily "seek" back and forth to access just the bits we care about, each read might involve a network call).

The construction process would look fairly typical:

struct Parser<R> { ... }

impl<R: Read + Seek> Parser<R> {
  fn from_reader(reader: R) -> Self { ... }
} 

impl Parser<File> {
  fn from_file(path: impl AsRef<Path>) -> Result<Self, Error> { ... }
} 

impl<'buf> Parser<Cursor<&'buf [u8]>> {
  fn in_memory(buffer: &'buf [u8]) -> Self { ... }
} 

However, I'm struggling to figure out how I would abstract over this "maybe borrowed, maybe ref-counted" aspect while parsing.

I was thinking maybe something like this, but I'm not really sure how best to set up the lifetimes.

impl<R: MyReaderTrait> Parser<R> {
  fn parse(&mut self) -> Result<Document<'???>, Error> { ... }
}

trait MyReaderTrait: std::io::Seek {
  /// Try to read some bytes from the reader. 
  ///
  /// The actual number of bytes returned is decided by the underlying
  /// reader and may vary depending on (for example) what it has
  /// already buffered at the current location or some user-configurable
  /// "chunk size" parameter, or whatever.
  fn read(&mut self) -> Result<MaybeBorrowed<'???>, Error>;
}

struct Document<'a> {
  some_binary_field: MaybeBorrowed<'a>,
  integer: u64,
  ...
}

enum MaybeBorrowed<'a> {
  Borrowed(&'a [u8]),
  RefCounted(bytes::Bytes),
}

The result would be that parsing a &'buf [u8] gives us a Document<'buf>, while parsing a blob on S3 gives us a Document<'static> whose bytes are buffered bytes::Bytes. Parsing a memory-mapped file would give you some sort of Document<'self>, where 'self is the lifetime of the Mmap object inside the parser.

I think I am missing something really important.

  1. I don't understand how seek is useful in the context of parsing. (Most parsers seem to have finite lookahead / lookback, unless you go with packrat parsing, which I believe requires keeping all of the input in memory.)

  2. It sounds like you want a parser that does different things depending on whether the input implements Seek, but for some reason you want one type instead of two types?

If you're abstracting over "a type which might be borrowed or might be owned" the lifetime is going to have to be limited by the borrowing case. The fact that you might sometimes get a MaybeBorrowed<'static> is opaque to code that has to deal with both cases.

fn read(&mut self) -> Result<MaybeBorrowed<'_>, Error> will obviously limit what you can do when you use the trait, but you can always match on the enum and get the owned value out as an optimization if you need to.
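To make the "match on the enum and get the owned value out" idea concrete, here is a minimal sketch. It assumes Arc<[u8]> as a dependency-free stand-in for bytes::Bytes, and the helpers as_bytes and into_static are hypothetical names, not something from the thread.

```rust
use std::sync::Arc;

/// Same shape as the `MaybeBorrowed` enum in the question.
/// Assumption: `Arc<[u8]>` stands in for `bytes::Bytes` so the sketch
/// only needs the standard library.
#[derive(Debug, Clone)]
enum MaybeBorrowed<'a> {
    Borrowed(&'a [u8]),
    RefCounted(Arc<[u8]>),
}

impl<'a> MaybeBorrowed<'a> {
    /// View the bytes regardless of which variant we hold.
    fn as_bytes(&self) -> &[u8] {
        match self {
            MaybeBorrowed::Borrowed(b) => *b,
            MaybeBorrowed::RefCounted(rc) => &rc[..],
        }
    }

    /// Hypothetical helper: detach the value from the borrow.
    /// The ref-counted case is moved out as-is (no copy); only the
    /// borrowed case pays for a copy into a new allocation.
    fn into_static(self) -> MaybeBorrowed<'static> {
        match self {
            MaybeBorrowed::Borrowed(b) => MaybeBorrowed::RefCounted(Arc::from(b)),
            MaybeBorrowed::RefCounted(rc) => MaybeBorrowed::RefCounted(rc),
        }
    }
}
```

With something like this, code that receives a short-lived MaybeBorrowed<'_> can call into_static() when it needs to hold on to the bytes, at the cost of a copy only in the borrowed case.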


The Seek part is important because it lets you skip parts of the file you don't care about.

Imagine a binary format where different chunks are specified as offsets into the file - if you can seek to instantaneously skip to the chunk you want, you can avoid reading potentially gigabytes of data that would just be thrown away.
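As a rough illustration of that kind of format, here is a sketch with an invented layout: a 16-byte header holding a little-endian u64 offset and a little-endian u64 length, with the payload stored at that offset. The layout and the read_chunk name are assumptions made up for this example.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical layout (not from the thread):
///   bytes 0..8   offset of the chunk, u64 little-endian
///   bytes 8..16  length of the chunk, u64 little-endian
///   ...          anything else, which we never read
fn read_chunk<R: Read + Seek>(reader: &mut R) -> io::Result<Vec<u8>> {
    reader.seek(SeekFrom::Start(0))?;
    let mut header = [0u8; 16];
    reader.read_exact(&mut header)?;

    let mut offset = [0u8; 8];
    offset.copy_from_slice(&header[..8]);
    let mut len = [0u8; 8];
    len.copy_from_slice(&header[8..]);
    let offset = u64::from_le_bytes(offset);
    let len = u64::from_le_bytes(len);

    // Jump straight to the chunk; the bytes in between are skipped.
    reader.seek(SeekFrom::Start(offset))?;
    // A real parser should validate `len` before allocating.
    let mut chunk = vec![0u8; len as usize];
    reader.read_exact(&mut chunk)?;
    Ok(chunk)
}
```

The same function works on a File, a Cursor over a byte slice, or any other Read + Seek source, which is why the seek bound matters here.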

No, everything the Parser reads from would need to be seekable. The Parser::in_memory() constructor wraps the &[u8] in a std::io::Cursor so we can jump back and forward in the buffer.

Yeah, having the lifetime inferred would mean it's tied to the read() call, meaning the parsed Document can't ever reuse bytes in practice - it would always need to make owned copies.

The moment you need to read more bytes, you would get a "can't borrow reader as mutable because it is currently immutably borrowed by the document" error.

Okay I think I misunderstood what you were aiming for. It sounds like you want the lifetime to be on the trait itself.

impl<'r, R: MyReaderTrait<'r>> Parser<R> {
    fn parse(&mut self) -> Result<Document<'r>, Error> {
        todo!()
    }
}

trait MyReaderTrait<'r>: std::io::Seek {
    fn read(&mut self) -> Result<MaybeBorrowed<'r>, Error>;
}

Did that not work for you?
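For what it's worth, here is a small sketch of the lifetime-on-the-trait version for the in-memory case. It assumes Arc<[u8]> in place of bytes::Bytes, uses io::Result instead of the thread's Error type, and invents a fixed CHUNK size and a first_two_chunks helper purely for illustration.

```rust
use std::io::{self, Cursor, Seek};
use std::sync::Arc;

#[allow(dead_code)]
enum MaybeBorrowed<'a> {
    Borrowed(&'a [u8]),
    RefCounted(Arc<[u8]>), // stand-in for bytes::Bytes (assumption)
}

trait MyReaderTrait<'r>: Seek {
    fn read(&mut self) -> io::Result<MaybeBorrowed<'r>>;
}

/// Hypothetical fixed chunk size for the sketch.
const CHUNK: usize = 4;

impl<'r> MyReaderTrait<'r> for Cursor<&'r [u8]> {
    fn read(&mut self) -> io::Result<MaybeBorrowed<'r>> {
        // Copy the inner `&'r [u8]` out, so the returned slice borrows
        // from the original buffer rather than from `self`.
        let buf: &'r [u8] = *self.get_ref();
        let start = (self.position() as usize).min(buf.len());
        let end = (start + CHUNK).min(buf.len());
        self.set_position(end as u64);
        Ok(MaybeBorrowed::Borrowed(&buf[start..end]))
    }
}

/// Two reads whose results are alive at the same time. This compiles
/// because the results are tied to 'r, not to the `&mut` borrow of `reader`.
fn first_two_chunks<'r, R: MyReaderTrait<'r>>(
    reader: &mut R,
) -> io::Result<(MaybeBorrowed<'r>, MaybeBorrowed<'r>)> {
    let first = reader.read()?;
    let second = reader.read()?;
    Ok((first, second))
}
```

A reader that hands out ref-counted buffers could implement MyReaderTrait<'static> in the same way. What this shape does not cover is a reader that lends slices out of its own internal buffer, since 'r is fixed from outside the read call.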

Perhaps you need a GAT, or a lifetime-GAT-emulating helper trait.

trait MyReaderTrait: for<'any> MyReadOnce<'any> {}
impl<T: ?Sized> MyReaderTrait for T
where
    T: for<'any> MyReadOnce<'any>
{}

trait MyReadOnce<'r>: Seek {
    type MaybeBorrowed /* : AnyBoundsYouNeed */ ;
    fn read(&'r mut self) -> Result<Self::MaybeBorrowed, Error>;
}

In which case, the borrow-from-reader implementations can be handled separately from the same-across-lifetimes implementations, so long as you handle the former one-by-one.

trait StaticReader: MyReaderTrait {
    type NotBorrowed;
}

impl<T: ?Sized, NB> StaticReader for T
where
    T: MyReaderTrait + for<'any> MyReadOnce<'any, MaybeBorrowed = NB>
{
    type NotBorrowed = NB;
}

impl<R: ?Sized + StaticReader> Parser<R> {
    fn parse(&mut self) -> Result<Document<R::NotBorrowed>, Error> {
        // ...
    }
}

impl<R: ?Sized + MyReaderTrait> Parser<R>
where
    R: for<'any> MyReadOnce<'any, MaybeBorrowed = &'any [u8]>
{
    fn parse(&mut self) -> Result<Document<Vec<u8>>, Error> {
        // ...
    }
}

Untested, but based around the topic of abstracting over borrowing and non-borrowing closures which comes up from time to time.
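Here is a cut-down sketch of the helper-trait pattern with two readers, in case it helps to see both cases side by side. It drops the Seek supertrait and the blanket MyReaderTrait alias to stay short, uses io::Result rather than the thread's Error type, and the names SliceReader, OwnedReader, two_chunks, copy_two, and CHUNK are all made up for this example.

```rust
use std::io;

/// Hypothetical fixed chunk size for the sketch.
const CHUNK: usize = 4;

trait MyReadOnce<'r> {
    type Chunk;
    fn read(&'r mut self) -> io::Result<Self::Chunk>;
}

/// Hands out slices of an external buffer: the chunk type is the same
/// for every 'r.
struct SliceReader<'buf> { buf: &'buf [u8], pos: usize }

impl<'r, 'buf> MyReadOnce<'r> for SliceReader<'buf> {
    type Chunk = &'buf [u8];
    fn read(&'r mut self) -> io::Result<Self::Chunk> {
        let buf: &'buf [u8] = self.buf;
        let start = self.pos.min(buf.len());
        let end = (start + CHUNK).min(buf.len());
        self.pos = end;
        Ok(&buf[start..end])
    }
}

/// Lends slices of its own buffer: the chunk type depends on 'r.
struct OwnedReader { data: Vec<u8>, pos: usize }

impl<'r> MyReadOnce<'r> for OwnedReader {
    type Chunk = &'r [u8];
    fn read(&'r mut self) -> io::Result<Self::Chunk> {
        let start = self.pos.min(self.data.len());
        let end = (start + CHUNK).min(self.data.len());
        self.pos = end;
        Ok(&self.data[start..end])
    }
}

/// "Same across lifetimes" readers: chunks can be kept across reads.
fn two_chunks<R, C>(reader: &mut R) -> io::Result<(C, C)>
where
    R: for<'any> MyReadOnce<'any, Chunk = C>,
{
    let first = reader.read()?;
    let second = reader.read()?;
    Ok((first, second))
}

/// "Borrow from reader" readers: each chunk is copied before the next read.
fn copy_two<R>(reader: &mut R) -> io::Result<(Vec<u8>, Vec<u8>)>
where
    R: for<'any> MyReadOnce<'any, Chunk = &'any [u8]>,
{
    let first = reader.read()?.to_vec();
    let second = reader.read()?.to_vec();
    Ok((first, second))
}
```

Passing an OwnedReader to two_chunks should be rejected by the compiler, because no single type C equals &'any [u8] for every 'any. That is the one-by-one handling of the borrowing implementations mentioned above.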

