I'd like to write a parser which can work on bytes (&'buf [u8]), a file (special-cased because we can do nice things like mmap), and some sort of reader (roughly R: std::io::Read + std::io::Seek) in such a way that parsing can be zero-copy and point directly into the &'buf [u8] buffer when possible.
Otherwise it would fall back to doing buffered IO which uses something like bytes::Bytes to allow reusing parts of a previous read (imagine we're accessing parts of a large file from S3, so while we can easily "seek" back and forth to access just the bits we care about, each read might involve a network call).
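To make the buffered fallback concrete, here is a minimal sketch of the kind of reuse I have in mind. It only uses std, so `SharedSlice` is a hypothetical stand-in for `bytes::Bytes`, and `ChunkCache`, `read_at`, and the `fetches` counter are names made up for illustration rather than anything from an existing crate.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;
use std::sync::Arc;

/// A cheaply clonable view into a shared buffer
/// (std-only stand-in for `bytes::Bytes`).
#[derive(Clone, Debug)]
struct SharedSlice {
    data: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    fn as_slice(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}

/// Hypothetical wrapper that remembers the last chunk it fetched.
struct ChunkCache<R> {
    inner: R,
    chunk_size: usize,
    /// Stream offset of the cached chunk, plus its bytes.
    cached: Option<(u64, Arc<[u8]>)>,
    /// How many times we actually hit `inner` (illustration only).
    fetches: usize,
}

impl<R: Read + Seek> ChunkCache<R> {
    fn new(inner: R, chunk_size: usize) -> Self {
        ChunkCache { inner, chunk_size, cached: None, fetches: 0 }
    }

    /// Return up to `len` bytes starting at `offset`. If the cached chunk
    /// covers `offset`, the result shares its allocation and no read happens.
    /// Fewer than `len` bytes may come back if the chunk ends early.
    fn read_at(&mut self, offset: u64, len: usize) -> io::Result<SharedSlice> {
        let hit = matches!(
            &self.cached,
            Some((start, data))
                if offset >= *start && offset < *start + data.len() as u64
        );
        if !hit {
            // Cache miss: seek and pull in one chunk (or less at EOF).
            self.inner.seek(SeekFrom::Start(offset))?;
            let mut buf = Vec::with_capacity(self.chunk_size);
            self.inner
                .by_ref()
                .take(self.chunk_size as u64)
                .read_to_end(&mut buf)?;
            self.fetches += 1;
            self.cached = Some((offset, Arc::from(buf)));
        }
        let (start, data) = self.cached.as_ref().expect("cache was just filled");
        let begin = (offset - *start) as usize;
        let end = begin.saturating_add(len).min(data.len());
        Ok(SharedSlice { data: Arc::clone(data), range: begin..end })
    }
}
```

With a 32-byte chunk size, reading 4 bytes at offset 10 and then 4 bytes at offset 20 would only touch the underlying reader once, and both results point into the same allocation. A real version would presumably keep more than one chunk around.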
The construction process would look fairly typical:
struct Parser<R> { ... }

impl<R: Read + Seek> Parser<R> {
    fn from_reader(reader: R) -> Self { ... }
}

impl Parser<File> {
    fn from_file(path: impl AsRef<Path>) -> Result<Self, Error> { ... }
}

impl<'buf> Parser<Cursor<&'buf [u8]>> {
    fn in_memory(buffer: &'buf [u8]) -> Self { ... }
}
However, I'm struggling to figure out how I would abstract over this "maybe borrowed, maybe ref-counted" aspect while parsing.
I was thinking maybe something like this, but I'm not really sure how best to set up the lifetimes.
impl<R: MyReaderTrait> Parser<R> {
    fn parse(&mut self) -> Result<Document<'???>, Error> { ... }
}

trait MyReaderTrait: std::io::Seek {
    /// Try to read some bytes from the reader.
    ///
    /// The actual number of bytes returned is decided by the underlying
    /// reader and may vary depending on (for example) what it has
    /// already buffered at the current location or some user-configurable
    /// "chunk size" parameter, or whatever.
    fn read(&mut self) -> Result<MaybeBorrowed<'???>, Error>;
}

struct Document<'a> {
    some_binary_field: MaybeBorrowed<'a>,
    integer: u64,
    ...
}

enum MaybeBorrowed<'a> {
    Borrowed(&'a [u8]),
    RefCounted(bytes::Bytes),
}
The result would be that parsing a &'buf [u8] gives us a Document<'buf>, while parsing a blob on S3 would give us a Document<'static> where the bytes are buffered bytes::Bytes. Parsing a memory-mapped file would give you some sort of Document<'self>, where 'self is the lifetime of the Mmap object inside the parser.
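One direction that seems to get the first two cases working is to put the lifetime on the trait itself, so that the returned data is tied to a lifetime parameter rather than to the `&mut self` borrow. The following is only a sketch under some assumptions: `Source`, `SliceSource`, `ReaderSource`, and the free function `parse` are hypothetical names, `Arc<[u8]>` stands in for `bytes::Bytes` so that it builds with std alone, and the "format" (an 8-byte little-endian integer followed by a 4-byte field) is invented for the example.

```rust
use std::io::{self, Read};
use std::ops::Deref;
use std::sync::Arc;

#[derive(Clone, Debug)]
enum MaybeBorrowed<'a> {
    Borrowed(&'a [u8]),
    /// Std-only stand-in for `bytes::Bytes`.
    RefCounted(Arc<[u8]>),
}

impl<'a> Deref for MaybeBorrowed<'a> {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        match self {
            MaybeBorrowed::Borrowed(b) => b,
            MaybeBorrowed::RefCounted(rc) => rc,
        }
    }
}

/// Hypothetical trait: `'a` says how long the returned bytes may live,
/// independently of how long `&mut self` is borrowed.
trait Source<'a> {
    /// May return fewer than `len` bytes.
    fn read(&mut self, len: usize) -> io::Result<MaybeBorrowed<'a>>;
}

/// Zero-copy source over an in-memory buffer.
struct SliceSource<'buf> {
    buf: &'buf [u8],
    pos: usize,
}

impl<'buf> Source<'buf> for SliceSource<'buf> {
    fn read(&mut self, len: usize) -> io::Result<MaybeBorrowed<'buf>> {
        let buf: &'buf [u8] = self.buf; // copy the reference out of `self`
        let end = self.pos.saturating_add(len).min(buf.len());
        let chunk = &buf[self.pos..end];
        self.pos = end;
        Ok(MaybeBorrowed::Borrowed(chunk))
    }
}

/// Copying source over any reader; its data owns itself, hence `'static`.
struct ReaderSource<R> {
    inner: R,
}

impl<R: Read> Source<'static> for ReaderSource<R> {
    fn read(&mut self, len: usize) -> io::Result<MaybeBorrowed<'static>> {
        let mut buf = vec![0u8; len];
        let n = self.inner.read(&mut buf)?;
        buf.truncate(n);
        Ok(MaybeBorrowed::RefCounted(Arc::from(buf)))
    }
}

struct Document<'a> {
    some_binary_field: MaybeBorrowed<'a>,
    integer: u64,
}

/// Parse the made-up format: u64 (little-endian), then a 4-byte field.
fn parse<'a, S: Source<'a>>(src: &mut S) -> io::Result<Document<'a>> {
    let header = src.read(8)?;
    if header.len() != 8 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "short header"));
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&header);
    let some_binary_field = src.read(4)?;
    Ok(Document { some_binary_field, integer: u64::from_le_bytes(raw) })
}
```

With this shape, parsing from a `SliceSource<'buf>` yields a `Document<'buf>` that can outlive the source, and parsing from a `ReaderSource` yields a `Document<'static>`. The memory-mapped case does not fit as neatly, since Rust has no `'self` lifetime. One option, which is an assumption on my part, would be to keep the Mmap outside the parser and hand the parser a `&'buf [u8]` borrowed from it, which reduces it to the slice case.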