The following came up recently on discord and on twitter, and I was wondering if there is a way to solve this in stable Rust without introducing undefined behavior.
Context: We want to parse a huge file. That file contains many entries over which we want to iterate. Each entry has some metadata and some raw data. For simplicity, let's just assume that our entries are byte vectors encoded by an u8
specifying the number of bytes, and then the raw bytes right after. We are interested in single entries as a whole.
Example: The file with contents [1, 5, 0, 4, 3, 4, 5, 6]
contains the entries [5], [], [3, 4, 5, 6]
.
Approach: The parser should be efficient, hence we give it a buffer to avoid reallocations:
pub struct BufferedParser<Source> {
buffer: Vec<u8>,
source: Source, // the file we read from
}
To create a versatile parser, we want it to implement the Iterator
trait. This trait basically looks like this:
pub trait Iterator {
type Item;
fn next(&mut self) -> Option<Self::Item>;
}
Problem: We cannot simply implement Iterator
for BufferedParser<Source>
while having the returned Iterator::Item
borrow from the buffer, because we cannot relate the lifetimes of the argument of Iterator::next
and Iterator::Item
without GATs (I am aware that there are crates with lending iterator traits, but these are not compatible with the std lib Iterator
and IntoIterator
traits). To illustrate:
impl<'a, Source: Read + 'a> Iterator for &'a mut BufferedParser<Source> {
type Item = &'a [u8];
fn next<'this>(&'this mut self) -> Option<Self::Item> {
// update buffer...
Some(&self.buffer[x..y])
}
}
There is no way to make this compile, as the lifetimes 'this
and 'a
are not equal.
Solution: To solve this, I made the following attempt using unsafe Rust. Basically, we return the decomposition of the slice into pointers, wrapped into a type Entry
that counts the number of references to the buffer. When calling next
, we panic if there is still an Entry
in existence. For reference counting, I am abusing Rc
, and the Entry
type implements Deref
into a [u8]
to access the data.
use std::fmt::{Debug, Formatter};
use std::io::{ErrorKind, Read};
use std::ops::Deref;
use std::rc::Rc;
use std::{mem, slice};
pub struct BufferedParser<Source> {
buffer: Vec<u8>,
source: Source,
borrow_counter: Rc<()>,
}
#[derive(Clone)]
pub struct Entry {
offset: *const u8,
len: usize,
// I hope the compiler does not optimise this away...
#[allow(dead_code)]
borrow_counter: Rc<()>,
}
impl<Source: Read> Iterator for BufferedParser<Source> {
type Item = Entry;
fn next(&mut self) -> Option<Self::Item> {
// Making sure that all instances of Entry have been dropped.
let rc = mem::replace(&mut self.borrow_counter, Rc::new(()));
Rc::try_unwrap(rc).unwrap();
// A bunch of parsing code here, the more relevant part is at the end of this function.
#[allow(clippy::len_zero)]
if self.buffer.len() < 1 {
self.buffer.resize(1, 0);
}
match &self.source.read_exact(&mut self.buffer[0..1]) {
Ok(_) => {}
err @ Err(error) => {
if error.kind() == ErrorKind::UnexpectedEof {
return None;
} else {
err.as_ref().unwrap();
}
}
};
let length = u8::from_be_bytes(self.buffer[0..1].try_into().unwrap()) as usize;
if self.buffer.len() < length {
self.buffer.resize(length, 0);
}
self.source.read_exact(&mut self.buffer[0..length]).unwrap();
// Return the slice as a pointer and length, to escape it's lifetime.
let slice = &self.buffer[0..length];
let range = slice.as_ptr_range();
Some(Entry {
offset: range.start,
len: slice.len(),
borrow_counter: self.borrow_counter.clone(),
})
}
}
impl<Source> Drop for BufferedParser<Source> {
fn drop(&mut self) {
// Making sure that all instances of Entry have been dropped.
let rc = mem::replace(&mut self.borrow_counter, Rc::new(()));
Rc::try_unwrap(rc).unwrap();
}
}
impl Deref for Entry {
type Target = [u8];
fn deref(&self) -> &Self::Target {
unsafe { slice::from_raw_parts(self.offset, self.len) }
}
}
impl Debug for Entry {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
let slice: &[u8] = self.deref();
slice.fmt(f)
}
}
impl<Source> BufferedParser<Source> {
pub fn new(source: Source) -> Self {
Self {
buffer: Default::default(),
source,
borrow_counter: Default::default(),
}
}
}
fn main() {
let buffer = [1, 5, 0, 4, 3, 4, 5, 6];
let parser = BufferedParser::new(buffer.as_slice());
for slice in parser {
println!("slice: {slice:?}");
}
}
Is the usage of slice::from_raw_parts
sound in this example? And is this principle sound in general, assuming noone tampers with the borrow_counter
fields? Is there a better way (more efficient or more idiomatic) to solve this problem in current stable Rust?
Edit: added a Drop
implementation to the solution, since otherwise BufferedParser
could be dropped before an Entry
, leaving a dangling pointer.