Borrowing iterator for file-parsing without lifetimes using unsafe Rust

The following came up recently on Discord and on Twitter, and I was wondering whether there is a way to solve it in stable Rust without introducing undefined behavior.

Context: We want to parse a huge file that contains many entries over which we want to iterate. Each entry has some metadata and some raw data. For simplicity, let's assume that our entries are byte vectors encoded as a u8 specifying the number of bytes, followed directly by the raw bytes. We are interested in single entries as a whole.

Example: The file with contents [1, 5, 0, 4, 3, 4, 5, 6] contains the entries [5], [], [3, 4, 5, 6].
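To make the encoding concrete, here is a minimal sketch, assuming the whole input is already available as an in-memory slice (the `decode` helper is ours, not part of the parser discussed below):

```rust
// Decode length-prefixed entries: each entry is one length byte
// followed by that many raw bytes. Panics on truncated input, for brevity.
fn decode(mut input: &[u8]) -> Vec<Vec<u8>> {
    let mut entries = Vec::new();
    while let Some((&len, rest)) = input.split_first() {
        let (entry, rest) = rest.split_at(len as usize);
        entries.push(entry.to_vec());
        input = rest;
    }
    entries
}

fn main() {
    let expected: Vec<Vec<u8>> = vec![vec![5], vec![], vec![3, 4, 5, 6]];
    assert_eq!(decode(&[1, 5, 0, 4, 3, 4, 5, 6]), expected);
}
```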

Approach: The parser should be efficient, hence we give it a buffer to avoid reallocations:

pub struct BufferedParser<Source> {
    buffer: Vec<u8>,
    source: Source, // the file we read from
}

To create a versatile parser, we want it to implement the Iterator trait. This trait basically looks like this:

pub trait Iterator {
    type Item;

    fn next(&mut self) -> Option<Self::Item>;
}

Problem: We cannot simply implement Iterator for BufferedParser<Source> with the returned Iterator::Item borrowing from the buffer, because without GATs we cannot relate the lifetime of the &mut self argument of Iterator::next to Iterator::Item. (I am aware that there are crates with lending-iterator traits, but these are not compatible with the std-lib Iterator and IntoIterator traits.) To illustrate:

impl<'a, Source: Read + 'a> Iterator for &'a mut BufferedParser<Source> {
    type Item = &'a [u8];

    fn next<'this>(&'this mut self) -> Option<Self::Item> {
        // update buffer...
        Some(&self.buffer[x..y])
    }
}

There is no way to make this compile: the lifetime 'this of the borrow taken in next is unrelated to 'a, so we cannot return a &'a slice from it.
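As an aside, the lifetime relation that next would need can be expressed with generic associated types (stable since Rust 1.65), but only on a custom trait, not on std's Iterator. A minimal sketch of such a lending-iterator trait, with a toy implementor (names are ours):

```rust
// A lending iterator: the item borrows from the `&mut self` passed to
// `next`, which std's Iterator cannot express.
pub trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;

    fn next(&mut self) -> Option<Self::Item<'_>>;
}

// Toy implementor: lends a view of its own buffer on every call.
struct Repeater(Vec<u8>);

impl LendingIterator for Repeater {
    type Item<'a> = &'a [u8] where Self: 'a;

    fn next(&mut self) -> Option<Self::Item<'_>> {
        Some(&self.0)
    }
}

fn main() {
    let mut it = Repeater(vec![1, 2, 3]);
    assert_eq!(it.next(), Some(&[1u8, 2, 3][..]));
}
```

The price is exactly the one mentioned above: nothing in the standard library (collect, IntoIterator, for loops via IntoIterator impls on adapters) accepts this trait.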

Solution: To solve this, I made the following attempt using unsafe Rust. Basically, we return the decomposition of the slice into a pointer and a length, wrapped in a type Entry that counts the number of references to the buffer. When calling next, we panic if there is still an Entry in existence. For reference counting, I am abusing Rc, and the Entry type implements Deref into a [u8] to access the data.

use std::fmt::{Debug, Formatter};
use std::io::{ErrorKind, Read};
use std::ops::Deref;
use std::rc::Rc;
use std::{mem, slice};

pub struct BufferedParser<Source> {
    buffer: Vec<u8>,
    source: Source,
    borrow_counter: Rc<()>,
}

#[derive(Clone)]
pub struct Entry {
    offset: *const u8,
    len: usize,

    // I hope the compiler does not optimise this away...
    #[allow(dead_code)]
    borrow_counter: Rc<()>,
}

impl<Source: Read> Iterator for BufferedParser<Source> {
    type Item = Entry;

    fn next(&mut self) -> Option<Self::Item> {
        // Making sure that all instances of Entry have been dropped.
        let rc = mem::replace(&mut self.borrow_counter, Rc::new(()));
        Rc::try_unwrap(rc).unwrap();

        // A bunch of parsing code here, the more relevant part is at the end of this function.
        if self.buffer.is_empty() {
            self.buffer.resize(1, 0);
        }
        match self.source.read_exact(&mut self.buffer[0..1]) {
            Ok(()) => {}
            Err(error) if error.kind() == ErrorKind::UnexpectedEof => return None,
            Err(error) => panic!("failed to read entry length: {error}"),
        }
        let length = u8::from_be_bytes(self.buffer[0..1].try_into().unwrap()) as usize;

        if self.buffer.len() < length {
            self.buffer.resize(length, 0);
        }
        self.source.read_exact(&mut self.buffer[0..length]).unwrap();

        // Return the slice as a pointer and length, to escape its lifetime.
        let slice = &self.buffer[0..length];
        let range = slice.as_ptr_range();
        Some(Entry {
            offset: range.start,
            len: slice.len(),
            borrow_counter: self.borrow_counter.clone(),
        })
    }
}

impl<Source> Drop for BufferedParser<Source> {
    fn drop(&mut self) {
        // Making sure that all instances of Entry have been dropped.
        let rc = mem::replace(&mut self.borrow_counter, Rc::new(()));
        Rc::try_unwrap(rc).unwrap();
    }
}

impl Deref for Entry {
    type Target = [u8];

    fn deref(&self) -> &Self::Target {
        unsafe { slice::from_raw_parts(self.offset, self.len) }
    }
}

impl Debug for Entry {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        let slice: &[u8] = self.deref();
        slice.fmt(f)
    }
}

impl<Source> BufferedParser<Source> {
    pub fn new(source: Source) -> Self {
        Self {
            buffer: Default::default(),
            source,
            borrow_counter: Default::default(),
        }
    }
}

fn main() {
    let buffer = [1, 5, 0, 4, 3, 4, 5, 6];
    let parser = BufferedParser::new(buffer.as_slice());

    for slice in parser {
        println!("slice: {slice:?}");
    }
}

(Playground)

Is the usage of slice::from_raw_parts sound in this example? And is this principle sound in general, assuming no one tampers with the borrow_counter fields? Is there a better way (more efficient or more idiomatic) to solve this problem in current stable Rust?

Edit: added a Drop implementation to the solution, since otherwise BufferedParser could be dropped before an Entry, leaving a dangling pointer.

When you are using unsafe for circumventing lifetime errors, you can be around 99% sure that it's unsound.

I'm not sure that this particular piece is unsound, but it is at least logically incorrect for sure. The proof of the pudding is in the eating: the primary reason why Iterator::next() has to decouple the lifetimes of self and Item is to support collect(). If you try collecting your iterator, it simply panics.

Given that this ends up allocating upon every call to next() anyway (even though the () is zero-sized, the Rc still heap-allocates the strong and weak counters!), this usage of unsafe doesn't seem particularly justified on performance grounds.

With regard to borrow_counter: that feels like a non-idiomatic re-implementation of half of RefCell; that's where borrow counters live. Thus, one can rewrite the whole thing without a line of unsafe using RefCell: Playground

#[derive(Clone)]
pub struct Entry {
    buffer: Rc<RefCell<Vec<u8>>>,
    len: usize,
}

impl Entry {
    fn borrow(&self) -> Ref<'_, [u8]> {
        Ref::map(self.buffer.borrow(), |v: &Vec<_>| &v[..self.len])
    }
    
    fn borrow_mut(&self) -> RefMut<'_, [u8]> {
        RefMut::map(self.buffer.borrow_mut(), |v: &mut Vec<_>| &mut v[..self.len])
    }
}

impl Debug for Entry {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        self.borrow().fmt(f)
    }
}

pub struct BufferedParser<R> {
    buffer: Rc<RefCell<Vec<u8>>>,
    reader: R,
}

impl<R> BufferedParser<R> {
    pub fn new(reader: R) -> Self {
        Self {
            buffer: Default::default(),
            reader,
        }
    }
}

impl<R: Read> Iterator for BufferedParser<R> {
    type Item = io::Result<Entry>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut length = 0;
        match self.reader.read_exact(core::slice::from_mut(&mut length)) {
            Ok(_) => {}
            Err(error) if error.kind() == ErrorKind::UnexpectedEof => {
                return None;
            }
            Err(error) => return Some(Err(error))
        };
        let length = usize::from(length);

        if self.buffer.borrow().len() < length {
            self.buffer.borrow_mut().resize(length, 0);
        }
        match self.reader.read_exact(&mut self.buffer.borrow_mut()[..length]) {
            Ok(_) => {}
            Err(error) => return Some(Err(error)),
        }

        Some(Ok(Entry {
            buffer: self.buffer.clone(),
            len: length,
        }))
    }
}

Edit: Turns out, this is incorrect, too. It shares the buffer, so every item ends up being a sub-slice of the last item. Shared mutability is hard. And bad. Don't use it. At least not by bending the type system like this.

You should probably use some sort of copy-on-write mechanism, or give up on implementing Iterator altogether, and provide an interface that doesn't mislead the user into thinking that they've got a nice and normal iterator, which then blows up in their hands.


Two common ways to do this kind of thing,

  1. make a separate struct that implements Iterator and implement IntoIterator for &'a BufferedParser
  2. if BufferedParser really has to implement Iterator for some reason, just have it borrow the data from somewhere else [1]

would seem to work fine.


  1. which is what your code is effectively doing anyway, because the slices are borrowed from the inner buffer, except you've hidden it from the compiler in a way that implies memory leakage or panicking instead of a compile error if you break that assumption


Here's an updated version that uses copy-on-write by swapping buffers in and out of the Entry and only allocating a new one inside the BufferedParser if the previous entry still holds onto the buffer.

Alright thanks! So there are a few things:

  • Using Rc to implement reference counting is inefficient. Hence I should use e.g. AtomicUsize manually.
  • My implementation violates the Iterator contract. An iterator is required to return values that can be alive until the iterator itself gets Dropped. But my implementation mimicks more the LendingIterator which requires items only to be alive between two calls of next.
    For the good part, it seems like my implementation cleanly panicked when used in a way that violates the LendingIterator constraints, i.e. trying to call next while an item is still alive. For the bad part, of course all the APIs using iterators under the assumptions that items may be alive longer than until the next call to next will trigger an unwanted panic. For those APIs, one would need to add an OwnedEntry, whose lifetime is truly independent from the iterator. Then, for e.g. collecting, one would need to map the items through to_owned first.

Rc doesn't use atomics (that would be Arc), so an atomic implementation would be no faster at best (and slower at worst).

Right yeah, that would be Rc<RefCell<usize>> then.

Thanks! I must say I don't totally understand what you mean.

  1. if I borrow the parser in the iterator, would the problem not be the same? In the end, the iterator would still need to mutate the buffer in some way, and then borrow from the buffer.
  2. this I find interesting, but I also don't completely understand. So you propose some architecture, where the buffer and the parsing iterator are somehow separate entities as opposed to the parser containing the buffer? But then any API that supports the Iterator trait could also not be used, or could it? Since then the API would need to support this auxiliary data for the iterator.
  1. See the documentation of Itertools::group_by for an example of this technique. Note that GroupBy does not implement Iterator.
  2. This is effectively the same thing, I guess, just depends on what you call a parser. I don't think of a parser as something that owns the thing it's parsing, so I might just have a Buffered<Source> and then give the parser a reference to it. The parser can implement Iterator and hand out references to the buffer that aren't bound to its own (the iterator's) lifetime.

I'm assuming the buffer is not just a Vec, but probably something more arena-like. Admittedly this suggests some things might not get freed until the buffer itself is dropped.

I don't find the Iterator abstraction maps onto parsers terribly well, TBH. Generally the point of a parser is to deserialize something, which means its output should be in some sense traversable in "both" directions (an AST, or something with lookahead, and Peekable is hard to use).
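The "references not bound to the iterator's own lifetime" idea from point 2 can be sketched in its simplest form, assuming the entire input is already in memory (so the streaming constraint from the original post is dropped; the names here are ours):

```rust
// The iterator borrows the buffer for 'a; the items it hands out are
// tied to the buffer, not to the `&mut self` taken by `next`.
pub struct SliceParser<'a> {
    input: &'a [u8],
}

impl<'a> Iterator for SliceParser<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        // Panics on truncated input, for brevity.
        let (&len, rest) = self.input.split_first()?;
        let (entry, rest) = rest.split_at(len as usize);
        self.input = rest;
        Some(entry)
    }
}

fn main() {
    let buffer = [1, 5, 0, 4, 3, 4, 5, 6];
    // collect() works, because the items outlive the iterator.
    let entries: Vec<&[u8]> = SliceParser { input: &buffer }.collect();
    let expected: [&[u8]; 3] = [&[5], &[], &[3, 4, 5, 6]];
    assert_eq!(entries, expected);
}
```

This is a proper Iterator with no interior mutability, but it only works because next never has to mutate the buffer it lends from.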

That's great!

I have one concern only, which is that cloning the Entry would now clone the buffer as well. I am not sure if in anyone's parser that would be an issue, but for the sake of it, could one:

  • put the buffer itself behind Rc<RefCell<Vec<u8>>> in the parser, and share it with all entries, i.e. entry contains the same Rc<RefCell<Vec<u8>>>,
  • use my code for "abusing" Rc for reference counting in the original post to panic if any entries exist when calling next or drop on the parser,
  • give the entry a length parameter, indicating the length of the prefix of the buffer that is currently valid, and
  • give the Entry a function that returns the valid slice from the buffer, i.e. &buffer.borrow()[0..length]?

Then there should be only a constant number of allocations, except for resizing the vector. And there should be no issue with unsoundness, since no unsafe code is involved. If the buffer would ever change while an Entry exists, then the parser would panic instead.

In my code, Entry still has a len…?

That was my first attempt, but it doesn't work. The parser has to hold onto the buffer as well, so if you put an Rc<RefCell<Vec<_>>> into the entry too, then it's not going to be uniquely owned, hence the relevant APIs on Rc (e.g. make_mut(), get_mut()) don't work (the former clones unnecessarily, and the latter incorrectly returns None).

And RefCell::borrow returns a guard that we cannot simply take and return a slice from, since it is a temporary object and not just a reference to the slice inside the Vec.
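That behavior of Rc::get_mut and Rc::make_mut is easy to demonstrate in isolation:

```rust
use std::rc::Rc;

fn main() {
    let mut a = Rc::new(vec![1u8, 2, 3]);

    // Uniquely owned: get_mut hands out a mutable reference.
    assert!(Rc::get_mut(&mut a).is_some());

    // As soon as the value is shared, get_mut returns None ...
    let b = Rc::clone(&a);
    assert!(Rc::get_mut(&mut a).is_none());

    // ... and make_mut clones instead, leaving `b` untouched.
    Rc::make_mut(&mut a).push(4);
    assert_eq!(*b, vec![1, 2, 3]);
    assert_eq!(*a, vec![1, 2, 3, 4]);
}
```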

I did manage to make it work now using Rc<RefCell<Vec<u8>>>, and I think it allocates only to resize the buffer (plus constant allocations). Unfortunately, to get a slice out of the Entry, we still need unsafe code (try_borrow_unguarded), because the safe borrow returns a guard and not just a reference. Here is the updated code.

Shortened version with the most relevant parts:

pub struct BufferedParser<Source> {
    buffer: Rc<RefCell<Vec<u8>>>,
    source: Source,
}

#[derive(Clone)]
pub struct Entry {
    buffer: Rc<RefCell<Vec<u8>>>,
    length: usize,
}

impl<Source: Read> Iterator for BufferedParser<Source> {
    type Item = Entry;

    fn next(&mut self) -> Option<Self::Item> {
        // Make sure that all instances of Entry have been dropped.
        assert_eq!(Rc::strong_count(&self.buffer), 1);
        let buffer = &mut *self.buffer.borrow_mut();

        // Parsing code...

        // Give the entry a pointer to the buffer, and the length of the current entry.
        Some(Entry {
            buffer: self.buffer.clone(),
            length,
        })
    }
}

impl<Source> Drop for BufferedParser<Source> {
    fn drop(&mut self) {
        // Make sure that all instances of Entry have been dropped.
        assert_eq!(Rc::strong_count(&self.buffer), 1);
    }
}

impl Deref for Entry {
    type Target = [u8];

    fn deref(&self) -> &Self::Target {
        // Unfortunately the following does not work, as borrow() creates a temporary guard object.
        // let buffer = self.buffer.borrow()
    
        // Safety: borrowing the buffer mutably only happens in BufferedParser::next,
        //         and there we borrow the parser mutably (uniquely), and check that the parser is
        //         the only one having a reference to the buffer at that point.
        //         The references created here cannot outlive the Entry, hence they cannot be alive
        //         while only the parser has a reference to the buffer.
        let buffer = unsafe { self.buffer.try_borrow_unguarded() }.unwrap();
        &buffer[0..self.length]
    }
}

That's no problem, you can just return a Ref or a RefMut.

Please don't do that. Just return a Ref or RefMut, as per the above.

You seem to be trying to implement a StreamingIterator. Rust's standard Iterator intentionally forbids that pattern.

For every Iterator this must always work:

let a = iter.next()?;
let b = iter.next()?;
drop(iter);
use_both(a, b);

which means you must never lend any data that belongs to the iterator, because returned items must outlive it. You must never modify data that has already been given out, because (without interior mutability) you'd be circumventing either immutability or uniqueness guarantees, and you'd be breaking uses like collect().


I also want to emphasize one particular part of @kornel's post: returned values can be alive longer than the iterator. collect consumes the iterator, after all.

I agree more generally with the sentiment: if you can't be a (proper) Iterator, don't implement Iterator. Especially not via unsafe - if you need that to make it work, it's almost certainly unsound. The struggles you're hitting are from the language trying to tell you that's a square peg you're trying to force into a round hole.


This very specific example can be parsed efficiently with nom and a safe iterator: Rust Playground (It's the first solution described in this comment.)

Your real data format is obviously going to be more complex. But maybe this gives you some ideas for tackling those issues in a safe and efficient manner.

Thanks for the input!

Parasyte, I don't see how your code works when the input is not given as a &[u8] but as any type that implements Read. Remember that the scenario is streaming: we don't want to load the whole file into memory first, but process its entries one by one.

I see that using unsafe is in general not the best thing to do. Using H2CO3's idea of not implementing Deref but providing a custom method instead, it can easily be avoided.

Then there is still the part where I am violating the contract of the Iterator trait, as many of you have pointed out. In my last example, I did that artificially by panicking when the iterator was used in a way that is not a streaming iterator. So what can be done instead is to simply allocate a new buffer in that case. That way, the iterator will get the efficiency* bonus from not allocating when being used in a streaming fashion, while still upholding the Iterator contract. Here is the updated code. Most notably, collect does not panic.

Most relevant parts from the code:

#![forbid(unsafe_code)]

#[derive(Debug, Clone)]
pub struct BufferedParser<Source> {
    buffer: Rc<RefCell<Vec<u8>>>,
    source: Source,
}

#[derive(Clone)]
pub struct Entry {
    buffer: Rc<RefCell<Vec<u8>>>,
    length: usize,
}

impl<Source: Read> Iterator for BufferedParser<Source> {
    type Item = Entry;

    fn next(&mut self) -> Option<Self::Item> {
        // If there are still undropped entries, we need a new buffer.
        // Otherwise, we would update the buffer under the entries' feet, invalidating them.
        if Rc::strong_count(&self.buffer) > 1 {
            self.buffer = Default::default();
        }
        let mut buffer = self.buffer.borrow_mut();

        // Parsing code... (omitted)

        // Give the entry a pointer to the buffer, and the length of the current entry.
        Some(Entry {
            buffer: self.buffer.clone(),
            length,
        })
    }
}

impl Entry {
    fn get(&self) -> Ref<'_, [u8]> {
        Ref::map(self.buffer.borrow(), |buffer| &buffer[0..self.length])
    }
}

fn main() {
    let buffer = [1, 5, 0, 4, 3, 4, 5, 6];
    let parser = BufferedParser::new(buffer.as_slice());

    for slice in parser.clone() {
        println!("slice: {slice:?}");
    }

    let slices: Vec<_> = parser.collect();
    println!("slices: {slices:?}");
}

This should now work as intended, right? No unsafe code, and the Iterator contract is upheld?


(*) maybe read this more as a "potential for better efficiency", as I have not actually measured if this is faster now. I would hope though that avoiding allocations here is faster in at least some practical settings.

Yes, except that it can be simplified even further. You may now drop the RefCell and just use Rc::make_mut(), and thus you can impl Deref for Entry, too: Playground.
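The core of that simplification looks roughly like this (a sketch with the parsing inlined; error handling is folded into None for brevity, which differs from the io::Result-returning versions above). Rc::make_mut provides the copy-on-write behavior: it clones the Vec only when an undropped Entry still shares it.

```rust
use std::io::Read;
use std::ops::Deref;
use std::rc::Rc;

#[derive(Clone)]
pub struct Entry {
    buffer: Rc<Vec<u8>>,
    length: usize,
}

impl Deref for Entry {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        &self.buffer[..self.length]
    }
}

pub struct BufferedParser<Source> {
    buffer: Rc<Vec<u8>>,
    source: Source,
}

impl<Source: Read> Iterator for BufferedParser<Source> {
    type Item = Entry;

    fn next(&mut self) -> Option<Entry> {
        // Clones the Vec only if a live Entry still shares it (copy-on-write).
        let buffer = Rc::make_mut(&mut self.buffer);

        let mut len_byte = 0u8;
        self.source.read_exact(std::slice::from_mut(&mut len_byte)).ok()?;
        let length = usize::from(len_byte);

        if buffer.len() < length {
            buffer.resize(length, 0);
        }
        self.source.read_exact(&mut buffer[..length]).ok()?;

        Some(Entry {
            buffer: Rc::clone(&self.buffer),
            length,
        })
    }
}

fn main() {
    let data = [1u8, 5, 0, 4, 3, 4, 5, 6];
    let parser = BufferedParser {
        buffer: Rc::new(Vec::new()),
        source: &data[..],
    };
    // collect() works: each Entry keeps its own buffer alive.
    let entries: Vec<Entry> = parser.collect();
    assert_eq!(&entries[0][..], &[5u8][..]);
    assert_eq!(&entries[1][..], &[0u8; 0][..]);
    assert_eq!(&entries[2][..], &[3u8, 4, 5, 6][..]);
}
```

When the entries are dropped before the next call to next, no clone happens and the buffer is reused, so the streaming fast path stays allocation-free.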

A further optimization opportunity would be to use Rc<[u8]> instead of Rc<Vec<u8>>, avoiding a double indirection. This would require you to implement the growing strategy manually, though (i.e., allocate a new buffer not only when it's shared, but also when the current one is too short).


Great!

Is there a method similar to Rc::make_mut() that initialises with default instead of clone?

And I tried a bit now, but I am unable to construct an Rc<[u8]> with dynamic size, since Rc::new wants the argument to be Sized. Do you know how this would work?

You would use e.g. Rc::from(vec![0; size]) for that.
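Putting both answers together: Rc::from(Vec<u8>) builds a dynamically sized Rc<[u8]>, and a make-unique helper that resets a shared value to Default instead of cloning it can be written with Rc::get_mut (the helper name is ours; std has no such method):

```rust
use std::rc::Rc;

// Like Rc::make_mut, but replaces a shared value with T::default()
// instead of cloning it. Hypothetical helper, not part of std.
fn make_mut_or_default<T: Default>(rc: &mut Rc<T>) -> &mut T {
    if Rc::get_mut(rc).is_none() {
        *rc = Rc::new(T::default());
    }
    Rc::get_mut(rc).expect("just made unique")
}

fn main() {
    // Constructing a dynamically sized Rc<[u8]> from a Vec.
    let buffer: Rc<[u8]> = Rc::from(vec![0u8; 8]);
    assert_eq!(buffer.len(), 8);

    // The helper resets shared values instead of cloning them.
    let mut a = Rc::new(vec![1u8, 2, 3]);
    let b = Rc::clone(&a);
    make_mut_or_default(&mut a).push(4);
    assert_eq!(*a, vec![4]);       // fresh default Vec, then push
    assert_eq!(*b, vec![1, 2, 3]); // original untouched
}
```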