Declare lifetime of a reference to an item from streaming iterator?

Hi,

Consider the following

use streaming_iterator::StreamingIterator;

struct Inner<'a>(&'a [u8], usize);

struct A<'a, I: StreamingIterator<Item = [u8]>> {
    iter: I,
    current: Option<Inner<'a>>,
}

impl<'a, I: StreamingIterator<Item = [u8]>> Iterator for A<'a, I> {
    type Item = [u8; 64];

    fn next(&mut self) -> Option<Self::Item> {
        if let Some(current) = self.current {
            todo!("output something based on current, potentially setting current to None")
        } else {
            // fetch new item
            self.current = self.iter.next().map(|x| Inner(x, 0));
            if let Some(current) = self.current {
                self.next()
            } else {
                None
            }
        }
    }
}

fn main() {}

This code does not compile because I can't assign self.current to an item with an autoref (from streaming iterator):

error[E0495]: cannot infer an appropriate lifetime for autoref due to conflicting requirements
   --> mod.rs:105:38
    |
105 |             self.current = self.iter.next().map(|x| Inner(x, 0));
    |                                      ^^^^
    |
note: first, the lifetime cannot outlive the anonymous lifetime defined here...
   --> mod.rs:100:13
    |
100 |     fn next(&mut self) -> Option<Self::Item> {
    |             ^^^^^^^^^
note: ...so that reference does not outlive borrowed content
   --> mod.rs:105:28
    |
105 |             self.current = self.iter.next().map(|x| Inner(x, 0));
    |                            ^^^^^^^^^
note: but, the lifetime must be valid for the lifetime `'a` as defined here...
   --> mod.rs:97:6
    |
97  | impl<'a, I: StreamingIterator<Item = [u8]>> Iterator for A<'a, I> {
    |      ^^
note: ...so that the expression is assignable
   --> mod.rs:105:28
    |
105 |             self.current = self.iter.next().map(|x| Inner(x, 0));
    |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    = note: expected `Option<Inner<'a>>`
               found `Option<Inner<'_>>`

For more information about this error, try `rustc --explain E0495`

Any ideas how to make this compile? Specifically, I need to be able to store the state Inner that has a reference to data coming from a streaming iterator. Using unsafe is not allowed (forbid unsafe code), but e.g. RefCell is fine.

The main issue here seems to be my inability to specify that 'a in current: Option<Inner<'a>> is incorrect - I want it to represent the lifetime of the reference coming from I.

That means you're trying to create a self-referential struct, and there is no straight-forward way to do so.

And it's not just a syntactical difficulty, but the code using these interfaces could really be unsafe, and cause real crashes.

For example, if the I type stored its data inline, and the user of the A struct moved it after starting iteration, that would invalidate the pointer in current. It's not too far fetched - on-stack arrays are common, and I've needed to write code like let first = iter.next(); process_rest(iter).

With streaming iterators there's also a logical difficulty that calling advance (implied by next) invalidates all existing elements, so the field would be valid only some of the time. The borrow checker can't ensure that the field is None while some other code somewhere calls advance.

So this iterator has real unsafety, and the unsafety is too subtle for static analysis, so unfortunately you need unsafe and ensure correctness yourself.

Within safe Rust, you'd need Rc/Arc to hold on to the current element. And if it's a streaming iterator, it'd probably want Rc<RefCell or Arc<Mutex to modify the element in-place. But statically proving that you never use current after calling advance would be tricky (needs typestate cleverness), and probably not worth doing at language level.

1 Like

Rust does kind of allow self-referential structs since such types are auto generated for async functions. We can therefore make a solution using an async function.

Since async functions are basically just generators with some extra features we should be able to do the same with generators and get away with slightly less code. Unfortunately it seems that currently generators doesn't allow borrows over yield points but since it is an unstable nightly feature that might change in the future.

2 Likes

Thanks! Exactly, that was my thinking also. The element from the streaming iterator is not mutable, but Inner is.

I would be fine with runtime checks, but even then, I was unable to make it work - What do I Rc on to allow a runtime borrow check here? The value of the streaming iterator is a reference with a lifetime, I can't Rc that, right? I would need to change the streaming iterator to return an Rc?

And even so, I would still need self-references, since I would now need to hold the Rc somewhere and Inner would reference it.

Thanks for the suggestion!

Unfortunately all the logic that I am implementing is CPU-bounded, which means that if I make this API async, the caller will be blocked by the CPU-bounded activities and will have to do something about it (e.g. run it on a separate thread pool).

For this reason, I am generally splitting our interfaces between IO-bounded (async) and CPU-bounded (sync).

With my suggestion it doesn't matter that your code is CPU-bounded since the async function doesn't actually call any async code and the A struct only exposes sync APIs. Instead the async function is only used to be able to pause the functions execution and allow keeping the reference into the streaming iterator around even after yielding an item. Since the async function isn't hooked up to any executor or reactor it would actually be a bad idea to run any "real" async code inside the async function.

Actually, after thinking about it some more and looking at StreamingIterator's definition I don't think you need to make a self-referential struct at all. You could just get the slice again using StreamingIterator::get every time the Iterator::next method is called. Something like this.

2 Likes

Thanks a lot for the explanation.

Unfortunately the state of Inner in my case is not just a number - it is an iterator of u32 over the slice that progressively consumes the slice, reading it under a non-trivial combination of unpacking bitpacked numbers and others.

Example:

#[derive(Debug)]
struct Inner<'a> {
    values: parquet2::encoding::BitmapIter<'a>,
    validity: std::iter::Peekable<parquet2::encoding::hybrid_rle::Decoder<'a>>,
    offset: usize,
    remaining: usize,
    length: usize,
}

where both BitmapIter and Decoder are iterators (over different parts of the Vec<u8>). I.e.

  1. the derived state of the slices is not easy to recreate
  2. it is not easy to split the "state" of Inner from the ownership of its data, since some of that state are moving references along the byte slice

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.