A little help with a lifetime error

I am trying to implement a Java-esque scanner object. The code fails to compile, but I don't understand what the compiler is trying to say.

use std::{io::{self, Read}, result, str};

pub struct SingleReadScanner<'a> {
    buffer: Vec<u8>,
    iter: str::SplitAsciiWhitespace<'a>
}
impl<'a> SingleReadScanner<'a> {
    pub fn default() -> Self {
        Self {
            buffer: vec![],
            iter: "".split_ascii_whitespace()
        }
    }
    pub fn next<T: str::FromStr>(&mut self) -> result::Result<T, T::Err> {
        if let Some(token) = self.iter.next() { return token.parse(); }
        io::stdin().read_to_end(&mut self.buffer).unwrap();
        // This assignment is what the compiler rejects: the borrow of self.buffer does not live for 'a.
        self.iter = unsafe { str::from_utf8_unchecked(&self.buffer).split_ascii_whitespace() };
        self.iter.next().unwrap().parse()
    }
}

However, if the result of the unsafe block is first passed through mem::transmute() (which is also unsafe), the program compiles fine.

I'm not sure why this is the case. The idea is that iter must be some sort of iterator over buffer, so I would like to avoid any intermediary like the transmute here. Can someone figure out what is wrong and explain the compiler message (E0495)?

It looks like you are trying to make a self-referential struct: one that contains a reference pointing inside itself. That is not possible in safe Rust, because allowing it would cause dangling pointers if the struct were moved.

Generally, you are right to say that you want to avoid unsafe. Typically, you only need unsafe for interacting with external unsafe C code through FFI, or for implementing your own, very specialized data structures that need to do their own memory management for some reason (this latter case is actually very rare, even compared to interacting with FFI).

Instead of putting the owning buffer and its view (the iterator in your case) in the same type, split them up, so that the owning and the borrowing is done in two distinct types.
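A minimal sketch of that split, in case it helps. The names InputBuffer, Tokens and next_parsed are made up for illustration and are not from your code; the only point is that one type owns the text and the other merely borrows from it:

```rust
use std::str::{FromStr, SplitAsciiWhitespace};

/// Owns the input text (hypothetical name, not from the original code).
pub struct InputBuffer {
    text: String,
}

/// Borrows from an `InputBuffer`, so it cannot outlive it (hypothetical name).
pub struct Tokens<'a> {
    iter: SplitAsciiWhitespace<'a>,
}

impl InputBuffer {
    pub fn new(text: String) -> Self {
        Self { text }
    }

    /// Hands out a borrowing view; the buffer stays immutably borrowed while the view is alive.
    pub fn tokens(&self) -> Tokens<'_> {
        Tokens { iter: self.text.split_ascii_whitespace() }
    }
}

impl<'a> Tokens<'a> {
    /// Returns `None` when there are no tokens left, otherwise the parse result.
    pub fn next_parsed<T: FromStr>(&mut self) -> Option<Result<T, T::Err>> {
        self.iter.next().map(|token| token.parse())
    }
}

fn main() {
    let buffer = InputBuffer::new(String::from("12 7 hello"));
    let mut tokens = buffer.tokens();
    let a: i32 = tokens.next_parsed().unwrap().unwrap();
    let b: i32 = tokens.next_parsed().unwrap().unwrap();
    println!("{}", a + b);
}
```

Because Tokens holds a shared borrow of the InputBuffer, the compiler will not let you move or mutate the buffer until the Tokens value is gone, which is exactly the guarantee the single-struct version could not provide.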

However, you have another problem. You are trying to continuously hold an immutable borrow of your buffer (which is what the iterator does with it), but sometimes you are also trying to mutate the same buffer. Disallowing this is the very point of Rust's borrowing system. In fact, I think your code would be neither correct nor sound as-is, because you are trying to hold an iterator over some bytes that suddenly change while the iterator is still looking at them. Therefore, you will need to substantially rethink your design, probably by moving the iterator into a local variable.

A third soundness hole in your code is using str::from_utf8_unchecked() on arbitrary data read from I/O. Such data is not guaranteed to be valid UTF-8, so blindly slapping unsafe on your code means that it will cause Undefined Behavior when it reads invalid or untrusted string data.
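For comparison, here is a small sketch of the checked route using str::from_utf8(), which validates the bytes and returns an error instead of invoking Undefined Behavior. The helper name first_token is invented for this example:

```rust
use std::str;

/// Validates the bytes before treating them as text, then returns the first token, if any.
/// (`first_token` is a hypothetical helper, not part of the original code.)
fn first_token(bytes: &[u8]) -> Result<Option<&str>, str::Utf8Error> {
    let text = str::from_utf8(bytes)?;
    Ok(text.split_ascii_whitespace().next())
}

fn main() {
    // Valid UTF-8: the first token is returned.
    println!("{:?}", first_token(b"  42 rest"));
    // Invalid UTF-8: an error is reported, nothing unsafe happens.
    println!("{:?}", first_token(&[0xff, 0xfe, b' ', b'1']));
}
```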

So a variable and any kind of reference to it cannot coexist in the same struct. Would it work if the iterator were enclosed in a tuple struct?

No, it doesn't matter how many levels of types you have between the two. The point is, a reference cannot be contained within the same memory region it points into, because if the memory region is moved, then the reference ends up pointing into invalid memory.

Tuple structs aren't special: they are structs that contain their data just like regular structs, but the fields are indexed, not named. They are merely a syntactic convenience; they still work like a container of other values. And since they contain their data just like structs with named fields, and containment is transitive, if your named-field struct contains a tuple struct that contains a reference, then your outer, named-field struct also (transitively) contains the reference. So when the whole thing is moved, its parts are moved, and the reference again ends up pointing to invalid memory.

Maybe you are thinking that structs in Rust cause heap allocation and indirection? They do not. They contain all of their fields directly, by value.

Think of it like this: you have a big box at home; this is your outer struct with named fields. It contains a sheet of paper on which your home address is written; this is the reference. If you move to another place and bring your big box with you, the address written on the sheet of paper will point to an address that is no longer yours.

Now, would it help if you put the sheet of paper inside another, smaller box, within your bigger box? Of course it wouldn't. Your moving and your address changing would still cause the address written on the paper to be invalidated.

Thanks for the analogy. If I understand correctly, all references are supposed to go out of scope when the owner mutates and in this case iter can't go out of scope unless the whole Scanner goes out of scope.

What if buffer is passed in instead of being a member of the Scanner? I've seen someone do it before.

You can indeed pass a mutable borrow to the buffer, but in this case, I think it's actually easier to both implement and use the type if you keep the buffer inside the Scanner, and only create temporary borrows, keeping track of the current index only. I have created a Playground example that demonstrates this technique.
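To illustrate the idea without relying on the link, here is a rough sketch of a scanner that owns its buffer and stores only a byte offset. This is not the Playground code; Scanner, next_range and next_parsed are names chosen for this sketch, it is generic over any reader so that it can run without stdin, and for brevity it reports parse failures as an io::Error:

```rust
use std::io::{self, Read};
use std::ops::Range;
use std::str::FromStr;

/// Owns its buffer and remembers only how far it has read (hypothetical type).
pub struct Scanner<R> {
    reader: R,
    buffer: String,
    pos: usize,
}

impl<R: Read> Scanner<R> {
    pub fn new(reader: R) -> Self {
        Self { reader, buffer: String::new(), pos: 0 }
    }

    /// Finds the next token at or after `pos` and advances `pos` past it.
    fn next_range(&mut self) -> Option<Range<usize>> {
        let rest = &self.buffer[self.pos..];
        let trimmed = rest.trim_start_matches(|c: char| c.is_ascii_whitespace());
        let start = self.buffer.len() - trimmed.len();
        let len = trimmed
            .find(|c: char| c.is_ascii_whitespace())
            .unwrap_or(trimmed.len());
        self.pos = start + len;
        if len == 0 { None } else { Some(start..start + len) }
    }

    /// Returns `Ok(None)` once the input is exhausted.
    pub fn next_parsed<T: FromStr>(&mut self) -> io::Result<Option<T>> {
        let range = match self.next_range() {
            Some(range) => range,
            None => {
                // Nothing is borrowed from the buffer here, so it can be cleared and refilled.
                self.buffer.clear();
                self.pos = 0;
                // read_to_string also validates UTF-8 and reports failure as an error.
                self.reader.read_to_string(&mut self.buffer)?;
                match self.next_range() {
                    Some(range) => range,
                    None => return Ok(None),
                }
            }
        };
        // The borrow of the token lasts only for this expression.
        self.buffer[range]
            .parse::<T>()
            .map(Some)
            .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "token failed to parse"))
    }
}

fn main() -> io::Result<()> {
    // An in-memory reader stands in for io::stdin() so the sketch runs anywhere.
    let mut scanner = Scanner::new("3 4\n5".as_bytes());
    let mut sum = 0i64;
    while let Some(n) = scanner.next_parsed::<i64>()? {
        sum += n;
    }
    println!("{}", sum);
    Ok(())
}
```

Since only an index is stored between calls, there is no long-lived reference into the buffer, and the struct can be moved or refilled freely.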

Here, I have also created an error type which can handle both I/O errors and UTF-8 conversion errors, because it's bad practice to panic in Result-returning functions. After all, I/O errors and invalid user input are expected and not fatal (recoverable), so you should handle or propagate them instead of crashing.
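Such an error type could look roughly like the following. This is only a sketch under my own naming (ScanError and read_text are hypothetical), not necessarily what the Playground contains:

```rust
use std::io::{self, Read};
use std::{fmt, str};

/// Either an I/O failure or invalid UTF-8 in the input (hypothetical type).
#[derive(Debug)]
pub enum ScanError {
    Io(io::Error),
    Utf8(str::Utf8Error),
}

impl fmt::Display for ScanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScanError::Io(e) => write!(f, "I/O error: {}", e),
            ScanError::Utf8(e) => write!(f, "invalid UTF-8: {}", e),
        }
    }
}

impl std::error::Error for ScanError {}

// The From impls let `?` convert both underlying errors automatically.
impl From<io::Error> for ScanError {
    fn from(e: io::Error) -> Self {
        ScanError::Io(e)
    }
}

impl From<str::Utf8Error> for ScanError {
    fn from(e: str::Utf8Error) -> Self {
        ScanError::Utf8(e)
    }
}

/// Reads everything from `reader` and validates it, propagating both kinds of failure.
fn read_text(mut reader: impl Read) -> Result<String, ScanError> {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?;
    let text = str::from_utf8(&bytes)?;
    Ok(text.to_owned())
}

fn main() {
    // The trailing 0xff byte makes this input invalid UTF-8.
    match read_text(&[b'o', b'k', 0xff][..]) {
        Ok(text) => println!("read {:?}", text),
        Err(err) => eprintln!("could not read input: {}", err),
    }
}
```

The caller decides what to do with the error; nothing in read_text panics.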

THANK YOU for the comprehensive code.

Just one thing: doesn't re-creating the iterator every time you read a value slow the program down? That was why I was rooting for keeping iter in the struct (so that it is initialized only once).

I have no idea how much overhead it has. It surely doesn't rescan the entire string every time, because that's the whole point of creating a lazy iterator, so it won't end up being an O(N^2) algorithm. It might have slightly higher overhead than reusing the same iterator, but it is almost surely dwarfed by the I/O, likely by several orders of magnitude, because string parsing iterators typically do not even allocate. So don't optimize prematurely.

You're right, it happens to be a map.
https://doc.rust-lang.org/src/core/str/iter.rs.html#1160

I've heard maps are lazy, so that checks out.

For the same code, can you replace read_to_end() with fill_buf() and clear() with consume()?

Convert

let mut buf = mem::take(&mut self.buffer).into_bytes();
buf.clear();
io::stdin().read_to_end(&mut buf).unwrap();

to something like

let stdin = io::stdin();
let mut lock = stdin.lock();
let buf = lock.fill_buf().unwrap().to_vec();
lock.consume(buf.len());

except I don't know how to reuse the buffer like you did.

If you want to do that, you can reuse the vector like this:

let mut lock = stdin.lock();
vec.clear();
vec.extend_from_slice(lock.fill_buf().unwrap());
lock.consume(vec.len());

To be clear, this doesn't do the same thing as read_to_end. It will only give you the bytes available now, and more may become available later, unlike read_to_end, which blocks until the stream reaches EOF.
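The difference can be seen in a small, self-contained sketch. It assumes a BufReader with a deliberately tiny 4-byte capacity over an in-memory slice in place of a locked stdin, and read_available is a name I made up:

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Copies whatever is currently buffered into `vec`, reusing its allocation.
/// (`read_available` is a hypothetical helper.)
fn read_available(reader: &mut impl BufRead, vec: &mut Vec<u8>) -> io::Result<usize> {
    vec.clear();
    vec.extend_from_slice(reader.fill_buf()?);
    reader.consume(vec.len());
    Ok(vec.len())
}

fn main() -> io::Result<()> {
    // A 4-byte BufReader over a byte slice stands in for stdin.lock().
    let mut reader = BufReader::with_capacity(4, &b"hello world"[..]);

    let mut vec = Vec::new();
    read_available(&mut reader, &mut vec)?;
    println!("fill_buf gave: {:?}", String::from_utf8_lossy(&vec));

    // read_to_end keeps reading until EOF and returns everything that is left.
    let mut rest = Vec::new();
    reader.read_to_end(&mut rest)?;
    println!("read_to_end gave: {:?}", String::from_utf8_lossy(&rest));
    Ok(())
}
```

With the 4-byte capacity used here, the first call yields only the first four bytes, while read_to_end returns the whole remainder. A token can therefore be cut in half at the end of a fill_buf chunk, which a scanner built on it would have to handle.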

Sometimes the program is interactive, which means the input cannot block until EOF. That is why I'm considering fill_buf() too. Please suggest a better (faster) method if there is one.

Doesn't extend_from_slice() involve cloning? Can't cloning be avoided?

If you want to avoid copying the data, you need to access the data directly in the slice returned by fill_buf. I don't think it is worth spending time on avoiding the copy.
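If you do want to try it anyway, working on the slice directly could look something like this sketch. The helper with_available is hypothetical, and a plain byte slice (which implements BufRead) stands in for a locked stdin:

```rust
use std::io::{self, BufRead};

/// Runs `f` on the currently buffered bytes without copying them, then marks them as consumed.
/// (`with_available` is a hypothetical helper.)
fn with_available<R: BufRead, T>(
    reader: &mut R,
    f: impl FnOnce(&[u8]) -> T,
) -> io::Result<T> {
    let chunk = reader.fill_buf()?;
    let len = chunk.len();
    let out = f(chunk);
    reader.consume(len);
    Ok(out)
}

fn main() -> io::Result<()> {
    // &[u8] implements BufRead, so it can play the role of the input stream here.
    let mut reader: &[u8] = b"10 20 30\n";

    // Count whitespace-separated tokens directly in the borrowed chunk.
    let count = with_available(&mut reader, |bytes| {
        bytes
            .split(|b| b.is_ascii_whitespace())
            .filter(|token| !token.is_empty())
            .count()
    })?;
    println!("{} tokens", count);
    Ok(())
}
```

The catch is that the closure only ever sees one chunk, so a token split across two chunks still needs to be stitched together somewhere, and that usually means copying after all.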