Understanding Iterators (in Rust Token Lexing)

I've been trying to learn more about programming languages and their "under the hood" implementations. I've started trying to make my own very simple language to help with this, while also looking into how Rust itself is implemented as a guide.

The Cursor struct in Rust implements several helper functions (like first(), second(), and third()) that all call clone():

Is the input text really being copied so frequently, or is there some sort of clever compiler magic that just accesses a single copy of the input text? It seems rather inefficient to be copying text around so much (even if the text copying is fairly fast).

Put another way, why are there several calls to clone() instead of just accessing elements of a Vec or something like that? Is it just that using a Vec is not the approach functional programming takes?

are you refering to the clone() call as in self.chars.clone()? this just calls the clone() method of the Chars iterator, NOT the string itself.

many iterator types can be cloned, and can be cloned cheaply, and str::Chars is one such iterator type. this iterator is a wrapper of a byte slice iterator, which is just a pointer and an offset.

4 Likes

Ah, ok, so there is basically just one copy of the actual text/str, and the calls to clone() on the iterators are just copying pointers and offsets to the one str?

Yes, Chars holds just a couple of pointers, cloning it will just copy those, not the data they point to. It's basically a wrapper around a shared reference, except it isn't Copy because that can create footguns for iterators (like mistakingly calling .next() on a copy rather than the iterator you meant)

2 Likes

What does that mean in terms of code? I don't understand why calling next on a copy would cause trouble.

Calling next on a copy will advance that copy but not the original iterator, which can be very surprising if you later try to use the original iterator expecting it to have been advanced. What makes this bad is that copies can be created implicitly, so it's not always clear that a copy is happening.

1 Like

There is something missing with this explanation since that would seem to apply to any Copy types that have mutable methods—whether the methods are inherent or via traits.

fn main() {
    let mut x = 10;
    let y = x;
    x += 1;
    assert_ne!(x, y);
}

Most would not find the above surprising. I'm trying to at least partially formalize what it is about Iterators and other types that could be Copy but shouldn't due to this confusing behavior. I want to say "collection-like", but that would mean [T; N] shouldn't implement Copy, but it does.

Edit

That example is bad since it uses operands and not method syntax like next, but the question remains. Is the following confusing:

#[derive(Clone, Copy)]
pub struct Foo(u32);
impl Foo {
    pub fn bar(&mut self) {
        self.0 = self.0.wrapping_add(1);
    }
}

Maybe, but I think with proper documentation and a more realistic example many would not find it problematic. I "feel" like it has something to do with the fact that Iterators, while not required to, are expected to be mutated (e.g., via repeated called to next) until their state has changed such that next will always return None. Obviously for non-FusedIterators this isn't true though.

A separate issue is that an Iterator could have state that is expensive to duplicate, or that can't be duplicated safely. Therefore Copy is not appropriate.

That is true for non-Iterators too. The idea that Iterators shouldn't implement Copy I believe is separate; otherwise "lightweight" Iterators could implement Copy, but I believe a majority of people would still find that problematic. A type probably shouldn't implement Copy if it's expensive regardless of what the type represents. There is something else to it.

The simpliest example would be something like the following:

for item in copy_iterator {
    if should_skip_next(item) {
        copy_iterator.next(); // or advance_by
    }
    // do something else
}

And this is a pattern that people reach for pretty often! It's very common to consider an iterator as an entity with an identity that you mutate. On the other hand most Copy instances are generally considered as plain values that have no identity, even though you may call mutating methods on them.


As a sidenote, there exist a similar footgun with e.g. number: you can capture a copy into a move closure, and then mutate it, while still reading the old value outside. But this is rarer than the iterator issue, and numbers benefit a lot more by being Copy than Iterators. The only exception are ranges, because they are also often used as values rather than iterators, but there's already work in progress for new range types that implement Copy and IntoIterator, but not Iterator.

4 Likes

That's a great example; and to be clear, I wasn't disagreeing with the idea that Iterators probably shouldn't implement Copy but just the explanation. As a math guy, I do very much empathize with Haskellers in that "side effects" make things more complicated since algebraic reasoning cannot be used anymore. The Haskeller would say mutable methods shouldn't exist at all no matter if a type implements Copy or not; then even your closure example wouldn't be possible—since expecting the variable to have been mutated is absurd since mutations like that don't exist. Rust obviously has other goals, and I love the language. I'm just trying to generalize what it is about Iterators that makes their being Copyable problematic.