Code review for simple iterator implementation

I'm writing a corpus struct for working on text files. Below is the very first part of implementing an iterator to loop over all the sentences of a corpus. (I notice there's a split_terminator in std, but I'd like to keep the terminator in split string.

I'm very new to Rust and have some questions.

  1. Should I use a name like sentences() instead of split_terminator()?
  2. Should I add a new() function for Sentences?
  3. And I'm not clear if I should pass lines() or reader or even path to create Sentences in this case. The Sentences struct should be only created by split_terminator, but I'm not sure if there's some conventions or design to do this?
  4. What's the idiomatic way to handle errors when creating an iterator? The -> io::Result<Sentences> seems a bit weird. I didn't find much examples of this since functions that create iterators generally return some iterable directly like Lines, Chars, Split or SplitWhitespace without an option.
  5. Any suggestion further to write idiomatic code?

Thanks!

use std::collections::VecDeque;
use std::fs::File;
use std::io;
use std::io::{BufRead, BufReader};
use std::path::PathBuf;

pub struct Corpus {
    pub path: PathBuf,
}

pub struct Sentences {
    lines: std::io::Lines<BufReader<File>>,
    sentences: VecDeque<String>,
    pat: char,
}

impl Iterator for Sentences {
    type Item = String;

    fn next(&mut self) -> Option<Self::Item> {
        if self.sentences.is_empty() {
            match self.lines.next() {
                Some(Ok(line)) => {
                    self.sentences = line
                        .split_inclusive(self.pat)
                        .map(str::to_owned)
                        .collect::<VecDeque<_>>();
                    self.next()
                }
                None => None,
                _ => self.next(),
            }
        } else {
            Some(self.sentences.pop_front().unwrap())
        }
    }
}

impl Corpus {
    pub fn new<P: Into<PathBuf>>(path: P) -> Self {
        Corpus { path: path.into() }
    }

    pub fn split_terminator(&self, pat: char) -> io::Result<Sentences> {
        let file = File::open(self.path.as_path())?;
        let reader = BufReader::new(file);

        Ok(Sentences {
            lines: reader.lines(),
            sentences: VecDeque::new(),
            pat,
        })
    }
}

The name of the method should generally match the name of the iterator type.

No.

I'm sorry, I don't understand what you are asking here.

Creating an iterator shouldn't be fallible. It's somewhat unusual to wrap an io::Reader into an iterator; but if you need to do this, then your iterator's item type should be an io::Result instead.

1 Like

I would recommend the example of BufRead::lines. sentences is analogous. I.e., the iterator itself should yield io::Result (let the caller decide what to do if reading fails, instead of... recursing infinitely, which is what it looks like you're currently doing in the case of an error).

Having a PathBuf in Corpus and opening it in split_terminatorsentences (why pick a name from the standard library if it doesn't do something analogous?) doesn't make sense to me, because opening a file is something you generally want to do only once in a program. In other words, a Corpus is something you turn into a Sentences, so sentences should probably take self by value (or it should perhaps contain the BufReader directly, instead of a PathBuf). Does a Corpus always have to be from a File? Couldn't you have a corpus stored in a String (perhaps for testing) or a memory-mapped buffer (perhaps for performance)?

1 Like

Here are a couple of non-recursing versions of your iterator.

Question, though: is Lines really the correct abstraction? You're not guaranteed a sentence terminator every line of a file, and not every line ending will correspond to a sentence ending. Seems like you should read the file entire, or have a more complicated approach that appends lines or other buffer lengths until a sentence terminator or EOF is encountered.

See also BufRead::read_until, but that is byte-based and not char-based.

Thanks for reading my code!

I'm sorry, I don't understand what you are asking here.

Hello, I mean currently I'm passing lines() in split_terminator instead passing the reader or the path member of the Corpus struct to avoid file open errors when creating Sentences. But I'm not sure whether I should do that.

but if you need to do this, then your iterator's item type should be an io::Result instead.

Do you mean even the error from File::open I should handle it in Sentences?

Hi, thanks for reading my code.

the iterator itself should yield io::Result

The problem to me is, the error comes from File::open and only happen once. While BufRead::lines could emit errors for each line. I'm not sure where to put the File::open part. Is it strange to handle error from opening the file and reading lines together in the iterator?

Does a Corpus always have to be from a File ?
sentences should probably take self by value

My original thought: I don't want the corpus to be moved because one may loop the corpus multiple pass and the reading the whole corpus may consume lots of memory.

Having a PathBuf in Corpus and opening it in split_terminator sentences (why pick a name from the standard library if it doesn't do something analogous?) doesn't make sense to me, because opening a file is something you generally want to do only once in a program.

That's new to me, I've not thought this way. I'm used to write generators like this in Python (writing classes with a method yielding items from file). Before I wrote the code, I struggled in whether I should have this Corpus struct. (I want to have the ability for caller to loop over the corpus file multiple times, do some NLP processing like tokenizing, collocation extraction, ...) In fact, some of the processing can do with some chain calls like reader.lines().map().split_terminator()..., I'm not sure if I should abstract away them.

Do you have suggestions for me how to better organize the code in Rust? Should I move the corpus in this situation?

And thanks for the memory-mapped buffer idea!

Hi, your code looks much clearer than mine! I'm a bit confused about how does the loop exit (by returning None?)?

You're not guaranteed a sentence terminator every line of a file, and not every line ending will correspond to a sentence ending.

It seems str::split_inclusive could handle these cases properly.

is Lines really the correct abstraction?

I'm not sure whether I should add methods like this to create simple iterator wrappers. I want to have ability to do some preprocessing on the corpus text but some of them can be as easy as writing reader.lines().map().split().map()... by the caller.

Of course. That's line 26 with the question mark operator applied to self.lines.next().

You shouldn't try to open a file or perform other side effecting operations in a mere iterator constructor. In general, since files aren't the only kind of source one can read from, it's poor practice to store file names in data structures and then assume that the only possible way one may ever want to use the iterator is by reading from a file. You should make Corpus generic over any reader instead (or store a Box<dyn Read> in it) and store the reader itself, while perhaps providing a convenience constructor using a file path as well.

1 Like

Those are different errors. File::open can return some kinds of error but Lines::next can also return (different) errors. You're handling File::open but if Lines::next returns an error you just recurse (which is almost certain to do nothing useful).

Yes, that would be strange, since neither of those errors is part of the iterator's "core business", i.e. splitting a buffered reader into sentences. If the reader returns an error, that's not something you can usefully deal with at the level of next; it's the caller's problem. Similarly, I agree with H2CO3's comments re: performing complicated side effectful computation in an iterator constructor. Just take a <R: BufRead> and let the caller figure out how to get it to you; they have to, anyway.

That's exactly what your implementation does, though: cloning the corpus is cheap and easy because it's only a filename, and iterating twice is expensive because you have to open the file and parse the data (handling all errors) each time you iterate. If reading the whole file into memory at once is prohibitive (but this is often the fastest and best solution, at least when memory-mapping is not an option), you could use Seek to reset the read position at each call to sentences, and take &mut self, instead of repeatedly opening the file (which forces you to deal with an error each time, and is very liable to subject your code to race conditions with respect to the filesystem).

2 Likes

You can think of the ? (aka Try or question-mark) operator as working like so in Option-returning functions:

    // let foo = option_returning_fn()?;
    let foo = match option_returning_fn() {
        Some(thing) => thing,
        None => return None,
    };

(It's more complicated in reality so that ? will be able to support custom types eventually.)

Maybe you don't mean literal sentences, but I was thinking of something like

We come now to the most important idea in this whole book, the notion
of a generating function.  An infinite sequence <a0, ...> that we wish
to deal with in some way can conveniently be represented as a power
series an an auxiliary variable z.

If I break this up by lines, and then split on periods, I end up with

  • [line 0] Beginning of first sentence
  • [line 1] End of first sentence
  • [line 1] Beginning of second sentence
  • [line 2] Middle of second sentence
  • [line 3] End of second sentence
1 Like

That's line 26 with the question mark operator applied to self.lines.next() .

Didn't know ? works on Option. Thanks for letting me know that!

You should make Corpus generic over any reader instead (or store a Box<dyn Read> in it) and store the reader itself

Thanks for your advices! I'll try up update the code!

Thanks for your comment! I need some time to fully understand the details! :slight_smile:

but I was thinking of something like

Your are right :slight_smile: ! I missed this.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.