File reading iterator


#1

Hello everyone! A Python guy here, trying out Rust (begging pardon for any pythonisms!)

I’m writing a JSON parser as a learning project, and as a very first step I tried to create an iterator that would read a file in chunks of a pre-defined size and simply yield them to a consumer (the next step is to extract lexemes from those chunks but I’m just doing baby steps for now…)

I ended up with some mostly working code and I’d like to have a sanity check on that, please! Here’s the code:

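use std::fs::File;
use std::io::Read;
use std::str;

// Chunk size; the value is not shown in the post, 4096 is assumed here.
const BUFSIZE: usize = 4096;

struct Lexer {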
    buf: [u8; BUFSIZE],
    f: File,
}

impl Iterator for Lexer {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        match self.f.read(&mut self.buf) {
            Err(error) => panic!("Can't read: {}", error),
            Ok(0) => None,
            Ok(result) => Some(str::from_utf8(&self.buf[0..result]).unwrap().to_string()),
        }
    }
}

fn lexer(filename: &str) -> Lexer {
    Lexer {
        buf: [0; BUFSIZE],
        f: match File::open(filename) {
            Err(error) => panic!("Can't open {}: {}", filename, error),
            Ok(result) => result,
        },
    }
}

A few questions:

  • Is this whole approach idiomatic in Rust at all? I mean, having a lexer that you could iterate as for lexeme in lexer("somefile")?
  • My Lexer struct has a std::fs::File field but what I actually need is anything implementing std::io::Read. I know I can declare the struct as Lexer<T: Read> { f: T } but then I can’t quite figure out the syntax for the trait implementation and the factory function.
  • How to declare the type for an Iterator to be &str (there’s no need to create full-blown strings anyway)?

Thanks!


#2

I think it’s idiomatic: iterators are nice.

One might write:

struct Lexer<R: Read> { ... }
impl<R: Read> Iterator for Lexer<R> {
    type Item = String;
}

fn lexer(filename: &str) -> Lexer<File> {
    ...
}

In this case, the lexer function is specialised to returning a lexer over a File, but one could change it to fn lexer<R: Read>(reader: R) -> Lexer<R> { ... }.
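Filling in the elided parts, a minimal sketch of the generic version could look like the following. The field name reader and the small BUFSIZE value are assumptions made for illustration; the body of next is taken from the original post.

```rust
use std::io::Read;
use std::str;

// Assumed chunk size, kept tiny so the chunking is easy to observe.
const BUFSIZE: usize = 8;

struct Lexer<R: Read> {
    buf: [u8; BUFSIZE],
    reader: R,
}

impl<R: Read> Iterator for Lexer<R> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        match self.reader.read(&mut self.buf) {
            Err(error) => panic!("Can't read: {}", error),
            Ok(0) => None,
            // Caveat: this unwrap panics if a multi-byte UTF-8 character
            // happens to be split across two chunks.
            Ok(n) => Some(str::from_utf8(&self.buf[..n]).unwrap().to_string()),
        }
    }
}

// Generic constructor: accepts anything that implements Read,
// for example a File, a TcpStream, or a byte slice.
fn lexer<R: Read>(reader: R) -> Lexer<R> {
    Lexer { buf: [0; BUFSIZE], reader }
}
```

With this shape the caller decides where the bytes come from, e.g. lexer(File::open("somefile").unwrap()) for a file or lexer(&b"[1, 2]"[..]) for in-memory data, and the for chunk in lexer(...) loop stays the same.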

As for yielding &str instead of String, there is a reason the item type is String: you are constructing the string entirely within the next function, so a returned reference would have nothing to point at. Hence, you need to return an owned data structure, and String is the appropriate one for textual data like this.

In particular, it isn’t possible to point to the buf field of self, since the yielded elements cannot depend directly on the memory owned by self (there’s no lifetime connecting the &mut self to the return value in the declaration of the Iterator trait).
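The restriction comes from the signature of Iterator::next, not from the buffer itself: an ordinary method is free to tie its return value to the &mut self borrow. As a rough sketch (ChunkReader, next_chunk and total_len are hypothetical names, and the BUFSIZE value is assumed), a non-Iterator version that hands out &str slices of its own buffer might look like this:

```rust
use std::io::Read;
use std::str;

// Assumed chunk size, kept tiny for illustration.
const BUFSIZE: usize = 8;

struct ChunkReader<R: Read> {
    buf: [u8; BUFSIZE],
    reader: R,
}

impl<R: Read> ChunkReader<R> {
    fn new(reader: R) -> Self {
        ChunkReader { buf: [0; BUFSIZE], reader }
    }

    // The elided lifetime links the returned &str to the &mut self borrow,
    // which is exactly the link the Iterator trait cannot express.
    fn next_chunk(&mut self) -> Option<&str> {
        match self.reader.read(&mut self.buf) {
            Err(error) => panic!("Can't read: {}", error),
            Ok(0) => None,
            // Still panics if a chunk ends in the middle of a UTF-8 character.
            Ok(n) => Some(str::from_utf8(&self.buf[..n]).unwrap()),
        }
    }
}

// Example consumer: sums chunk lengths without allocating any String.
fn total_len<R: Read>(reader: R) -> usize {
    let mut chunks = ChunkReader::new(reader);
    let mut total = 0;
    // Each chunk must be finished with before next_chunk is called again.
    while let Some(chunk) = chunks.next_chunk() {
        total += chunk.len();
    }
    total
}
```

The price is that this is no longer an Iterator, so for loops and adapters such as map or collect are unavailable, and a chunk cannot be kept around once the next one has been requested.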


#3
fn lexer(filename: &str) -> Lexer<File>

Thanks! I was close to that :smile: Next question is what do I do if I want another function accepting URLs instead of file names? The return type would probably be Lexer<SomeHTTPThing> but how do I make the function name generic?

In particular, it isn’t possible to point to the buf field of self, since the yielded elements cannot depend directly on the memory owned by self (there’s no lifetime connecting the &mut self to the return value in the declaration of the Iterator trait).

Is there a way around that? Because physically the buffer sits there for a long time anyway and it makes sense to avoid allocating and copying every lexeme. (This is a purely theoretical question for now, I’m not trying to prematurely optimize it!)