Why is using the read_lines() iterator much slower than using read_line()?

Hello, I am new to Rust; I started writing it just 3 days ago. I went through the Rust docs and read an example of how to read files efficiently, but that method (iterating over lines() in a for loop) is much slower than going through the file with the read_line() method from BufReader. For example, read1 is at least 2x faster than read2. Am I doing something wrong? Does anyone know the reason for that?

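use std::fs::File;
use std::io::{BufRead, BufReader};

// Reuses a single String buffer for every line.
fn read1(file: File) -> std::io::Result<()> {
    let mut reader = BufReader::new(file);
    let mut _l = String::new();
    loop {
        // read_line appends to the buffer and returns the number of bytes read.
        let len = reader.read_line(&mut _l)?;
        if len == 0 {
            break; // end of file
        }
        _l.clear();
    }

    Ok(())
}

// Gets a newly allocated String for every line.
fn read2(file: File) {
    let reader = BufReader::new(file);
    for _line in reader.lines() {}
}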

Just a suggestion: maybe read_line reads the whole file, while lines divides the whole file based on return characters...

read_line() reads just one line and puts it in the buffer you passed to it by reference.

@johnstef My bad, I didn't see the loop.

The lines() method automatically attached to anything implementing std::io::BufRead is really convenient but it has a big performance problem - the iterator returns an owned String so we need to allocate a new string on the heap for every single line in the file. Allocating memory is pretty fast, but when SSDs can read something like 200MB/s and lines might only be 100 bytes long, doing all that allocation only to throw a line away immediately can add up.

On the other hand, calling read_line() and passing in an existing string (_l in your example) means we only ever have one string buffer and each line we read will just overwrite the previous. This avoids the need to allocate at the cost of losing access to a line once you've progressed to the next (because we just did _l.clear() and overwrote the buffer).
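As a rough illustration of the difference, here is a minimal sketch that counts lines both ways. The function names count_with_lines and count_with_read_line are made up for this example, and both are generic over BufRead (an assumption, so they can run on an in-memory byte slice as well as a BufReader<File>).

```rust
use std::io::{self, BufRead};

/// Counts lines using the `lines()` iterator: every item is an owned,
/// newly allocated `String`.
fn count_with_lines<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        // The String is dropped (and its memory freed) at the end of each iteration.
        let _line: String = line?;
        count += 1;
    }
    Ok(count)
}

/// Counts lines using `read_line()` with a single reusable buffer.
fn count_with_read_line<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut buf = String::new();
    let mut count = 0;
    // `read_line` appends to `buf` and returns 0 at end of input.
    while reader.read_line(&mut buf)? != 0 {
        count += 1;
        // `clear` empties the string but keeps its capacity, so later
        // lines reuse the same allocation.
        buf.clear();
    }
    Ok(count)
}
```

Both functions return the same count for the same input; the only intended difference is how many String allocations happen along the way.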


OK, that's what I thought too, but I had to ask whether I was doing something wrong because of the "efficient method" example in the Rust docs.

I'd say the "more efficient" bit is because if I've got a 10GB file, reading it entirely into memory would require about 10GB of ram, whereas using for line in buf_reader.lines() and throwing line away at the end of each iteration will use memory proportional to the largest line.

So what we're seeing is that you have multiple approaches, each with their own strengths and weaknesses.

| Method | Convenience | Access | Memory Usage |
|---|---|---|---|
| std::fs::read_to_string() | It's just a string | Random access | Allocates a buffer the same size as the file |
| buf_reader.lines() | Iterator of strings | Streaming, can hold on to strings for random access | Allocates one string per line |
| buf_reader.read_line(&mut buffer) | Manual buffer management | Streaming only | Single buffer the size of the longest line |
| mmap | You can get a &str, but it's unsafe and platform-dependent | Random access | zero[1] |

That table isn't perfect, but you can see there's a general tradeoff between performance and convenience, with a naive std::fs::read_to_string() being the most convenient, and memory mapped files being the most performant - especially for larger files.


[1] From the perspective of your program and OOMs. The OS will manage paging parts of the file into and out of memory for you.


The property of being more “efficient” can only apply to differences between the two versions. This rust-by-example page is quite confusing in my opinion, the way it’s presented; it should be clearer about what kind of difference was made to the code and should refrain from unnecessarily re-formatting and re-ordering the code.

The actual differences are that:

  • the argument is generalized to any AsRef<Path>, which is a very minor efficiency benefit, and also a convenience benefit
  • the return type becomes Result, resulting in better error handling ability
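To make those two changes concrete, here is a sketch of what a read_lines helper with both of them might look like. It is written from the description above and is not necessarily the exact code on the Rust by Example page.

```rust
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;

// Accepts anything that can be viewed as a path (&str, String, PathBuf, ...)
// and reports a failure to open the file through the returned Result
// instead of panicking.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}
```

Note that the returned io::Lines iterator still yields io::Result<String>, so each line is an owned String either way.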

The comment at the end of the page seems completely nonsensical.

This process is more efficient than creating a String in memory especially working with larger files.

This comment would make sense if the presented “Beginner friendly method” used read_to_string and a lines iterator on the String; but the way the code examples stand now, there’s barely any difference.

This problem seems to have been noticed already as the same page in the nightly docs features a “naive” example that’s actually doing something significantly different – based on their comment, that’s also the version that @Michael-F-Bryan seems to be describing above.

Funnily enough, using the supposedly “inefficient” read_to_string makes it a lot easier to avoid the overhead of creating lots of small owned Strings, because the lines iterator on str/String creates borrowed &strs by default anyways, and only by wrapping it into a helper function and adding something like .map(String::from), it re-gains the overhead of creating many small Strings.
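A small sketch of that point, using hypothetical helper names (longest_line and owned_lines are made up for illustration). The text argument stands in for the result of std::fs::read_to_string.

```rust
/// Returns the longest line as a slice borrowed from `text`.
fn longest_line(text: &str) -> Option<&str> {
    // `str::lines` yields `&str` slices pointing into `text`;
    // no per-line `String` is created.
    text.lines().max_by_key(|line| line.len())
}

/// Opts back in to one heap allocation per line.
fn owned_lines(text: &str) -> Vec<String> {
    text.lines().map(String::from).collect()
}
```

In a real program you would call something like longest_line(&std::fs::read_to_string(path)?), paying for one large buffer up front but nothing per line.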

Also, these new nightly docs still contain severe issues, as they claim

We also update read_lines to return an iterator instead of allocating new String objects in memory for each line.

which is just plain wrong (or at least awfully misleading), as far as I can tell. It implies that the iterator would not be creating new String allocations for every line. While it’s true that creating a Vec holding all the Strings is avoided, that overhead was arguably negligible compared to all the allocations of the Strings themselves. While it’s true that the (shallow data of the) String “objects” is no longer placed on the heap, most people will understand “allocating new String objects” completely differently. Maybe it’s referring to the fact that the String allocations don't necessarily exist at the same time anymore, but as it’s written that’s not what this sentence conveys to me either.

I’ve raised an issue for these concerns.

