Why using the read_lines() iterator is much slower than using read_line()?

johnstef · April 20, 2023, 3:11pm

Hello, I am new to Rust; I started writing it just 3 days ago. I went through the Rust docs and read an example of how to read files efficiently, but that method (a for loop over lines()) is much slower than going through the file with the read_line() method from BufReader. For example, read1 is at least 2x faster than read2. Am I doing something wrong? Does anyone know the reason for that?

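use std::fs::File;
use std::io::{BufRead, BufReader};

// Reads the file line by line into a single String that is reused.
fn read1(file: File) -> std::io::Result<()> {
    let mut reader = BufReader::new(file);
    let mut _l = String::new();
    loop {
        // read_line() appends to _l and returns the number of bytes read (0 at EOF).
        let len = reader.read_line(&mut _l)?;
        if len == 0 {
            break;
        }
        _l.clear();
    }

    Ok(())
}

// Reads the file with the lines() iterator; each item is an io::Result<String>.
fn read2(file: File) {
    let reader = BufReader::new(file);
    for _line in reader.lines() {}
}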

Reust · April 20, 2023, 4:04pm

Just a suggestion: maybe read_line reads the whole file, while lines divides the whole file based on return characters...

johnstef · April 20, 2023, 4:16pm

read_line() reads just one line and appends it to the String you passed to it by mutable reference.

Reust · April 20, 2023, 4:21pm

@johnstef My bad, I didn't see the loop.

Michael-F-Bryan · April 20, 2023, 4:29pm

The lines() method automatically attached to anything implementing std::io::BufRead is really convenient, but it has a big performance problem: the iterator returns an owned String, so we need to allocate a new string on the heap for every single line in the file. Allocating memory is pretty fast, but when SSDs can read something like 200 MB/s and lines might only be 100 bytes long, doing all that allocation only to throw each line away immediately can add up.

On the other hand, calling read_line() and passing in an existing string (_l in your example) means we only ever have one string buffer and each line we read will just overwrite the previous. This avoids the need to allocate at the cost of losing access to a line once you've progressed to the next (because we just did _l.clear() and overwrote the buffer).
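To make the difference concrete, here is a minimal sketch of the two approaches side by side. It is written against any BufRead rather than a File so it can be tried on in-memory data; the function names are made up for illustration and are not from the original post.

```rust
use std::io::{self, BufRead};

/// Counts lines while reusing one `String` buffer (the `read1` approach).
fn count_lines_reusing_buffer<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut buffer = String::new();
    let mut count = 0;
    // `read_line` appends to `buffer` and returns the number of bytes read;
    // 0 means end of input.
    while reader.read_line(&mut buffer)? != 0 {
        count += 1;
        // `clear` drops the contents but keeps the capacity, so later
        // lines reuse the same heap allocation unless they are longer.
        buffer.clear();
    }
    Ok(count)
}

/// Counts lines with the `lines()` iterator (the `read2` approach).
fn count_lines_with_iterator<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for item in reader.lines() {
        // Each item is an `io::Result<String>`: a fresh owned `String`
        // that is dropped at the end of this iteration.
        let _line: String = item?;
        count += 1;
    }
    Ok(count)
}
```

Both functions can be called with a BufReader wrapping a File, or with std::io::Cursor::new("some\ntext") for a quick experiment. One difference to keep in mind: read_line() leaves the trailing newline in the buffer, while lines() strips it.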

johnstef · April 20, 2023, 5:58pm

OK, that's what I thought too, but I had to ask whether I was doing something wrong because of the "efficient method" example in the Rust docs.

Michael-F-Bryan · April 21, 2023, 3:31am

I'd say the "more efficient" bit is because if I've got a 10GB file, reading it entirely into memory would require about 10GB of RAM, whereas using for line in buf_reader.lines() and throwing line away at the end of each iteration will use memory proportional to the longest line.

So what we're seeing is that you have multiple approaches, each with their own strengths and weaknesses.

Method	Convenience	Access	Memory usage
`std::fs::read_to_string()`	It's just a string	Random access	Allocates a buffer the same size as the file
`buf_reader.lines()`	Iterator of strings	Streaming, can hold on to strings for random access	Allocates one string per line
`buf_reader.read_line(&mut buffer)`	Manual buffer management	Streaming only	Single buffer the size of the longest line
`mmap`	You can get a `&str`, but it's `unsafe` and platform-dependent	Random access	Zero [1]

That table isn't perfect, but you can see there's a general tradeoff between performance and convenience, with a naive std::fs::read_to_string() being the most convenient, and memory mapped files being the most performant - especially for larger files.

[1] From the perspective of your program and OOMs, that is. The OS will manage paging parts of the file into and out of memory for you.

steffahn · April 21, 2023, 5:17am

The property of being more “efficient” can only apply to differences between the two versions. This rust-by-example page is quite confusing in my opinion, the way it’s presented; it should be clearer about what kind of change was made to the code and should refrain from unnecessarily re-formatting and re-ordering the code.

The actual differences being made are that

the argument is generalized to any AsRef<Path>, which is a very minor efficiency benefit, and also a convenience benefit
the return type becomes Result, resulting in better error handling ability
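For reference, a function with those two properties could look roughly like the sketch below. This is my own reconstruction of the general shape for illustration, not a verbatim copy of the page's code.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Opens `filename` and returns an iterator over its lines.
/// Accepting any `AsRef<Path>` lets callers pass `&str`, `String`,
/// `&Path`, `PathBuf`, and so on without converting first.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<BufReader<File>>>
where
    P: AsRef<Path>,
{
    // A failure to open the file is reported to the caller via `?`
    // instead of panicking inside this function.
    let file = File::open(filename)?;
    Ok(BufReader::new(file).lines())
}
```

Note that the returned iterator still yields an io::Result<String> per line, so neither change affects how many Strings get allocated.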

The comment at the end of the page seems completely nonsensical.

“This process is more efficient than creating a String in memory especially working with larger files.”

This comment would make sense if the presented “Beginner friendly method” used read_to_string and a lines iterator on the String; but the way the code examples stand now, there’s barely any difference.

This problem seems to have been noticed already as the same page in the nightly docs features a “naive” example that’s actually doing something significantly different – based on their comment, that’s also the version that @Michael-F-Bryan seems to be describing above.

Funnily enough, using the supposedly “inefficient” read_to_string makes it a lot easier to avoid the overhead of creating lots of small owned Strings, because the lines iterator on str/String creates borrowed &strs by default anyway. Only by wrapping it in a helper function and adding something like .map(String::from) does it regain the overhead of creating many small Strings.
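A small sketch of that point, with hypothetical helper names. The contents argument stands in for whatever std::fs::read_to_string returned.

```rust
/// Borrowing version: every `&str` produced by `str::lines` points into
/// `contents`, so no per-line allocation happens here.
fn longest_line(contents: &str) -> Option<&str> {
    contents.lines().max_by_key(|line| line.len())
}

/// Owning version: `.map(String::from)` copies each line into its own
/// heap-allocated `String`, which brings the per-line overhead back.
fn owned_lines(contents: &str) -> Vec<String> {
    contents.lines().map(String::from).collect()
}
```

With let contents = std::fs::read_to_string(path)?; in scope, longest_line(&contents) does a single big allocation for the file and nothing per line, at the cost of keeping the whole file in memory.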

Also, these new nightly docs still contain severe issues, as they claim

“We also update read_lines to return an iterator instead of allocating new String objects in memory for each line.”

which is just plain wrong (or at least awfully misleading), as far as I can tell. It implies that the iterator would not be creating new String allocations for every line. While it’s true that creating a Vec holding all the Strings is avoided, that overhead was arguably negligible compared to all the allocations of the Strings themselves. While it’s true that the (shallow data of the) String “objects” is no longer placed on the heap, most people will understand “allocating new String objects” completely differently. Maybe it’s referring to the fact that the String allocations don't necessarily exist at the same time anymore, but as it’s written that’s not what this sentence conveys to me either.

I’ve raised an issue for these concerns.

