Can I use BufReader::with_capacity for a large file?

Hi All,

I am reading a 40 MB file. As per the docs, BufReader is not suitable for larger files. I tried two cases:

  • Case 1
    Use BufReader - reading was blazing fast but I got 0s :frowning:
  • Case 2
    Use File directly - reading was slow but I got correct values

I am evaluating Rust to port one of my tools so that we get better performance while reading 10000 files.

I saw the option BufReader::with_capacity(40000000, f);. Can I use this to get blazing-fast and correct results? Are there any drawbacks?

BufReader is suitable for large files. What the docs say is that BufReader wouldn't be beneficial if you always read from the file in large chunks, like dozens of kilobytes at a time.

If you have large files but tend to read only a few bytes at a time, like file.read_u32::<BE>() using the byteorder crate, then BufReader will significantly improve performance.

If your file is small enough, say under a gigabyte, and you want to read it as a whole eagerly, you can read it into a Vec<u8> or String using std::fs::read() or std::fs::read_to_string(). This is the fastest way to read the file itself, but it uses more memory as the file gets larger.
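To make the two styles concrete, here is a minimal sketch (the file name and sample bytes are made up for the example). It shows a small buffered read of a big-endian u32 using only the standard library, plus the eager whole-file read with std::fs::read:

```rust
use std::fs;
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    // Write a tiny sample file so the example is self-contained.
    fs::write("sample.bin", [0x00, 0x00, 0x00, 0x2A])?;

    // Style 1: many small reads, where BufReader shines. Read a
    // big-endian u32 without the byteorder crate.
    let f = fs::File::open("sample.bin")?;
    let mut reader = BufReader::new(f);
    let mut word = [0u8; 4];
    reader.read_exact(&mut word)?;
    let value = u32::from_be_bytes(word);
    assert_eq!(value, 42);

    // Style 2: eager whole-file read into a Vec<u8>.
    let bytes = fs::read("sample.bin")?;
    assert_eq!(bytes.len(), 4);

    fs::remove_file("sample.bin")?;
    println!("ok");
    Ok(())
}
```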

2 Likes

Thanks for the quick reply. The file I am reading is a binary file. The first few bytes (approx 100) are of mixed data types, so I read those bytes in chunks using read() and convert them to f32 or i16. The last read is the biggest one: it creates an i16 array of 21210000 elements. I am seeing 0s only in this last read.

Do you think byteorder crate can be used in my situation?

Edit: I looked into the API and I think I am doing mostly what it does, so it will also have the same problem :frowning:. I was thinking: what if I read the entire file using BufReader as you suggested, then use indexing to transform it into my data types? Will this work? Will this avoid the 0s?

Or can I break the 21210000-element array into smaller pieces and use a for loop to read the data? If yes, how can I optimise the array concatenation? I think joining Vec<i16>s will be costly.

This sounds like a different problem. Could you describe it more?

If you're getting 0 as a result from read(), then you've reached the end of the file. There isn't a different API that will change that.

If the bytes you read are 0s, then chances are that's what's in the file.

These APIs are suboptimal in different situations, but none of them, BufReader included, should ever give incorrect data. If you're getting incorrect data from the file, then something else is wrong.

1 Like

Sorry, I meant I initialised the array with 0s and I see only those values for the huge file. I am not sure whether I reached the end of the file or was imagining things. Now I have reverted the code to read only via BufReader and, strangely, it works. I swear I saw 0s (the initialised values), then I removed BufReader and saw correct values, which is why I raised this question. Now I have reverted to BufReader and it works again.

Could there be inconsistencies depending on --debug vs --release, or on the machine's available memory at the time?

One more question, since you are here and you helped me a lot earlier.
I have a function that returns MyStruct containing the big array, and I can see the performance hit. Can I return the big array with some additional variables as a tuple to improve performance? I read on the internet that using a struct will force the variable onto the heap, which causes a performance hit.

If there's a bug in the program which stops reading early or doesn't keep track of the length correctly, then this could definitely be the case. It could also be affected by OS caching, and by how many bytes read() returns each time.

This shouldn't happen without something else being wrong, though. I'm fairly confident that both File and BufReader are correct.

I believe a struct and a tuple, containing the same data, behave identically. So there shouldn't be any performance difference.

With the heap, I don't think this will ever happen. Rust is very explicit about heap allocation - structs are never automatically allocated. Some other languages, like Java, have all classes heap allocated - but this doesn't happen in Rust.

As for performance, heap allocating the big array might actually be a good idea! If it's bigger than one or two MB, then storing it in a heap-allocated array like a Box<[T; N]> or Box<[T]> or in a Vec<...> is better than using a stack-allocated fixed-size array [T; N]. Returning from a function moves the result, and moving a pointer to data on the heap is a lot faster than moving all of the data.

Assuming you're currently storing a [T; N], try a Box<[T; N]>. Besides that, I think I'd need to see more code to offer optimization suggestions.
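As an illustration of the advice above, here is a hedged sketch of returning heap-backed data (the struct fields mirror the snippet later in the thread; the sizes are just placeholders):

```rust
// Sketch: keep the large array on the heap so returning the struct
// only moves a pointer, not 21 million i16s.
struct MyStruct {
    small_data: Vec<f32>,
    big_data: Box<[i16]>, // heap-allocated slice
}

fn load() -> MyStruct {
    // A Vec is already heap-backed; into_boxed_slice drops spare capacity.
    let big: Vec<i16> = vec![0i16; 21_210_000];
    MyStruct {
        small_data: vec![0.0; 1000],
        big_data: big.into_boxed_slice(),
    }
}

fn main() {
    let m = load();
    assert_eq!(m.big_data.len(), 21_210_000);
    assert_eq!(m.small_data.len(), 1000);
    println!("loaded {} elements", m.big_data.len());
}
```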

Thanks @daboross. Your explanation of struct vs tuple is very helpful :slight_smile:. I think I will stick with Box<[T; N]>.

However, I didn't get the OS caching part. My code snippet is something like this:

        let mut f = fs::File::open(path).expect("no file found");
        let mut f = BufReader::new(f); // BufReader::with_capacity(40000000, f);
        let mut buf1 = [0u8; 1 * 4];
        let mut buf2 = [0u8; 6 * 4];
        let mut buf3 = vec![0u8; 10 * 4];
        f.read(&mut buf1).expect("Error");
        f.read(&mut buf2).expect("Error");
        f.read(&mut buf3).expect("Error");
        let desc = String::from_utf8(buf3).expect("Found invalid UTF-8");
        let mut buf4 = vec![0u8; 1000 * 4];
        f.read(&mut buf4).expect("Error");
        let small_data = byte_array_to_f32_vec(&buf4);
        let mut buf5 = vec![0u8; 30000 * 1000 * 2];
        f.read(&mut buf5).expect("Error");
        let big_data = byte_array_to_i16_vec(&buf5);

        let mine = MyStruct {
            small_data: small_data,
            big_data: big_data,
        };
        return mine;

This is my snippet. Do you think I might face incorrect data due to memory?
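(For reference, the helpers in the snippet are not shown; byte_array_to_i16_vec is presumably something like the following sketch, assuming the file stores little-endian i16 values:)

```rust
// Hypothetical implementation of the snippet's byte_array_to_i16_vec,
// assuming little-endian byte order in the file.
fn byte_array_to_i16_vec(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2) // each i16 is 2 bytes; any trailing byte is ignored
        .map(|c| i16::from_le_bytes([c[0], c[1]]))
        .collect()
}

fn main() {
    let raw = [0x01, 0x00, 0xFF, 0xFF]; // 1 and -1 in little-endian
    assert_eq!(byte_array_to_i16_vec(&raw), vec![1, -1]);
}
```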

1 Like

I don't think these lines are doing what you expect.

f.read(&mut buf1) means "read up to buf1.len() bytes into buf1". This call will read anywhere from 1 to 4 bytes. See the Read::read docs. This is why read() returns an integer! The return value is the number of bytes read.

As for the solution, I think f.read_exact (Read::read_exact) will behave the way you're expecting read to.

My point was that in incorrect code, whether or not the operating system has the file cached can affect the code's behavior. As an example, we can walk through some possible behavior with your snippet!

Depending on how much of the file the OS caches, read() will return differently. Say it has the entire file cached, since you read it recently. It will probably be very fast for the OS to get the bytes you need - so read() on a 4000 byte buffer has a decently good chance to read 4000 bytes.

But if the file hasn't been read recently, then your operating system won't have it cached, so it will have to go to disk. This takes longer, and you have a better chance of reads not reading the maximum number of bytes. If the OS has some of the bytes, and it has to go to disk again, rather than waiting for that, it will instead just give you what it has. So when you read on a 4000 byte buffer, the OS likely will get some of those bytes - maybe 1024 bytes, then return to you. You would then need to call read again to get the rest.

The Read::read documentation itself goes over this in more detail, if you want.
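Applied to the snippet above, the fix is mechanical: swap each read for read_exact. A minimal sketch (the file name and sizes here are placeholders, not the real ones):

```rust
use std::fs;
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    // Sample file so the sketch runs: a 4-byte field plus a 24-byte field.
    fs::write("header.bin", vec![7u8; 4 + 24])?;

    let f = fs::File::open("header.bin")?;
    let mut r = BufReader::new(f);

    let mut buf1 = [0u8; 4];
    let mut buf2 = [0u8; 24];
    // read_exact fills the whole buffer or returns an error (UnexpectedEof),
    // so a partial read can never slip through silently.
    r.read_exact(&mut buf1)?;
    r.read_exact(&mut buf2)?;
    assert!(buf1.iter().all(|&b| b == 7));

    fs::remove_file("header.bin")?;
    Ok(())
}
```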

1 Like

Okay. Does that mean I should replace all reads with read_exact? Since it said no guarantees, I was hesitant to use it.

No guarantees are provided about the contents of buf when this function is called, implementations cannot rely on any property of the contents of buf being true

Maybe, as you suggested, when I used read on vec![0u8;30000*1000*2] it didn't fill fully earlier.

Yep.

If you ever use read, you need to use it in a loop, and call it again when it returns less than your total number of bytes. Calling it once is pretty much never right.

read_exact does that for you by calling read in a loop, and only returning once you've actually read the right number of bytes, or the file has ended.
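The loop that read_exact performs internally can be sketched by hand (read_full is a made-up name for illustration; it differs from read_exact in that it returns the count instead of erroring at EOF):

```rust
use std::io::Read;

// Hand-rolled loop with the same spirit as read_exact: keep calling
// read() until the buffer is full or the reader reports end-of-file.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break; // EOF: the source ended before the buffer filled
        }
        filled += n;
    }
    Ok(filled)
}

fn main() -> std::io::Result<()> {
    // &[u8] implements Read, so it works as a stand-in for a file.
    let data = [1u8, 2, 3, 4, 5];
    let mut src: &[u8] = &data;
    let mut buf = [0u8; 8];
    let n = read_full(&mut src, &mut buf)?;
    assert_eq!(n, 5); // only 5 bytes were available
    assert_eq!(&buf[..n], &data);
    Ok(())
}
```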

Even with read_exact, it's worth storing the returned integer and checking that it matches your expected size. When it doesn't, that means the file ended before you could read enough, and you will have the unfilled 0s. Nevermind this comment - it looks like read_exact returns an error if there are insufficient bytes. Just calling read_exact should be good.

This is a note to implementors of read_exact.

Note that they talk about when this function is called, not when it returns. They are talking about the state of the buffer when you pass it in, not the result you get out! The next part of it says "implementations cannot rely on..."

Since you are calling read_exact, not implementing it, you can and should ignore this note.

Edit: I think the docs for this could definitely be improved, though... They are unnecessarily negative. I'm going to open an issue on the rust-lang repo for this.

Thanks @daboross - Today I learnt about the difference, and I am happy I am contributing to the Rust community, at least on the documentation side :slight_smile:

1 Like

I'm glad!

I've just filed rust-lang/rust#72186 for fixing up this paragraph and making it less off-putting.

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.