Buffered read and write


#1

Hello.
I want to read, manipulate and write a big file efficiently. I have set up an input buffer, an output buffer and an intermediate string buffer on which I will manipulate the data. So far I have gotten to this, but I can’t find a proper method to write to the BufWriter.

use std::io::{self, BufReader, BufWriter};
use std::io::prelude::*;

fn main() {
    let mut sb = BufReader::new(io::stdin()); // source buffer
    let mut ob = BufWriter::new(io::stdout()); // output buffer
    let mut ib = String::new();  // intermediate buffer
    let _ = sb.read_to_string(&mut ib); 
    // manipulation takes place here. For example, replace all the '\t' with ' '
    // write to the buffer
}
  1. What method should I use to write to ob?

  2. I want to set all of these buffer sizes to the OS page size. (By OS I mean the OS I am using, so it’s known at compile time.)

  3. ib should allocate enough memory up front to hold one buffer’s worth of data.

  4. Even after this batch of file I/O, I would like to do the same thing again later on. I do not want to create a new ib, as that would allocate more memory. Is it possible to re-use the String buffer? (Note that read_to_string() already takes one mutable reference to the String.)

Thanks.


#2

To write the buffer:

ob.write_all(ib.as_bytes());
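
To show the write step in context, here is a minimal sketch of the whole read, manipulate, write cycle from the first post. The function name replace_tabs and the generic reader/writer parameters are assumptions made for illustration so the code is not tied to stdin/stdout. Note that write_all() returns a Result, and that an explicit flush() reports write errors which would otherwise be silently dropped when the BufWriter goes out of scope.

```rust
use std::io::{self, BufReader, BufWriter, Read, Write};

/// Reads all of `input`, replaces every tab with a space, and writes the
/// result to `output` through a BufWriter. (Hypothetical helper name.)
pub fn replace_tabs<R: Read, W: Write>(input: R, output: W) -> io::Result<()> {
    let mut sb = BufReader::new(input); // source buffer
    let mut ob = BufWriter::new(output); // output buffer
    let mut ib = String::new(); // intermediate buffer

    // Pull the whole input into the intermediate buffer.
    sb.read_to_string(&mut ib)?;

    // Manipulation step: replace all the '\t' with ' '.
    let replaced = ib.replace('\t', " ");

    // write_all() keeps writing until every byte has been accepted.
    ob.write_all(replaced.as_bytes())?;

    // Flush explicitly so a failed final write is reported to the caller.
    ob.flush()
}
```

From main this could be called as replace_tabs(io::stdin(), io::stdout()), with the returned Result handled however suits the program.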

For the OS page size, do you want to discover it dynamically or hardcode it at compile time based on the target OS? It’s not quite clear, because the page size can differ on the same OS (e.g. hugepages/THP).

You can allocate a String with reserved capacity upfront: String::with_capacity().

You can reuse the String buffer - just keep it alive somewhere, such as in a field of a struct that lives across the IO batches.
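
As a rough sketch of the last two points: the struct below owns the String, reserves capacity once, and calls clear() before each batch. clear() empties the String but keeps its allocation. The mutable borrow taken by read_to_string() ends when the call returns, so it does not prevent reuse. The names TabReplacer, run and capacity are made up for this example.

```rust
use std::io::{self, Read, Write};

/// Owns the intermediate buffer so it survives across I/O batches.
/// (Hypothetical type, for illustration only.)
pub struct TabReplacer {
    ib: String,
}

impl TabReplacer {
    /// Reserves `cap` bytes up front; String capacity is counted in bytes.
    pub fn with_capacity(cap: usize) -> Self {
        TabReplacer { ib: String::with_capacity(cap) }
    }

    /// Runs one batch: read everything, replace tabs with spaces, write out.
    pub fn run<R: Read, W: Write>(&mut self, mut input: R, mut output: W) -> io::Result<()> {
        // Drop the previous batch's contents but keep the allocation.
        self.ib.clear();
        input.read_to_string(&mut self.ib)?;

        // Write the pieces between tabs, with a space in place of each tab,
        // so no second String is allocated for the result.
        for (i, piece) in self.ib.split('\t').enumerate() {
            if i > 0 {
                output.write_all(b" ")?;
            }
            output.write_all(piece.as_bytes())?;
        }
        output.flush()
    }

    /// Current capacity of the intermediate buffer, in bytes.
    pub fn capacity(&self) -> usize {
        self.ib.capacity()
    }
}
```

The buffer can still grow if a batch is larger than the reserved capacity; it just never shrinks between batches.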


#3

do you want to dynamically discover the page size or hardcode at compile time based on target OS?

I want this to be detected at compile time, so that when the user compiles the program it sets the buffer size accordingly. Something similar to http://man7.org/linux/man-pages/man2/getpagesize.2.html so I can use this value for all three of those buffers.


#4

I don’t believe the compiler exposes this information (for good reason, IMO). You might be able to write a build.rs file that queries the OS for page size and then sets that as a CLI/env arg that your code examines.

But really, I’d urge you to not go down this path. You can assume a 4K (or multiple thereof) page just as well. You will have to tune buffer size to get the best IO performance - it’s very unlikely to be a single OS page.

Otherwise, query it at runtime if you really want correct information.
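
For what it’s worth, here is one way the build.rs idea might look using only the standard library. It assumes a POSIX-like system where the getconf command is on the PATH, and falls back to 4096 otherwise. The function names and the fallback constant are assumptions for this sketch, not an established API.

```rust
use std::process::Command;

/// Value used when the page size cannot be determined (assumed default).
pub const FALLBACK_PAGE_SIZE: usize = 4096;

/// Parses the textual output of `getconf PAGESIZE`, rejecting zero.
pub fn parse_page_size(text: &str) -> Option<usize> {
    text.trim().parse::<usize>().ok().filter(|&n| n > 0)
}

/// Asks the OS for its page size via `getconf PAGESIZE`.
/// Returns FALLBACK_PAGE_SIZE if the command is missing, fails,
/// or prints something that is not a positive integer.
pub fn detect_page_size() -> usize {
    Command::new("getconf")
        .arg("PAGESIZE")
        .output()
        .ok()
        .filter(|out| out.status.success())
        .and_then(|out| parse_page_size(&String::from_utf8_lossy(&out.stdout)))
        .unwrap_or(FALLBACK_PAGE_SIZE)
}
```

In a build script, main could print the line cargo:rustc-env=PAGE_SIZE=<value> (PAGE_SIZE being a name chosen here), and the crate could then read it with env!("PAGE_SIZE"). Keep in mind that this reports the page size of the machine doing the build, which is not necessarily the machine that runs the binary.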


#5

Thanks for the suggestion. Now the only part I am stuck on is figuring out how to read into the input buffer in batches of 4096 bytes until it reaches EOF (or stops reading once the source buffer contains EOF). This is my mental picture (sorry for the noob-ness):

use std::io::{self, BufReader, BufWriter};
use std::io::prelude::*;

fn main() {
    let cap: usize = 4096;
    let mut sb = BufReader::with_capacity(cap, io::stdin()); // source buffer
    let mut ob = BufWriter::with_capacity(cap, io::stdout()); // output buffer
    let mut ib = String::with_capacity(cap);  // intermediate buffer; String capacity is measured in bytes, not chars
    // while !eof:
    //      ib.push(generated string)
    //      // as I have already set the input buffer limit, I hope it reads only until sb can hold
    //      if ib reaches capacity, move its content to ob; ib.clear()
}

#6

What kind of manipulation are you going to perform on the data? You initially mentioned a single (ascii) char replacement. Something like that can be done in a streaming manner, without an intermediate buffer at all.
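
A sketch of what the streaming version could look like for the tab replacement, assuming byte-level processing is acceptable (a tab is a single ASCII byte and never occurs inside a multi-byte UTF-8 sequence). It works directly on the BufReader’s internal buffer via fill_buf() and consume(), so there is no separate intermediate buffer. The function name is made up for the example.

```rust
use std::io::{self, BufRead, BufReader, BufWriter, Read, Write};

/// Copies `input` to `output`, replacing each tab byte with a space,
/// one BufReader-sized chunk at a time. (Hypothetical helper name.)
pub fn stream_replace_tabs<R: Read, W: Write>(input: R, output: W) -> io::Result<()> {
    let mut reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);

    loop {
        // Borrow whatever the reader currently has buffered.
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // an empty buffer from fill_buf() means EOF
        }
        let len = chunk.len();

        // Write the runs between tabs, emitting a space for each tab.
        for (i, piece) in chunk.split(|&b| b == b'\t').enumerate() {
            if i > 0 {
                writer.write_all(b" ")?;
            }
            writer.write_all(piece)?;
        }

        // Tell the reader those bytes have been handled.
        reader.consume(len);
    }

    writer.flush()
}
```

Memory use stays bounded by the two buffer capacities regardless of the file size, which is the main difference from the read_to_string() approach.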

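But ok, if you need a bounded buffer, then I think you may as well use a stack allocated array (i.e. [u8; 4096]) that you read_exact() into; EOF will be signaled back to you via the error you get back (ErrorKind::UnexpectedEof). One caveat: the buffer contents are unspecified in that case, so if the final chunk can be shorter than 4096 bytes, loop on read() and use the returned length instead. If you don’t want it stack allocated you can put it on the heap by boxing it.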
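
Below is a minimal sketch of the bounded-buffer loop. It deliberately uses read() rather than read_exact(), so a short final chunk is still processed; read() returning Ok(0) marks EOF. The function name and the returned byte count are choices made for this example.

```rust
use std::io::{self, ErrorKind, Read, Write};

/// Copies `input` to `output` through a fixed 4096-byte stack buffer,
/// replacing tabs with spaces. Returns the number of bytes copied.
/// (Hypothetical helper name.)
pub fn copy_in_chunks<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<u64> {
    let mut buf = [0u8; 4096];
    let mut total = 0u64;

    loop {
        // read() may return fewer bytes than the buffer holds; 0 means EOF.
        let n = match input.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };

        // Manipulate only the part of the buffer that was actually filled.
        for b in &mut buf[..n] {
            if *b == b'\t' {
                *b = b' ';
            }
        }

        output.write_all(&buf[..n])?;
        total += n as u64;
    }

    output.flush()?;
    Ok(total)
}
```

Wrapping the output in a BufWriter before passing it in would batch the small writes; the input needs no BufReader here because each read() already asks for a full 4096 bytes.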


#7

This might be useful, using Mmap::open_path("…) to realize a buffered read of a huge file: