How to effectively copy part(s) of a file

Hello,
I need to export a binary file (or parts of it) to another file.
The API I want to implement looks like this:

pub fn export_as_file(
    in_file_path: PathBuf,
    destination_path: PathBuf,
    parts: Vec<FilePart>,
) -> Result<(), Error> {

where parts describes which regions of in_file need to be exported:

struct FilePart {
    offset: usize,
    length: usize,
}

I need to export sections of a very large file (>10GB), so I can't read the whole file into memory.
Copying the whole file can be done with std::io::copy,
which takes a Read argument as source. So I could pass in a BufReader and seek to an offset. But I don't know how to cap the number of bytes that can be read from the BufReader.
Any good ideas for this?

It seems like if you're given line offsets you'll have to read the entire file up to the last line you want to copy in any case in order to count lines. Why not just write it as you read it, bufferful by bufferful?

Or are the indices not actually line numbers?

It's a binary file that contains messages, which need to be parsed at least partially to determine the message boundaries.
So offset_message[n] is the offset of the binary data for message n. But anyway, the part I want to save is arbitrary. It also needs to work for files that I do not parse myself.

Sorry, I had a copy-paste error when first adding the API description... corrected.

Why use a BufReader at all? You could just loop with a buffer of your own, reading only as much as you want. Copying the source of std::io::copy can be a good starting point for this.
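A minimal sketch of that suggestion, assuming a hypothetical helper named copy_n (not part of std) and an arbitrary 8 KiB buffer: it follows the same read/write loop as std::io::copy, but never asks the reader for more than the remaining byte budget.

```rust
use std::io::{self, Read, Write};

/// Copies at most `limit` bytes from `reader` to `writer` through a fixed-size
/// buffer. Returns the number of bytes actually copied, which is smaller than
/// `limit` if the reader reaches EOF first.
/// `copy_n` is a hypothetical name used for illustration only.
fn copy_n<R: Read + ?Sized, W: Write + ?Sized>(
    reader: &mut R,
    writer: &mut W,
    limit: u64,
) -> io::Result<u64> {
    let mut buf = [0u8; 8 * 1024]; // buffer size is an arbitrary choice
    let mut remaining = limit;
    while remaining > 0 {
        // Never request more than what is still allowed.
        let want = remaining.min(buf.len() as u64) as usize;
        let got = match reader.read(&mut buf[..want]) {
            Ok(0) => break, // EOF before the limit was reached
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..got])?;
        remaining -= got as u64;
    }
    Ok(limit - remaining)
}
```

Because the source only needs to implement Read, the same function works on a File that has been seeked to the start of a part.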

FWIW, I would also suggest u64 for your file offsets and lengths, especially since you're dealing with sizes too large for 32-bit. But maybe you only care about 64-bit targets anyway...


Are you looking for the Seek trait?

If you accept something which is Read + Seek you'll be able to seek to the correct location in the file then copy length bytes to the output file, repeating for each FilePart in parts.

yes you are right I accept something that is Read + Seek so I can seek to a position...
the std::io::copy API looks like this:

pub fn copy<R: ?Sized, W: ?Sized>(reader: &mut R, writer: &mut W) -> Result<u64>
where
    R: Read,
    W: Write,

but doesn't provide a way to limit the number of bytes copied

I got a good idea from my colleague @flxo: write a custom implementation of Read whose read method restricts the maximum number of bytes that can be read.
Here is a draft of that:

// Wraps a reader and lets at most `n` more bytes be read through it.
struct ChunkReader<'a, T: std::io::Read> {
    n: usize,
    read: &'a mut T,
}
impl<'a, T: std::io::Read> ChunkReader<'a, T> {
    fn new(read: &'a mut T, n: usize) -> Self {
        ChunkReader { n, read }
    }
}
impl<'a, T: std::io::Read> std::io::Read for ChunkReader<'a, T> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        // Budget exhausted: report EOF so that std::io::copy stops.
        if self.n == 0 {
            return Ok(0);
        }
        // Never hand the inner reader more space than the remaining budget.
        let max = std::cmp::min(buf.len(), self.n);
        let read_bytes = self.read.read(&mut buf[..max])?;
        self.n -= read_bytes;
        Ok(read_bytes)
    }
}

With that it's possible to use std::io::copy like this:

      let mut reader = std::io::BufReader::new(f);
      let mut out_writer = BufWriter::new(out_file);

      for part in partitioner.get_parts() {
          reader.seek(std::io::SeekFrom::Start(part.offset as u64))?;
          let mut chunk_reader = ChunkReader::new(&mut reader, part.length);
          std::io::copy(&mut chunk_reader, &mut out_writer)?;
          out_writer.flush()?;
      }

You've re-implemented the Read::take() :smiley:
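For readers who have not used it, here is a small sketch of what Read::take does. The helper name read_prefix is made up for this illustration. The adapter reports EOF after the given number of bytes, and going through by_ref() keeps the underlying reader usable afterwards.

```rust
use std::io::{self, Read};

/// Reads up to `limit` bytes from `reader` into a new vector.
/// `read_prefix` is a hypothetical helper, not a std function.
fn read_prefix<R: Read>(reader: &mut R, limit: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // `by_ref()` borrows the reader, so `take` consumes only the borrow and
    // the caller can keep reading from where the limited read stopped.
    reader.by_ref().take(limit).read_to_end(&mut out)?;
    Ok(out)
}
```

Calling it twice on the same reader returns consecutive chunks, since the second call picks up at the position where the first one hit its limit.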


OMG you are right! thanks for pointing that out!!
that somewhat simplifies my code :wink:

        let mut reader = &mut std::io::BufReader::new(f);
        let out_file = std::fs::File::create(destination_path)?;
        let mut out_writer = BufWriter::new(out_file);

        for part in partitioner.get_parts() {
            reader.seek(std::io::SeekFrom::Start(part.offset as u64))?;
            let mut take = reader.take(part.length as u64);
            std::io::copy(&mut take, &mut out_writer)?;
            reader = take.into_inner();
            out_writer.flush()?;
        }
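Putting the thread's suggestions together, a self-contained version might look like the sketch below. Assumptions that are not in the original posts: the function name export_parts, the generic Read + Seek / Write parameters in place of file paths, u64 fields in FilePart, and the choice to treat a part that runs past the end of the input as an error. Using by_ref() also avoids having to reassign the reader via into_inner().

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// One region of the input, with `u64` fields as suggested earlier in the thread.
struct FilePart {
    offset: u64,
    length: u64,
}

/// Copies each part from `input` to `output`, in the order given, and returns
/// the total number of bytes written. `export_parts` is a hypothetical name.
fn export_parts<R: Read + Seek, W: Write>(
    input: &mut R,
    output: &mut W,
    parts: &[FilePart],
) -> io::Result<u64> {
    let mut total = 0;
    for part in parts {
        input.seek(SeekFrom::Start(part.offset))?;
        // `take` caps how much `io::copy` can pull from the borrowed reader.
        let copied = io::copy(&mut input.by_ref().take(part.length), output)?;
        if copied < part.length {
            // Assumption: a short part is treated as an error, not silently accepted.
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "part extends past end of input",
            ));
        }
        total += copied;
    }
    output.flush()?;
    Ok(total)
}
```

For real files, input would typically be a BufReader around the opened source file and output a BufWriter around the created destination file, as in the snippet above.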
