How to efficiently re-use a `Vec<u8>` for multiple `Read::read`?

Hi,

We have a sequence of long (sometimes 1 MB) reads from `Read::read_exact` whose data is decoded/deserialized into an owned structure, something like:

use std::io::{Read, Error};

struct A;

fn deserialize(data: &[u8]) -> Result<A, Error> {
    todo!()
}

fn read_many<'a, R: Read + 'a>(
    mut reader: R,
) -> impl Iterator<Item = Result<A, Error>> + 'a {
    (0..10).map(move |_| {
        let mut bytes = [0; 4];
        reader.read_exact(&mut bytes)?;
        let length: u32 = u32::from_le_bytes(bytes);

        let mut data = vec![0; length as usize];
        reader.read_exact(&mut data)?;
        deserialize(&data)
    })
}

We would like to re-use the allocated region as much as possible. Our current approach is to use:

/// A small wrapper around [`Vec<u8>`] that allows us to reuse memory once it is initialized.
/// This may improve performance when using the [`Read`] trait.
#[derive(Clone, Default)]
pub struct ReadBuffer {
    data: Vec<u8>,
    // length to be read or is read
    length: usize,
}

impl ReadBuffer {
    /// Set the minimal length of the [`ReadBuffer`]. Contrary to the
    /// method on `Vec`, this is safe because this function guarantees that
    /// the underlying data is always initialized.
    pub fn set_len(&mut self, length: usize) {
        if length > self.data.capacity() {
            // exponential growing strategy
            self.data = vec![0; length * 2];
        } else if length > self.data.len() {
            self.data.resize(length, 0);
        }
        self.length = length;
    }
}

impl AsRef<[u8]> for ReadBuffer {
    fn as_ref(&self) -> &[u8] {
        &self.data[..self.length]
    }
}

impl AsMut<[u8]> for ReadBuffer {
    fn as_mut(&mut self) -> &mut [u8] {
        &mut self.data[..self.length]
    }
}

fn read_many1<'a, R: Read + 'a>(mut reader: R) -> impl Iterator<Item = Result<A, Error>> + 'a {
    let mut data: ReadBuffer = Default::default();
    (0..10).map(move |_| {
        let mut bytes = [0; 4];
        reader.read_exact(&mut bytes)?;
        let length = u32::from_le_bytes(bytes) as usize;

        data.set_len(length);
        reader.read_exact(data.as_mut())?;
        deserialize(data.as_ref())
    })
}

Does anyone use something like this, or know an even better way of doing it?

I've done something pretty similar before, but with just a plain ol' `Vec`:

fn read_many<'a, R: Read + 'a>(mut reader: R) -> impl Iterator<Item = Result<A, Error>> + 'a {
    let mut buffer = Vec::new();
    (0..10).map(move |_| {
        let mut bytes = [0; 4];
        reader.read_exact(&mut bytes)?;
        let length = u32::from_le_bytes(bytes) as usize;

        // Ensure the buffer's initialized length is large enough to hold
        // the payload. (Note: checking `capacity` here instead of `len`
        // would be a bug; the slice below would panic whenever the
        // allocator over-allocates so that capacity > len.)
        if length > buffer.len() {
            buffer.resize(length, 0);
        }

        let window = &mut buffer[..length];
        reader.read_exact(window)?;
        deserialize(window)
    })
}

But I also think that making a separate data type to handle the buffering is fine : v)

In this I'm just letting the vector use its own allocation strategy with `resize` and the minimum capacity it needs, but you could be fancy and pass in your own computed value to 'force' an exponential growth strategy.
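For instance, 'forcing' exponential growth could be sketched like this (the helper name `grow_to` and the power-of-two policy are my own choices, not anything from the thread):

```rust
// Hypothetical helper: grow `buffer` to at least `length` initialized
// bytes, rounding the target up to the next power of two so that a
// slowly increasing sequence of lengths triggers O(log n) reallocations
// instead of O(n).
fn grow_to(buffer: &mut Vec<u8>, length: usize) {
    if length > buffer.len() {
        buffer.resize(length.next_power_of_two(), 0);
    }
}

fn main() {
    let mut buffer = Vec::new();
    grow_to(&mut buffer, 1000);
    assert_eq!(buffer.len(), 1024); // rounded up to 2^10

    let ptr = buffer.as_ptr();
    grow_to(&mut buffer, 1024); // already fits: no resize, no realloc
    assert_eq!(buffer.as_ptr(), ptr);
}
```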

Addendum: Also, if you're trying to squeeze efficiency out of your reader, it's probably worth looking at `BufReader` as well. It maintains its own buffer, doing one large read from the source at a time and then letting you read incrementally over the buffered data. Slightly different from what you're trying to do, but definitely adjacent!
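A minimal sketch of that adjacent approach (the 64 KiB capacity is an arbitrary choice, and `Cursor` stands in for whatever the real source is):

```rust
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    let data: Vec<u8> = (0u8..=255).collect();

    // BufReader pulls large chunks from the underlying reader into its
    // internal buffer; small reads (like a 4-byte length prefix) are
    // then served from memory instead of hitting the source each time.
    let mut reader = BufReader::with_capacity(64 * 1024, std::io::Cursor::new(data));

    let mut prefix = [0u8; 4];
    reader.read_exact(&mut prefix)?;
    assert_eq!(prefix, [0, 1, 2, 3]);
    Ok(())
}
```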

Is your custom growth strategy really worth it? You could just use a Vec.

You can avoid re-initializing everything but the newly exposed parts of your buffer by never shrinking the length and passing around slices:

fn read_many2<R: Read>(mut reader: R) -> impl Iterator<Item = Result<A, Error>> {
    let mut data = Vec::new(); // vec![0; reasonable_expectation];
    (0..10).map(move |_| {
        let mut bytes = [0; 4];
        reader.read_exact(&mut bytes)?;
        let length = u32::from_le_bytes(bytes) as usize;

        // Never shorten the known-initialized length
        // Maybe worth it to spell this out in an `if` to possibly
        // avoid a call to `resize`
        let max = data.len().max(length);
        data.resize(max, 0);

        // Check that these bound checks are eliminated or throw in an
        // assert! maybe
        let this = &mut data[..length];
        reader.read_exact(this)?;
        deserialize(this)
    })
}

In our benches it shows a factor of 10x:

read1 2^10              time:   [4.7600 us 4.8518 us 4.9582 us]
read2 2^10              time:   [4.4994 ms 4.5141 ms 4.5288 ms]
read1 2^12              time:   [4.3802 us 4.4795 us 4.6010 us] 
read2 2^12              time:   [4.4811 ms 4.4940 ms 4.5068 ms]
read1 2^14              time:   [5.2996 us 5.3199 us 5.3390 us]                        
read2 2^14              time:   [4.5012 ms 4.5135 ms 4.5258 ms]

use criterion::{criterion_group, criterion_main, Criterion};

use bench_read::{read_many1, read_many2};

fn add_benchmark(c: &mut Criterion) {
    (10..=20).step_by(2).for_each(|log2_size| {
        let size = 2usize.pow(log2_size);

        let mut data = vec![];
        for i in 0..10 {
            data.extend_from_slice(&((i + size) as u32).to_le_bytes());
            data.extend(std::iter::repeat((i % 255) as u8).take(size));
        }

        c.bench_function(&format!("read1 2^{}", log2_size), |b| {
            b.iter(|| {
                let reader = std::io::Cursor::new(&data);
                let _ = read_many1(reader).count();
            })
        });

        c.bench_function(&format!("read2 2^{}", log2_size), |b| {
            b.iter(|| {
                let reader = std::io::Cursor::new(&data);
                let _ = read_many2(reader).count();
            })
        });
    });
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);

imo this makes sense - the issue is that `resize` may trigger a realloc, which causes everything in the buffer to be copied to the new region.
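A small demonstration of that point: `resize` within the existing capacity reuses the same allocation (the pointer stays put), while growing past the capacity forces a reallocation and copy:

```rust
fn main() {
    let mut v = vec![0u8; 1024];
    let ptr = v.as_ptr();

    // Shrinking and re-growing within the existing capacity never
    // reallocates, so the buffer's address is stable.
    v.truncate(16);
    v.resize(1024, 0);
    assert_eq!(v.as_ptr(), ptr);

    // Growing past the capacity reallocates and copies the contents.
    // (Whether the address actually changes is up to the allocator,
    // so we only check the new length here.)
    v.resize(4096, 0);
    assert_eq!(v.len(), 4096);
}
```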

Uh, that’s not a factor of 10x, it’s a factor of 1000x, which is a pretty clear indication that the benchmarks are not measuring the same thing.


Instead of `vec.resize(len, 0); reader.read_exact(&mut vec[..])?` you can use:

vec.clear();
vec.try_reserve(len)?;
reader.by_ref().take(len as u64).read_to_end(&mut vec)?;
assert_eq!(vec.len(), len);

This avoids initializing memory. Also, `try_reserve` might help avoid killing your program if the length is too large.
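Spelled out as a full iterator in the shape of the earlier examples, the suggestion might look like the sketch below (the name `read_many3`, the placeholder `deserialize`, and the error mapping are my assumptions; note that `take` wants a `u64` and `try_reserve`'s error has to be converted into an `io::Error` by hand):

```rust
use std::io::{Error, ErrorKind, Read};

struct A;

// Placeholder standing in for the real deserializer.
fn deserialize(_data: &[u8]) -> Result<A, Error> {
    Ok(A)
}

fn read_many3<'a, R: Read + 'a>(mut reader: R) -> impl Iterator<Item = Result<A, Error>> + 'a {
    let mut data = Vec::new();
    (0..10).map(move |_| {
        let mut bytes = [0; 4];
        reader.read_exact(&mut bytes)?;
        let length = u32::from_le_bytes(bytes) as usize;

        data.clear();
        // Fallible allocation: a bogus/huge length prefix becomes an
        // error instead of an abort.
        data.try_reserve(length)
            .map_err(|e| Error::new(ErrorKind::OutOfMemory, e))?;
        // `read_to_end` appends into the spare capacity without
        // zero-initializing it first.
        let read = reader.by_ref().take(length as u64).read_to_end(&mut data)?;
        if read < length {
            return Err(ErrorKind::UnexpectedEof.into());
        }
        deserialize(&data)
    })
}

fn main() {
    // One length-prefixed frame of 3 bytes.
    let mut input = Vec::new();
    input.extend_from_slice(&3u32.to_le_bytes());
    input.extend_from_slice(&[10, 20, 30]);

    let mut frames = read_many3(std::io::Cursor::new(input));
    assert!(frames.next().unwrap().is_ok());
    // The next iteration hits EOF while reading the length prefix.
    assert!(frames.next().unwrap().is_err());
}
```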


Thank you all!

There was a bug on the bench itself. Fixed it now. Here are the results

  • read0 is always alloc
  • read1 is ReadBuffer
  • read2 is @quinedot's suggestion
  • read3 is @kornel's suggestion
read0 2^10              time:   [722.39 ns 724.16 ns 726.29 ns]
read1 2^10              time:   [397.72 ns 398.60 ns 399.64 ns]
read2 2^10              time:   [549.24 ns 550.45 ns 551.62 ns]
read3 2^10              time:   [594.33 ns 596.17 ns 598.21 ns]
read0 2^12              time:   [1.7400 us 1.7454 us 1.7532 us]
read1 2^12              time:   [993.64 ns 996.60 ns 999.65 ns]
read2 2^12              time:   [1.0136 us 1.0150 us 1.0162 us]
read3 2^12              time:   [1.0151 us 1.0183 us 1.0220 us]
read0 2^14              time:   [4.8709 us 4.8896 us 4.9090 us]
read1 2^14              time:   [4.3353 us 4.3417 us 4.3482 us]
read2 2^14              time:   [4.2929 us 4.2997 us 4.3069 us]
read3 2^14              time:   [4.3443 us 4.3579 us 4.3706 us]
read0 2^16              time:   [23.306 us 23.356 us 23.410 us]
read1 2^16              time:   [16.682 us 16.707 us 16.733 us]
read2 2^16              time:   [15.837 us 15.864 us 15.893 us]
read3 2^16              time:   [15.376 us 15.396 us 15.416 us]
read0 2^18              time:   [94.371 us 94.617 us 94.852 us]
read1 2^18              time:   [72.739 us 72.855 us 72.979 us]
read2 2^18              time:   [69.133 us 69.248 us 69.362 us]
read3 2^18              time:   [66.100 us 66.237 us 66.359 us]
read0 2^20              time:   [471.08 us 472.91 us 474.84 us]
read1 2^20              time:   [344.30 us 345.64 us 347.24 us]
read2 2^20              time:   [313.31 us 317.02 us 323.00 us]
read3 2^20              time:   [296.12 us 297.16 us 298.27 us]

I have published the full code at GitHub - DataEngineeringLabs/bench-read-rs (bench strategies to reuse a `Vec` when calling `Read` multiple times) for future reference (and in case someone wants to further play with it).


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.