Slow Postgres COPY using Buffer

Hi there!!

I'm using the postgres crate to COPY some data from one table into a buffer in memory. To do so, I get the CopyOutReader returned by the copy_out method of the postgres Client:

let mut reader = client.copy_out(copy_out_query.as_str()).unwrap();

And then I read the data from the DB into memory in a loop, using the reader's BufRead fill_buf implementation:

loop {
    let buf = reader.fill_buf().unwrap();
    let length = buf.len();

    // If we've reached EOF, we're done
    if length == 0 {
        break;
    }

    // Do some stuff with &buf here

    // ensure the bytes we worked with aren't returned again later
    reader.consume(length);
}

My problem is that the internal buffer used by default is really small (around 150 bytes), so it's taking too long to get my data out of the DB.

I need to do it this way because the amount of data is potentially very big, so reading it all at once with, for instance, read_to_end() isn't an option. It would be very fast, but it would lead to an out-of-memory error.

I've tried just using my own buffer (a Vec<u8>) with the read method instead, but it looks like internally it keeps calling fill_buf, so it's still really slow.

Could anyone tell me a way to increase that buffer size, or any other way to improve the speed of this COPY operation?

BufReader's default capacity is 8 KiB, I think (edit: after checking, the postgres crate uses bytes::Bytes as its internal buffer); it has a with_capacity method to change that.

You can wrap CopyOutReader in your own BufReader.
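Something along these lines, for instance (untested sketch, reusing the client and copy_out_query from your post):

    use std::io::{BufRead, BufReader};

    let reader = client.copy_out(copy_out_query.as_str()).unwrap();
    // Wrap the CopyOutReader in a std BufReader with a bigger buffer (64 KiB here)
    let mut buf_reader = BufReader::with_capacity(64 * 1024, reader);
    // ...then call buf_reader.fill_buf() / buf_reader.consume() in your loop as before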

Hi erelde, thanks a lot for replying.

I've tried what you suggested, doing this:

    let reader = source_client.copy_out(copy_out_query.as_str()).unwrap();
    
    let mut buf_reader = BufReader::with_capacity(10000, reader);

And then checking how many bytes were read:

        let buf = buf_reader.fill_buf().unwrap();
        let length = buf.len();
        println!("Read {} bytes", length);

But it seems like it's still using a very small buffer size:

Read 128 bytes

It looks like in the end it's still going through CopyOutReader's fill_buf anyway?

Postgres returns each row as a separate chunk of data, which is probably where the 150 byte number is coming from.

Yes, you're right. I've changed the table I was using for a bigger one, and now that number of bytes is also bigger (from 150 to 1300).

So I guess the best approach will be to fetch rows one by one, accumulating as many as possible in my own byte vector.

Yeah, adding a buffering layer of your own seems like the right approach.
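For instance, something along these lines (untested sketch, same client and query as above) that flushes on a byte threshold instead of a fixed row count, so memory stays bounded even if row sizes vary a lot:

    use std::io::BufRead;

    let mut reader = client.copy_out(copy_out_query.as_str()).unwrap();

    // Accumulate row chunks until roughly 8 MiB has piled up, then process the batch
    const FLUSH_THRESHOLD: usize = 8 * 1024 * 1024;
    let mut buffer: Vec<u8> = Vec::with_capacity(FLUSH_THRESHOLD);

    loop {
        let chunk = reader.fill_buf().unwrap();
        if chunk.is_empty() {
            // EOF: process whatever is still in the buffer
            if !buffer.is_empty() {
                // do some stuff with &buffer
            }
            break;
        }

        let len = chunk.len();
        buffer.extend_from_slice(chunk);
        reader.consume(len);

        if buffer.len() >= FLUSH_THRESHOLD {
            // do some stuff with &buffer
            buffer.clear();
        }
    }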

I wonder if the Read interface is the wrong thing to use there. tokio-postgres uses a Stream<Result<Bytes, Error>> which more directly represents the protocol layer. Maybe the postgres version should be an Iterator<Result<Bytes, Error>> similarly...
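For reference, consuming the copy that way with tokio-postgres would look roughly like this (untested sketch, assuming an async tokio_postgres::Client, the same query string, and TryStreamExt/pin_mut from the futures crates):

    use futures_util::TryStreamExt;

    // Inside an async fn, with `client` being a tokio_postgres::Client
    let stream = client.copy_out(copy_out_query.as_str()).await.unwrap();
    futures_util::pin_mut!(stream);

    // Each item is one Bytes chunk straight off the protocol layer
    while let Some(chunk) = stream.try_next().await.unwrap() {
        // chunk is a bytes::Bytes; accumulate or process it here
    }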

OK, applying what we said worked well for my use case, so I'm sharing it here in case it helps somebody else in the future:

    use std::io::BufRead;

    let mut reader = client.copy_out(copy_out_query.as_str()).unwrap();

    let mut buffer: Vec<u8> = Vec::new();
    let mut total_rows = 0;
    // Keep reading from the source until the reader is empty
    loop {
        // Each chunk returned by fill_buf is one row
        let row = reader.fill_buf().unwrap();
        let row_bytes = row.len();

        // If we've reached EOF, stop now, processing the remaining rows
        if row_bytes == 0 {
            if !buffer.is_empty() {
                // Do some stuff with the remaining rows
            }
            break;
        }

        buffer.extend_from_slice(row);
        total_rows += 1;

        if total_rows % 50_000 == 0 {
            // Do some stuff with the last 50K rows
            buffer.clear(); // reuse the allocation instead of allocating a new Vec
        }

        // Ensure the bytes we worked with aren't returned again later
        reader.consume(row_bytes);
    }

Thanks a lot to both of you for your help, you really made my day!
