How to read() from TcpStream and *append* to Vec<u8> – efficiently

So, I would like to read() some data from a TcpStream and append it a Vec<u8>.

Of course, I could simply read() into a separate [u8] first, and then use extend_from_slice() to append the data to the vector. But I think this would cause an unnecessary copy of the data! :angry:

It would be way better (more efficient) to read into the vector directly.

One option would be to grow the Vec<u8>, by using resize(), in order to have some "extra" space at the end, which we then can pass to read(), as an &mut [u8] slice. This works okay.

Unfortunately, the resize() function must initialize the "extra" space of the vector to a specific value, even though we are going to overwrite that area immediately. This seems wasteful!

Another option would be to use reserve() and spare_capacity_mut(). Unfortunately, the latter gives me an MaybeUninit<u8>, whereas TcpStream::read() needs an &mut [u8] slice...

Is there any way to read() directly into a MaybeUninit<u8> ??? :thinking:


Best workaround to grow the vector without initializing the "extra" space that I came up with is:

fn grow_buffer(buffer: &mut Vec<u8>, new_length: usize) {
    if new_length > buffer.capacity() {
        buffer.reserve(new_length - buffer.len());
    }
    unsafe { buffer.set_len(new_length); }
}

But it requires unsafe, which I would like to avoid :fearful:

(even though I think that, in this specific case, above code should be "safe")

1 Like

The docs for Read::read specifically call this case out, this is potential UB no matter how much unsafe you write:

callers of this method must not assume any guarantees about how the implementation uses buf . The trait is safe to implement, so it is possible that the code that’s supposed to write to the buffer might also read from it. It is your responsibility to make sure that buf is initialized before calling read . Calling read with an uninitialized buf (of the kind one obtains via MaybeUninit<T>) is not safe, and can lead to undefined behavior.

If you did want to try this, it isn't possible without unsafe (on stable) because you are necessarily relying on implementation details of TcpStream::read.

On nightly you would use read_buf

2 Likes

I think you do not need to worry about the initialization. It is likely neglectable.

You can try the nightly BorrowedBuf and BorrowedCursor and compare the performance.

Edit: Something like this

pub fn tcp_write_curs(vec: &mut Vec<u8>, tcp: &mut TcpStream, additional: usize) -> Result<()> {
    vec.reserve(additional);
    let mut buf: BorrowedBuf = vec.spare_capacity_mut().into();

    let mut cursor = buf.unfilled();
    tcp.read_buf(cursor)?;

    let new_len = vec.len() + buf.len();

    unsafe {
        // TODO: get rid of unsafe
        vec.set_len(new_len);
    }

    Ok(())
}

1 Like

If you're using Tokio, then this can be done via the AsyncReadExt::read_buf method.

If you're using the standard library blocking IO, then it's more difficult.
I recently shared this with someone:

use std::fs::File;
use std::io::{Read, Result};
use bytes::BufMut;

fn read_into_spare_cap<B: BufMut>(mut file: &File, buf: &mut B) -> Result<usize> {
    let len = {
        let chunk = buf.chunk_mut();
        let len = chunk.len();
        let ptr = chunk.as_mut_ptr();
        // SAFETY: The file is not going to read from the unintialized data.
        let slice = unsafe {
            std::slice::from_raw_parts_mut(ptr, len)
        };
        Read::read(&mut file, slice)?
    };
    
    // SAFETY: The file just initialized that many bytes.
    unsafe {
        buf.advance_mut(len);
    }
    Ok(len)
}

It's written for File but will also work with TcpStream. The snippet is only correct because we happen to know that File/TcpStream is implemented in a way where it wont read the uninitialized memory.

1 Like

Interesting :face_with_spiral_eyes:

But I don't see how it can make any difference whether we initialize the destination buffer before calling TcpStream::read(), or whether we just leave it in an "uninitialized " state.

First of all, why should the read() function ever read the initial content of the destination buffer?

Furthermore, even if the read() function actually does read from the destination buffer, for wherever reason, then it clearly must not make any assumptions about the buffer's initial contents! I mean, there is no requirement that we initialize the buffer to a specific value (e.g. 0x00), but only that we initialize it to some value. In other words, we are free to initialize the buffer to any arbitrary value of our liking. Effectively, that is no different from just leaving the buffer "uninitialized" – isn't it ???

This has to do with who gets the blame when unsafe code gets involved. Uninitialized data is special in various ways. I wont get into the details; you can find many old threads on this. Anyway, the point is that if your application ends up doing something that triggers undefined behavior and only has one unsafe block, then that unsafe block is at fault — no matter what. Since you can safely implement Read in a way that reads from the provided slice (even if you shouldn't!), this means that the unsafe block is ultimately at fault for the undefined behavior if it happens.

2 Likes

Yeah, I could understand this, if the read() function would require the caller to initialize the buffer to all zero. But it doesn't! It only requires the buffer to be initialized – to some totally arbitrary values of our liking. And that is, as far as the read() function is able to tell, technically no different from "uninitialized" data. After all, the definition of an "uninitialized" buffer is that it contains arbitrary byte values...

(I mean, there is no "magic" marker to tell whether a certain memory address has been initialized or not. The "uninitialized" memory address contains a value that is no different from any value at an initialized memory address. It's just that the value at the "uninitialized" memory address is, well, arbitrary)

Thanks! Unfortunately, I cannot use "unstable" features in my crate :thinking:

Ununitialized memory is fundamentally different from an arbitrary byte value as far as the compiler is concerned. See

6 Likes

You could use Read::take() followed by Read::read_to_end() for this:

use std::{
    io::{self, prelude::*},
    net::TcpStream,
};

pub fn read_to_vec(stream: &TcpStream, dest: &mut Vec<u8>, max_len: usize) -> io::Result<usize> {
    let max_len: u64 = max_len.try_into().unwrap_or(u64::MAX);
    stream.take(max_len).read_to_end(dest)
}

However, TcpStream in particular does not currently support writing directly to an uninitialized buffer: it uses the default read_buf() implementation which zeroes out the buffer first. There's no way around that without using platform-specific functions, but it seems like something that shouldn't be too difficult to add to the standard library. If it is added, then .take().read_to_end() will automatically switch to the non-zeroing behavior.

Also, this takes a specific length limit, instead of just performing a single read(). I don't know any way around that, since using a custom wrapper type would prevent a specialized read_buf() implementation from working.

1 Like

This is not what it means for a region of memory to be uninitialized.

When we say 'region of memory is uninitialized' we are saying precisely nothing about the bit pattern in that region of memory. To put it differently, making any statement about the bit pattern in a region of memory ("arbitrary", "zero", equal to 0x01FAC6...,any statement at all) already presupposes that the region is in fact initialized. By saying 'unitialized' we preclude any such statements.

This might seem overly pedantic and not very useful. So alternatively you can think of: uninitialized memory is that memory which the compiler knows will never be read, and can optimize accordingly. Of course, initialized-ness is a temporal property - memory can change from uninitialized to initialized and v.v.

2 Likes

Okay. I understand that the compiler may actually keep track of "uninitialized" memory, in order to do some optimizations that otherwise wouldn't be possible. For example, the result of a comparison with some "uninitialized" data is undefined by definition, so the compiler can simply throw away such comparison and assume the result of the comparison to be whatever the compiler likes.

Still, I don't see how the read() function can ever do anything reasonable with the initial contents of the buffer. If the buffer was in "uninitialized" state, then reading its contents and doing anything based on those contents will result in undefined behavior. But: If the buffer was indeed initialized, to some arbitrary value of the callers choice – again, there is no requirement that the buffer must be initialized to a specific value (e.g. zero) – then reading its contents and doing anything based on those arbitrary values will result in undefined behavior (or better call it "unpredictable" behavior) just as well...

Also, what could be the reason for read() to read the buffer contents before overwriting them ???

Oh, it certainly can't. No correct implementation of Read is ever going to read from the provided slice.

If you're not writing a library that accepts an arbitrary user-provided implementation of Read, then its not a relevant concern for you, because you don't have to write defensively against weird incorrect implementations of Read.

Undefined behavior has the very specific meaning of "unsafe code gone wrong", so calling it "unpredictable behavior" instead is indeed more appropriate if unsafe code is not involved. The standard term for "can do weird things, but it isn't going to trigger undefined behavior" is "unspecified behavior". For an example of this distinction in the wild, see e.g. the documentation for HashMap:

It is a logic error for a key to be modified in such a way that the key’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the map. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified, but will be encapsulated to the HashMap that observed the logic error and not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.

This snippet makes it clear that changing a key's hash while its in the HashMap will result in all sorts of bad stuff, but that it will not result in the unsafe code in the implementation of HashMap doing something invalid.

This is the same point that is being made here. Libraries that accept an arbitrary user-provided Read implementation may not have its unsafe code do something invalid, even if the Read is implemented in a non-sensical incorrect way, as long as the Read impl doesn't use unsafe.

4 Likes

Okay, that makes sense to me :smile:

So, coming back to the original question, then the following technique of pre-allocating some "extra" space in the vector and then reading directly into it should be "fine" for practical purposes:

const MAX_READ_SIZE: usize = 4096;

fn read_append(buffer: &mut Vec<u8>, stream: &mut TcpStream) {
	let old_length = buffer.len();
	grow_buffer_unsafe(buffer, old_length + MAX_READ_SIZE);
	let count = stream.read(&buffer[old_length...]);
	buffer.truncate(old_length + count);
}

fn grow_buffer_unsafe(buffer: &mut Vec<u8>, new_length: usize) {
    if new_length > buffer.capacity() {
        buffer.reserve(new_length - buffer.len());
    }
    unsafe { buffer.set_len(new_length); }
}

It appears that better alternatives, which accomplish the same thing and avoid an explicit unsafe, are only available by resorting to "unstable" (nightly), which I don't want to do, at this point.

1 Like

It's generally considered good practice to call set_len after you have initialized the memory. You can do that like this:

use std::io::Read;
use std::net::TcpStream;
use std::mem::MaybeUninit;

const MAX_READ_SIZE: usize = 4096;

unsafe fn as_initialized(slice: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    let len = slice.len();
    let ptr = slice.as_mut_ptr() as *mut u8;
    std::slice::from_raw_parts_mut(ptr, len)
}

pub fn read_append(buffer: &mut Vec<u8>, stream: &mut TcpStream) -> std::io::Result<usize> {
    buffer.reserve(MAX_READ_SIZE);
    let old_length = buffer.len();
    let spare_cap = unsafe { as_initialized(buffer.spare_capacity_mut()) };
	let count = stream.read(spare_cap)?;
	unsafe {
	    buffer.set_len(old_length + count);
	}
	Ok(count)
}
3 Likes

It makes a difference because you need to pass a reference to it (either in the form of a slice or a vector), and references must always point to valid data. (Validity includes the requirement of being initialized, among others.) The mere existence of a reference-to-uninitialized is already UB in itself, regardless of who does or does not read it.

2 Likes

Note that this rule is still up for debate. Requiring that all references point to valid values is conceptually nice, but it prohibits a lot of de-facto useful patterns, including (among other things) passing an uninitialized buffer to Read::read() when you know it won't be read from before being initialized. So I wouldn't state that assertion quite so confidently.

1 Like

Is there a reason why function like read() was not defined to take a MaybeUninit<T> to begin with?

Well, for one thing, std::io::Read predates std::mem::MaybeUninit by 4 years.

Um, actually.

If the behavior is up to debate, then by (lack of) definition it is undefined.

Pedantry aside, I don't see any value in such commentary. If the behavior is being worked out, then the only correct thing to do is assume it's not allowed until it explicitly is. Doing otherwise makes code prone to silent future breakage and UB, basically the worst thing that can happen w.r.t. soundness.