Is there a way to express that a buffer need not be zeroed?

My understanding is that Rust requires a buffer to be initialised to something, even though the values will never be read. Is that right? It seems a bit wasteful to zero memory that is never going to be read.

Is there a simple way (in safe Rust) to express "it doesn't matter how the buffer is initialised, don't bother to zero it"?

I think not, and I am checking here whether my understanding is correct. Is there perhaps a way with unsafe Rust, and if so, is it actually risky in the simple case of a byte buffer?

It depends on the application, but I'd probably use a type like MaybeUninit<[u8; 1024]> if unsafe makes sense in that context.
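As a minimal sketch of how MaybeUninit can be used without ever creating a reference to uninitialized bytes: the buffer is filled through a raw pointer, and only the bytes that were actually written are exposed afterwards. The helper name copy_through_uninit and the 1024-byte size are assumptions made for illustration, not something from the thread.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: copies up to 1024 bytes of `src` through a stack
/// buffer that is never zeroed, and returns the copied bytes.
fn copy_through_uninit(src: &[u8]) -> Vec<u8> {
    // Nothing is written here; the array contents are uninitialized.
    let mut buf: MaybeUninit<[u8; 1024]> = MaybeUninit::uninit();
    let n = src.len().min(1024);

    // Work through a raw pointer so that no `&[u8]` or `&mut [u8]`
    // covering uninitialized bytes is ever created.
    let base = buf.as_mut_ptr().cast::<u8>();
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), base, n);
        // SAFETY: the first `n` bytes were just written, and only those
        // bytes are exposed as a slice.
        std::slice::from_raw_parts(base, n).to_vec()
    }
}
```

The point of the sketch is the shape of the unsafe code: write first, then expose only the initialized prefix.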

The Read::initializer() method and std::io::Initializer type were created for exactly this purpose, but they've been unstable for a couple of years now and I don't know if people are happy enough with the API to begin the stabilisation process.

std::io will get support for uninitialized buffers when the new ReadBuf type is implemented.

This type is defined in RFC 2930. There's some active implementation work in this pull request.

Thanks! I managed to do that, and the program still worked, by some miracle. It was pretty messy though, and if I did it right it was as much luck as good judgement.

My code:

struct HttpRequestParser <'a>
{
  // buffer: [u8;512],
  buffer: std::mem::MaybeUninit<[u8;512]>,
  stream : &'a TcpStream,
  index: usize, // Into buffer  
  count: usize, // Number of valid bytes in buffer
  base: usize,  // Absolute input position = base+index.
  end_content: usize, // End of content.
  content_length : usize,
  content_type : String,
}

impl <'a> HttpRequestParser <'a>
{
  pub fn new( stream: &'a TcpStream ) -> Self
  {
    Self
    { stream, 
      buffer: std::mem::MaybeUninit::<[u8;512]>::uninit(),
      count:0, index:0, base:0, 
      end_content:usize::MAX,
      content_length : 0,
      content_type : String::new(),
    }
  }

  fn get_byte( &mut self ) -> u8
  {
    if self.base + self.index == self.end_content 
    { 
      self.index = self.index + 1;
      return b' ' 
    }
    if self.index == self.count
    {
      self.base = self.base + self.count;
      unsafe
      {
        self.count = self.stream.read( &mut *self.buffer.as_mut_ptr() as &mut [u8] ).unwrap();
      }
      self.index = 0;
      // println!( "Parse request got {} bytes", self.count );
    }
    unsafe 
    {
      let result = (&*self.buffer.as_ptr() as &[u8])[ self.index ];
      self.index += 1;
      result
    }
  }
...

Keep in mind that this is the pointer equivalent of a MaybeUninit::uninit().assume_init() - i.e. assuming that "uninitialized" is a valid bit pattern for [u8] - and I think the jury is still out on that.

This is Undefined Behavior. You can't make a Rust reference to uninitialized data. The mere existence of such a reference is treated as proof that the data is fully initialized. This is why std::mem::uninitialized() was deprecated.

With the io::Read trait, the only way to write to an uninitialized buffer is by appending to a Vec in read_to_end (you can combine it with take, as in reader.by_ref().take(n).read_to_end(&mut vec), if you need to limit the length).

BTW, the situation is even trickier in a generic context. The read implementation could be "evil" and read from the buffer, or report an invalid length without filling in the buffer (maybe even return a count longer than the whole buffer). Since the read function is marked as "safe", it becomes the responsibility of the unsafe caller to avoid trusting its implementation in ways that could cause unsafety.
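A small sketch of that approach, using Read::by_ref so the reader can still be used afterwards. The function name read_up_to is an assumption made for illustration.

```rust
use std::io::{self, Read};

/// Hypothetical helper: reads at most `limit` bytes from `reader`
/// by appending to a fresh Vec, so the caller never supplies a buffer.
fn read_up_to<R: Read>(reader: &mut R, limit: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // `by_ref` borrows the reader, `take` caps the number of bytes,
    // and `read_to_end` grows the Vec as needed.
    reader.by_ref().take(limit).read_to_end(&mut out)?;
    Ok(out)
}
```

Calling it twice on the same reader continues from where the first call stopped, since the reader is only borrowed.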
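To make the "evil reader" concern concrete, here is a sketch of a reader that reports an impossible length, together with a caller that validates the count before using it. LyingReader and checked_read are hypothetical names for illustration only.

```rust
use std::io::{self, Read};

/// Hypothetical misbehaving reader: writes nothing, yet claims to have
/// read more bytes than the buffer can hold. This is entirely safe code.
struct LyingReader;

impl Read for LyingReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        Ok(buf.len() + 10)
    }
}

/// Reads once and rejects a reported length that exceeds the buffer,
/// instead of trusting it.
fn checked_read<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let n = reader.read(buf)?;
    if n > buf.len() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "reader reported more bytes than the buffer holds",
        ));
    }
    Ok(n)
}
```

With an initialized buffer a bad count would only cause a panic when slicing. In unsafe code that uses the count to mark bytes as initialized, the same bad count would be Undefined Behavior, which is why the check belongs in the caller.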

Unless you're doing some serious zero-copy IO, many times you can just reuse an existing (often fixed-size) buffer to do reads into.

Somewhat related, I believe some OSes optimize allocating all-zeroes pages, so I wonder if there's some way you could leverage that from Rust? Though I'm not sure Rust exposes calloc in any way?
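A sketch of the buffer-reuse idea: the buffer is zeroed once and then reused for every read, so the initialization cost is paid a single time. The function count_byte and the 512-byte size are assumptions for illustration, and the sketch does not retry on ErrorKind::Interrupted.

```rust
use std::io::{self, Read};

/// Hypothetical helper: counts occurrences of `needle` in a stream,
/// reusing one fixed-size buffer for all reads.
fn count_byte<R: Read>(mut reader: R, needle: u8) -> io::Result<usize> {
    let mut buf = [0u8; 512]; // zeroed once, reused for every read
    let mut total = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // end of input
        }
        // Only the first `n` bytes hold data from this read.
        total += buf[..n].iter().filter(|&&b| b == needle).count();
    }
}
```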

vec![0; n] uses calloc internally; see rust-lang/rust pull request #40409, "Specialize Vec::from_elem to use calloc", by mbrubeck.

There is also the std::alloc::alloc_zeroed function.
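For reference, a sketch of both routes to zeroed memory. The helper names zeroed_vec and zeroed_raw_sum are made up for illustration. Whether the allocator actually avoids touching the pages depends on the allocator and the OS.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

/// Safe route: a zero-filled Vec.
fn zeroed_vec(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

/// Raw route: allocates `n` zeroed bytes with `alloc_zeroed`, sums them
/// (expected to be 0), and frees the allocation again.
fn zeroed_raw_sum(n: usize) -> u64 {
    assert!(n > 0, "alloc_zeroed must not be called with a zero-sized layout");
    let layout = Layout::array::<u8>(n).expect("allocation size overflow");
    unsafe {
        let ptr = alloc_zeroed(layout);
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        // SAFETY: `ptr` points to `n` bytes that the allocator zeroed.
        let sum: u64 = std::slice::from_raw_parts(ptr, n)
            .iter()
            .map(|&b| u64::from(b))
            .sum();
        dealloc(ptr, layout);
        sum
    }
}
```

In ordinary code the Vec form is the one to reach for; the raw form is shown only to illustrate what the function looks like in use.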

Note that in many cases, Vec::with_capacity(n) is all the uninitialized buffer you need: it allocates room for n items, without initialization, and explicitly tracks (.len()) how many of the items have been initialized so far. It's even possible to write to the uninitialized part and promise that it's now initialized using unsafe .set_len().

If you want to initialize the items out of order, then you need something else, but for reading bytes, see if you can use Vec<u8> before doing unsafe code “from scratch”.
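A sketch of the "write to the uninitialized part, then set_len" pattern, using Vec::spare_capacity_mut to get at the unused capacity as MaybeUninit<u8> slots. The function name append_via_spare is an assumption for illustration. In real code, extend_from_slice does the same job safely.

```rust
/// Hypothetical helper: appends `src` to `vec` by writing into the
/// Vec's uninitialized spare capacity and then extending its length.
fn append_via_spare(vec: &mut Vec<u8>, src: &[u8]) {
    // Make sure there is room for `src.len()` more bytes.
    vec.reserve(src.len());

    // The spare capacity is exposed as `&mut [MaybeUninit<u8>]`, so no
    // reference to uninitialized `u8` values is created.
    let spare = vec.spare_capacity_mut();
    for (slot, &byte) in spare.iter_mut().zip(src) {
        slot.write(byte);
    }

    let new_len = vec.len() + src.len();
    // SAFETY: `reserve` guaranteed the capacity, and the first
    // `src.len()` spare slots were initialized by the loop above.
    unsafe { vec.set_len(new_len) };
}
```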

To elaborate, the OS optimization is usually: map new pages to a shared all-zeroes page, and copy on write. It's a waste of time to initialize the page yet again in user space, so that's what calloc could avoid, depending on the allocator.

I'm also seeing indications online that LLVM transforms malloc + memset to calloc in some cases, which is cool.

I'm curious what the jemalloc optimizations for calloc are that you mention in the PR?

I don't know of any special optimizations in jemalloc beyond the ones in most system allocators. I just mentioned that I benchmarked it with both jemalloc and alloc_system, since those were the two choices that shipped in the standard library (on nightly) at the time. Jemalloc was the default on most platforms until Rust 1.32.

Yes, it's pretty clear from the documentation that the code I posted above is not correct, even though it seems to work. There is an explicit statement to that effect in the documentation ("not safe", "can lead to undefined behaviour").

I don't really have any idea why though. Overall, it seems the answer to the post question is a clear "No", at the moment at least.

The "why" is that Rust passes this information to LLVM, and LLVM feeds this information to optimization passes, which in turn can make assumptions based on what they've been told. These assumptions can be propagated and mixed in complex ways, and invalid assumptions can lead to generation of invalid code.

UB has this annoying habit of seeming harmless and working fine most of the time. But wrong inputs to optimization can create logical paradoxes within the compiler, and this makes it optimize "impossible" code so badly it may seem like it's doing it out of spite. I'm not sure if this case specifically causes this in practice, but there are examples where touching uninitialized memory makes LLVM break causality.

I was trying to imagine this morning a semi-realistic implementation of Rust where my unsafe/UB code might actually fail. Here goes...

Imagine a machine with parity-checked memory. Memory starts off in an undefined state where the parity bit has not yet been calculated. There are instructions to "initialise" memory with a value (with a correctly computed parity bit), but if you try to read a byte of memory that has not been initialised, the machine stops execution entirely, complaining that a parity error has been detected, and has to be restarted. Or maybe the parity check fails even when you write a byte of memory using the "wrong" instruction (not an "init" variant).

Now also imagine that someone has put tracing into the implementation of Read, so that the buffer (all of it!) is printed to the console before it returns. Or the machine does the parity check when a byte is written. Now, when my program calls Read, the attempt to print (or even write) the buffer will fail and the machine will stop with a "parity error".

Does that help show why although the code appears to work, it might not work on all implementations of Rust?

You can probably simplify that explanation even further.

Imagine a DebugReader type which wraps a normal reader and logs the contents like this:

use std::io::{Error, Read};

struct DebugReader<R>(R);

impl<R: Read> Read for DebugReader<R> {
  fn read(&mut self, buffer: &mut [u8]) -> Result<usize, Error> {
    let bytes_read = self.0.read(buffer)?;
    log::debug!("Read {} bytes into {:?}", bytes_read, buffer);
    Ok(bytes_read)
  }
}

Now imagine it gets used with an uninitialized buffer...

let mut reader = DebugReader(std::io::stdin());

let uninitialized_buffer: MaybeUninit<[u8; 1024]> = MaybeUninit::uninit();
let mut buffer = unsafe { uninitialized_buffer.assume_init() };

let n = reader.read(&mut buffer)?;

do_something_with_data(&buffer[..n]);

It's quite possible that the compiler will see DebugReader's read() method invoking UB (it logs the entire buffer, not just the first bytes_read).

Then, because read() is so trivial, it gets inlined into the caller, and the optimiser may omit the do_something_with_data(&buffer[..n]) call, because that call comes after the UB is triggered and the compiler is allowed to assume UB never happens.
