Is libc::memcpy of uninitialized bytes UB?

#[repr(C)]
#[derive(Debug)]
struct StructWithPadding { a: u8, b: u32 }

fn freeze<T: ?Sized>(x: &T) -> Vec<u8> {
    let mut out = vec![0; std::mem::size_of_val(x)];
    unsafe {
        libc::memcpy(
            out.as_mut_ptr() as *mut libc::c_void,
            x as *const T as *const libc::c_void,
            out.len(),
        );
    }
    out
}

fn main() {
    // [255, 00, 00, 00, 255, 255, 255, 255]
    println!("{:?}", freeze(&StructWithPadding { a: 0xff, b: 0xffff_ffff }));
}

libc::memcpy here is an extern "C" fn. I would expect that the compiler cannot see through it, and therefore, as far as the rust compiler can tell, this program does not invoke undefined behavior. Or does it?


Edit: Actually, this is closer to what I'm really trying to do. Forget the Vec<u8>, and let's leave out what I actually do with the pointer afterwards (because whatever I do with it would require unsafe).

fn freeze<T: ?Sized>(x: &T, out: *mut u8) {
    unsafe {
        libc::memcpy(
            out as *mut libc::c_void,
            x as *const T as *const libc::c_void,
            std::mem::size_of_val(x),
        );
    }
}

(but for the record, the thing I actually do with out afterwards is simply cast it to a &mut T)

1 Like

memcpy is special, and LLVM has an inherent understanding of its semantics.

To prove that LLVM knows that libc::memcpy is special, I entered this in the playground:

pub fn freeze<T: ?Sized>(x: &T, out: *mut u8) {
    unsafe {
        libc::memcpy(
            out as *mut libc::c_void,
            x as *const T as *const libc::c_void,
            std::mem::size_of_val(x),
        );
    }
}

pub fn freeze_i32(x: &i32, out: *mut u8) {
    freeze(x, out)
}

Show LLVM IR and see that no memcpy remains:

; playground::freeze_i32
; Function Attrs: nofree norecurse nounwind nonlazybind uwtable
define void @_ZN10playground10freeze_i3217h40aceaf0f896da25E(i32* noalias nocapture readonly align 4 dereferenceable(4) %x, i8* nocapture %out) unnamed_addr #0 {
start:
  %x.val = load i32, i32* %x, align 4
  %0 = bitcast i8* %out to i32*
  store i32 %x.val, i32* %0, align 1
  ret void
}
2 Likes

Okay, in that case, I'm not sure that I trust my first function. But the second function should be safe, right?

I've also written this alternative formulation in terms of MaybeUninit, but I imagine it won't be as fast: (haven't benchmarked anything)

fn freeze<T: ?Sized>(x: &T, out: *mut u8) {
    let len = std::mem::size_of_val(x);
    let out = unsafe { std::slice::from_raw_parts_mut(out as *mut MaybeUninit<u8>, len) };
    let x = unsafe { std::slice::from_raw_parts(x as *const _ as *const MaybeUninit<u8>, len) };
        
    out.copy_from_slice(x);
}

(also freeze probably isn't the best name for this anymore; It's more like just copy)

Given the following input code:

use libc; // 0.2.66
use std::mem::MaybeUninit;

pub fn freeze1<T: ?Sized>(x: &T, out: *mut u8) {
    unsafe {
        libc::memcpy(
            out as *mut libc::c_void,
            x as *const T as *const libc::c_void,
            std::mem::size_of_val(x),
        );
    }
}

pub fn freeze_i32_1(x: &i32, out: *mut u8) {
    freeze1(x, out)
}

pub fn freeze2<T: ?Sized>(x: &T, out: *mut u8) {
    let len = std::mem::size_of_val(x);
    let out = unsafe { std::slice::from_raw_parts_mut(out as *mut MaybeUninit<u8>, len) };
    let x = unsafe { std::slice::from_raw_parts(x as *const _ as *const MaybeUninit<u8>, len) };
        
    out.copy_from_slice(x);
}

pub fn freeze_i32_2(x: &i32, out: *mut u8) {
    freeze2(x, out)
}

Playground
The following output code is produced by LLVM:

; playground::freeze_i32_1
; Function Attrs: nofree norecurse nounwind nonlazybind uwtable
define void @_ZN10playground12freeze_i32_117hc4e6a52422897ac7E(i32* noalias nocapture readonly align 4 dereferenceable(4) %x, i8* nocapture %out) unnamed_addr #0 {
start:
  %x.val = load i32, i32* %x, align 4
  %0 = bitcast i8* %out to i32*
  store i32 %x.val, i32* %0, align 1
  ret void
}

; playground::freeze_i32_2
; Function Attrs: nofree norecurse nounwind nonlazybind uwtable
define void @_ZN10playground12freeze_i32_217h14aa192a4d1ec1d0E(i32* noalias nocapture readonly align 4 dereferenceable(4) %x, i8* nocapture %out) unnamed_addr #0 {
start:
  %0 = bitcast i8* %out to i32*
  %1 = load i32, i32* %x, align 4, !alias.scope !2
  store i32 %1, i32* %0, align 1, !noalias !2
  ret void
}

Which both compile to the same assembly:

playground::freeze_i32_1:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	retq

playground::freeze_i32_2:
	movl	(%rdi), %eax
	movl	%eax, (%rsi)
	retq

If miri complains, it's definitely UB. Pick miri from the tools in this playground. The type u8 is not allowed to be uninitialized, so even if you're using MaybeUninit while writing, afterwards you have uninitialized data in an u8 array, which is UB.

1 Like

Note that the memcpy itself is not UB, but the println! is. Having uninitialized integers or reference to it is not UB in Rust. We tried to make it UB, but it will break many crates in the wild which is written before the MaybeUninit stabilized. But using uninitialied bytes is always UB.

4 Likes

Hm... Does this mean freeze should be marked unsafe because it's up to the caller to ensure they don't access the bytes that correspond to the padding of x?

Yes. And also it writes to given raw ptr which obviously make it unsafe. Safe fn must be safe for every possible input, including nullptr, a ptr to deallocated buffer, or even 0xDEADBEEF as *mut u8.

EDIT: Didn't see the edit in the original post.

As an aside, if we have cross-language LTO is this a meaningful statement?

1 Like

I think cross-lang LTO is past the point of concern -- the immediate issue is that LLVM will treat memcpy as an intrinsic that it can optimize internally. Once you get to LTO, UB would be up to the semantics of each language translated into LLVM IR, whether flags like noalias and undef are used. I guess that does mean that Rust's undef could propagate into the optimization of the C side though, if that's inlined.

1 Like

I guess even without LTO (and memcpy being a special case) UB in a multi-language process is the union of all languages UB (C++ trashing Rust stack objects are still trashed stack objects). But I'm wondering whether it's safe, for example, for C++ code to create multiple mutable aliases, violating Rust's constraints? Only if those aliases become semantically visible to Rust code?

1 Like

I think those aliases are fine until they're tagged noalias in LLVM IR. Clang should only add that for restrict pointers, and Rust will (someday) add that for &mut T, but Rust raw pointers are fine to alias. So these worlds can already coexist, and you don't have a problem until you translate those pointers to Rust references.

1 Like

The following is sound:

fn freeze<'out, T : 'out> (x: &'_ T, out: &'out mut [MaybeUninit<u8>])
  -> &'out mut T
where
    T : Copy, // to make the `ptr::read()` sound, else the function is unsafe
{ unsafe {
    assert!(out.len() >= mem::size_of_val(x));
    let out: *mut MaybeUninit<T> = 
        out.as_mut_ptr().cast()
    ;
    out.write( <*const T>::cast::<MaybeUninit<T>>(x).read() );
    &mut *out.cast::<T>()
}}

But if out: &'out mut [u8], I am not sure there is a way to do such write in a way that is guaranteed to be sound, since one cannot overwrite initialized bytes with garbage bytes (since these are automatically assume_init()-ed).

Just like looking at assembly isn't a sufficient way to understand that LLVM IR is free from LLVM-UB it is also not sufficient to look at LLVM IR and conclude that it is free from Rust-spec UB. libc's memcpy being special needs to be argued not in terms of LLVM but in terms of the Rust spec.

If you are saying that let x: &u8 = mem::uninitialized(); or let x: u8 = mem::uninitialized(); is not UB then that's wrong. Merely writing that let statement without actually "using" x is instant UB according to the spec.

let x: &u8 = mem::uninitialized(); is insta-UB as it violates validity of the type &u8 - it can't be NULL. But let x: u8 = mem::uninitialized; and even x: &u8 = &mem::uninitialized(); is ok, though unfortunate.

It's mainly because there's too many let mut buf: [u8; 1024] = mem::uninitialized(); in the wild, and even bytes@0.4 which was de facto buffer mgmt crate for async Rust has unsafe fn ByteMut::bytes_mut() which yields &mut [u8] to uninitialized buffer. We can't just make all the tokio@0.1 based production codes UB and let them break on compiler version-up.

Disclaimer: the bytes@0.5, dependency of the async-await-aware tokio@0.2, fixed this issue by making it yields &mut [MaybeUninit<u8>].

2 Likes

Out of curiosity: has the undefined-primitive-integers decision gone through RFC or otherwise been actually decided for the language spec, or is it only the UCG's discussions that you're basing your information here off of?

I think I agree both with that ideally primitive integer types wouldn't allow undefined state (rather, use MaybeUninit<{int}>), and that this probably is an intractable decision to make, primarily because it would mean that mem::uninitialized is basically equivalent to hint::unreachable_unchecked, since any type that isn't a union would be invalid to have uninitialized bytes.

Currently maintained code should definitely migrate to using MaybeUninit, though as even if primitive integers are decided to allow uninitialized states, MaybeUninit is easier to use correctly in the face of potential uninitialized state.

(That said references to undefined values is distinct from undefined values in a place typed at a primitive integer type. The former can theoretically be defined even if the latter isn't, though it would be harder to create.)

1 Like

No, this is incorrect. let x: u8 = mem::uninitialized();, let x: [u8; N] = mem::uninitialized();, etc. are all instant UB without then going on to "use" x because you are assigning 0xUU to a value of type u8 which requires a value 0u8..256u8. Per the reference:

Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation. The following values are invalid (at their respective type):

  • [...]
  • An integer ( i* / u* ), floating point value ( f* ), or raw pointer obtained from uninitialized memory.

-- https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html

and:

Moreover, uninitialized memory is special in that the compiler knows that it does not have a fixed value. This makes it undefined behavior to have uninitialized data in a variable even if that variable has an integer type, which otherwise can hold any fixed bit pattern:

-- https://doc.rust-lang.org/nightly/std/mem/union.MaybeUninit.html#initialization-invariant

Now, we may decide to make this behavior defined (see https://github.com/rust-lang/unsafe-code-guidelines/issues/71 for that discussion) via an RFC, but until such time, this is UB per specification.

2 Likes