Confusion converting pointer access from C to Rust

I've been porting some C code to Rust as an exercise, because someone wrote a C function that's apparently faster than what I use at work.
I ported the C function to Rust, and when I started running tests on the output, I was getting a different result.

I used c2rust to transpile the C code, then carefully debugged my Rust version against the transpiled code, and found where my code goes wrong. I've left out the C code and the full Rust output because I was able to minimise the problem; see below.

/// This matches what the C code does
pub fn cast(bytes: [u8; 32]) -> u32 {
    let x = unsafe { *(&bytes[0] as *const u8 as *mut u32) };
    x.swap_bytes()
    // x
}

/// This doesn't match what the C code does
pub fn cast2(bytes: [u8; 32]) -> u32 {
    let x = bytes[0] as u32;
    x.swap_bytes()
    // x
}

I understand the code to take a u8, cast it to u32 and then return (x as u32).swap_bytes().
What's confusing, however, is that (bytes[i] as u32).swap_bytes() returns a different output from the version that casts the reference to a pointer.

Is there something basic that I am missing?

Here's a repro: Compiler Explorer

Running the code on the playground shows this behaviour: Rust Playground. The code below prints different values despite reading from the same position in the array.

fn main() {
    let bytes = [1u8; 1];
    let b = unsafe {*(&bytes[0] as *const u8 as *mut u32) };
    dbg!(b, b.swap_bytes());
    
    let bytes = [1u8; 32];
    let b = unsafe {*(&bytes[0] as *const u8 as *mut u32) };
    dbg!(b, b.swap_bytes());
}

With the output

[src/main.rs:4] b = 12289
[src/main.rs:4] b.swap_bytes() = 19922944
[src/main.rs:8] b = 16843009
[src/main.rs:8] b.swap_bytes() = 16843009

Without having read much of your question at all yet: when running unsafe code in the playground and observing any kind of “weird” behavior, always try Miri.

It's under TOOLS -> Miri (top right corner). You can also install it locally via rustup (rustup +nightly component add miri) and run it on your own project with cargo +nightly miri run.


For the playground in question, the immediate feedback is: Your code has undefined behavior!

error: Undefined Behavior: dereferencing pointer failed: alloc1768 has size 1, so pointer to 4 bytes starting at offset 0 is out-of-bounds
 --> src/main.rs:3:21
  |
3 |     let b = unsafe {*(&bytes[0] as *const u8 as *mut u32) };
  |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ dereferencing pointer failed: alloc1768 has size 1, so pointer to 4 bytes starting at offset 0 is out-of-bounds
  |
  = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
  = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
  = note: BACKTRACE:
  = note: inside `main` at src/main.rs:3:21: 3:58

I’ll try to look at the actual question next and see if I can add any further useful answers or explanations.


Yeah, there are two things, one is basic, the other is not obvious.

The fundamental problem is that when you cast a pointer to a different type and then dereference it, the dereference reads the pointed-to memory as if it held a value of the new type. So when you have a bunch of u8 values in an array, and you try to read a u32 from the address of the first one, you will read the first 4 bytes.

This is, of course, not the same as reading only the 1st byte and then casting that single byte to a u32. The former actually reads the 4 consecutive bytes at the given address, while the latter simply reads one byte, and casts it to a wider integer type, preserving its value.

So if you e.g. have the bytes 0x01 0x02 0x03 0x04 at the given address, then the pointer-based type punning approach will read 0x04030201 into a u32 (assuming little endian), but indexing the first byte and then casting will simply result in 0x00000001 as a u32.
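To make the difference concrete, here is a minimal safe sketch. The function names `punned` and `widened` are made up for illustration, and `u32::from_le_bytes` stands in for the pointer cast on a little-endian machine:

```rust
/// Combines all four bytes into one u32, with bytes[0] as the least significant byte.
/// This is what the pointer-based read effectively does on a little-endian target.
fn punned(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}

/// Reads only the first byte and widens it, preserving its value.
fn widened(bytes: [u8; 4]) -> u32 {
    bytes[0] as u32
}

fn main() {
    let bytes = [0x01u8, 0x02, 0x03, 0x04];
    // prints "0x04030201 0x00000001"
    println!("{:#010x} {:#010x}", punned(bytes), widened(bytes));
}
```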

The other problem is that in Rust, it is disallowed to access a value through a pointer that doesn't point to the entirety of that value. Thus, reading 4 bytes through the address of only a single byte is UB. Furthermore, the alignments don't match, either: reading a u32 must happen from a 4-byte-aligned address, whereas bytes only have an alignment of 1.

What you should be doing is probably this:

/// This matches what the C code does
pub fn cast(bytes: [u8; 32]) -> u32 {
    let le_bytes: [u8; 4] = bytes[0..4].try_into().unwrap();
    u32::from_le_bytes(le_bytes)
}

The conceptual difference between bytes[0] as u32 and &bytes[0] as … as *const u32 is that the latter will try to read four bytes together as one u32.

There are multiple reasons why the concrete approach taken here is actually undefined behavior, but those aren’t so important for this discussion. (The two main points are that a pointer derived from &bytes[0] is not allowed to read anything beyond the first u8 element, and that casting the pointer like this can result in an unaligned read.)


Regarding safe solutions to reproduce the (presumably) intended behavior: The safe Rust way to interpret four u8s as a single u32 in a machine-dependent manner is via u32::from_ne_bytes.

fn main() {
    let bytes = [1u8; 32];
    let b = u32::from_ne_bytes(bytes[0..4].try_into().unwrap());
    dbg!(b, b.swap_bytes());
}

Thanks @H2CO3 and @steffahn. I've seen this before, but it didn't register in my mind as what's happening.
I've changed my code to get 4 bytes and cast them to u32, and now my output agrees with the C one.
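For illustration only, a fix along those lines might look like the sketch below. It is a reconstruction rather than the actual ported code, and it assumes the C function reads the first four bytes in native byte order and then swaps them, as the minimised `cast` above does:

```rust
/// Safe replacement for the pointer-cast version of `cast`:
/// read the first four bytes in native byte order, then swap them.
pub fn cast(bytes: [u8; 32]) -> u32 {
    let first4 = [bytes[0], bytes[1], bytes[2], bytes[3]];
    u32::from_ne_bytes(first4).swap_bytes()
}

fn main() {
    let mut bytes = [0u8; 32];
    bytes[..4].copy_from_slice(&[0x01, 0x02, 0x03, 0x04]);
    // The native-endian read depends on the target, so the expected value does too.
    let expected: u32 = if cfg!(target_endian = "little") {
        0x0102_0304
    } else {
        0x0403_0201
    };
    assert_eq!(cast(bytes), expected);
}
```

On a little-endian target this gives the same result as u32::from_be_bytes on those four bytes.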

That's true for ordinary reads via a dereference or using std::ptr::read, but note that you can use std::ptr::read_unaligned to read from a pointer that is not necessarily aligned. One doesn't need to manually code alignment fixups, although doing so may be beneficial for performance if you're doing lots of potentially unaligned reads. See this example (run with Tools > Miri). Note that the undefined behaviour happens on the ordinary pointer dereference.
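As a rough sketch of that idea (not the linked example): the helper name `read_u32_at` is made up here, and the pointer is taken from the whole array so that all 32 bytes are within its range.

```rust
use std::ptr;

/// Reads a native-endian u32 starting at byte `offset`, with no alignment requirement.
/// `read_u32_at` is a hypothetical helper, not part of any library.
fn read_u32_at(bytes: &[u8; 32], offset: usize) -> u32 {
    // Keep the 4-byte read inside the array.
    assert!(offset <= bytes.len() - 4);
    // The pointer is derived from the whole array, so it may access all 32 bytes.
    let p: *const u8 = bytes.as_ptr();
    // SAFETY: the assert above keeps the read in bounds, and read_unaligned
    // does not require the address to be 4-byte aligned.
    unsafe { ptr::read_unaligned(p.add(offset).cast::<u32>()) }
}

fn main() {
    let mut bytes = [0u8; 32];
    bytes[1..5].copy_from_slice(&[0x01, 0x02, 0x03, 0x04]);
    // Offset 1 is misaligned for a u32 whenever the array itself starts at an aligned address.
    let x = read_u32_at(&bytes, 1);
    assert_eq!(x, u32::from_ne_bytes([0x01, 0x02, 0x03, 0x04]));
}
```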

That part is tricky and requires more explanation. Having a *const u8, casting it to *const u32 and reading the resulting u32 isn't UB on its own. Rust doesn't have typed memory, so in principle such tricks are entirely legal. However, pointers in Rust carry a hidden property called "provenance", which is quite nebulous and AFAIK isn't properly defined anywhere at all. We know only some parts: you are not allowed to violate pointer provenance, and the pointer's provenance includes a specific range of memory that it is allowed to access. Any access outside of that range via any derived pointer is instant UB.

Consider this example:

let buf = [0u8; 16];
let slice = &buf[8..];

Here, slice: &[u8] is a reference to a half of the original buffer. When you do subslicing, the pointer's allowed memory range shrinks. You cannot use slice to read outside that range. No matter what you do. So this would be UB because you're trying to read outside the pointer's memory range:

let p: *const u8 = (slice as *const [u8]).cast();
// UB: buf[7] lies outside the range that slice (and so p) may access
unsafe { ptr::read(p.offset(-1)) };

In fact, simply creating an invalid pointer via p.offset(-1) is likely already UB, and you would need to use p.wrapping_offset(-1) to delay that UB until the actual access.

Pointer provenance is assigned when the pointer is created. If you cast a safe reference to a pointer, you get the same pointer provenance that the reference had (and also the same mutability and aliasing restrictions). If you use ptr::addr_of! to create a raw pointer without going through any references, provenance is still enforced. By default you get the provenance of the pointed-to value (i.e. ptr::addr_of!(foo) is allowed to access any memory within the variable foo). You can also use ptr::addr_of! together with field accesses and indexing/slicing of raw slices (note that indexing on slices is a primitive compiler builtin, rather than a general overloaded Index trait method call). If you do, you get the provenance of the specified subslice: ptr::addr_of!(buf[n]) can access only memory within the n-th element of buf, while ptr::addr_of!(buf[a..b]) is restricted to the subslice buf[a..b].

Simple pointer casts and offsets do not change pointer provenance. This example is legal (again, run with Miri):

let buf = [0u8; 16];
let p: *const u8 = ptr::addr_of!(buf).cast();
unsafe {
    ptr::read(p.offset(3));
    // buf is only guaranteed 1-byte alignment, so read the u32 unaligned
    ptr::read_unaligned(p.offset(4).cast::<u32>());
}

But this one is not:

let buf = [0u8; 16];
let p: *const u8 = ptr::addr_of!(buf[8..]).cast();
// UB: buf[5] lies outside the subslice buf[8..] that p may access
unsafe { ptr::read(p.offset(-3)) };
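For contrast, here is a sketch of one way to reach the same byte legally, assuming the pointer is derived from the whole buffer before any offsetting. The function name `read_before_midpoint` is made up for illustration:

```rust
use std::ptr;

/// Reads buf[5] by going to the midpoint (index 8) and stepping back 3.
/// `read_before_midpoint` is a hypothetical name for this illustration.
fn read_before_midpoint(buf: &[u8; 16]) -> u8 {
    // Derived from the whole buffer, so any of the 16 bytes may be accessed.
    let base: *const u8 = ptr::addr_of!(*buf).cast();
    // SAFETY: offset 8 followed by offset -3 lands on index 5, inside `buf`,
    // and every intermediate pointer stays within the same allocation.
    unsafe { ptr::read(base.offset(8).offset(-3)) }
}

fn main() {
    let mut buf = [0u8; 16];
    buf[5] = 42;
    assert_eq!(read_before_midpoint(&buf), 42);
}
```

The offsets are the same as in the failing snippet; only the range the starting pointer is allowed to access differs.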

Also available as “let p = slice.as_ptr();” / “buf.as_ptr()”.
