Volatile read of a `[u8; 8]` from MMIO returns `0`

In attempting to read a [u8; 8] from an MMIO register (to re-compose into a native u64 from little-endian bytes), the read_volatile seems to always return zero. However, when attempting to read a u64 it works fine. To my mind, reading 8 bytes as an array and reading 8 bytes as an integer should be equivalent.

What could be causing this?

EDIT: It appears that in release mode, the behaviour of read_volatile is much more clear here. It effectively translates to 8 1-byte reads, which are then re-composed into a single register. This is very weird to me.

Why? No, really: why?

The data bus in most modern CPUs is wider than one byte. It can be 32 bits wide, 64 bits wide, even 256 bits wide on some CPUs!

“Access to one byte” is an illusion that the CPU provides to programmers, but it only works for regular memory. With MMIO and volatile access you get whatever semantics your CPU exposes… and you are not even telling us what that is!

Does C work differently on that hardware?

I'm sorry, I don't really see where I referenced accessing one byte. The CPU is a KVM host-emulated x86_64 CPU. If I am retrieving 8 bytes from RAM via a volatile read, the type information of the compiler should not matter to that operation; only that there is an 8-byte, well-aligned load from memory that should not be elided or merged with other loads.

This is somewhat illustrated by the disassembly, I think:

; reading a `[u8; 8]`
push   rbp
mov    rbp,rsp
sub    rsp,0x20
mov    QWORD PTR [rbp-0x18],rdi
mov    QWORD PTR [rbp-0x8],rdi
lea    rsi,[rip+0x1e6781]        # ffffffff804018b8 <memmove+0x159558>
call   ffffffff80253f60 <<*const [u8; 8]>::read_volatile>
mov    QWORD PTR [rbp-0x10],rax
mov    rax,QWORD PTR [rbp-0x10]
mov    QWORD PTR [rbp-0x20],rax
mov    rax,QWORD PTR [rbp-0x20]
add    rsp,0x20
pop    rbp
ret
; reading a `u64`
push   rbp
mov    rbp,rsp
sub    rsp,0x10
mov    QWORD PTR [rbp-0x10],rdi
mov    QWORD PTR [rbp-0x8],rdi
lea    rsi,[rip+0x1e67c1]        # ffffffff804018b8 <memmove+0x159638>
call   ffffffff80253f40 <<*const u64>::read_volatile>
add    rsp,0x10
pop    rbp

So far as I can see, they are roughly equivalent, save that the load of a [u8; 8] generates some seemingly pointless memory moves.

How is the MMIO register defined? Can you share some code snippets?

When you read [u8; 8] you are asking the compiler to read a byte 8 times and coalesce the result.

When you are reading u64 you are doing one memory access.

The Rust compiler follows your requests pretty carefully: note how foo has eight memory accesses and bar has one in the following example:

pub fn foo(x: &[u8; 8]) -> [u8; 8] {
    unsafe { (x as *const [u8; 8]).read_volatile() }
}

pub fn bar(x: &u64) -> u64 {
    unsafe { (x as *const u64).read_volatile() }
}

They are also using different functions for memory access, which is the whole point!
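If the byte array is what you ultimately want, one option is to keep the volatile access typed as u64 and only split it into bytes afterwards. The following is a minimal sketch, not a definitive implementation: the function name `read_reg_bytes` is hypothetical, and it assumes the register is 8-byte aligned and that the target performs an aligned u64 volatile load as a single access (which is what the disassembly of bar suggests on x86_64).

```rust
use core::ptr;

/// Hypothetical helper: reads 8 bytes from a register using one u64-typed
/// volatile load and returns them in the order they sit in memory.
///
/// # Safety
/// `reg` must be valid for reads and 8-byte aligned.
pub unsafe fn read_reg_bytes(reg: *const u64) -> [u8; 8] {
    // The volatile access is typed as u64, not [u8; 8].
    let raw = unsafe { ptr::read_volatile(reg) };
    // to_ne_bytes() is a plain value conversion on the loaded integer;
    // it does not touch the register again.
    raw.to_ne_bytes()
}
```

The returned array can then be passed to `u64::from_le_bytes` or `u64::from_be_bytes`, depending on how the device documents the register.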

I see, that makes sense. It is important that I read the MMIO register bytes as little-endian; do you know how I can ensure this without reading the integer types directly from the register?

Read as u64 and call swap_bytes?

It's also important to define what that phrase even means, first.

You can't. When you read a chunk from MMIO you are not getting a “little-endian” value or a “big-endian” value, you just get a 32-bit or 64-bit value (very rarely a 16-bit value on things like AVR), which is then interpreted differently depending on the CPU.

This may or may not be the best approach. You really need to know what you are doing at this point: things like MMIO and read_volatile work “below” the usual abstractions, and you need to know what you are reading, why, and how. When different pieces of hardware interact you may even end up with mixed-endian values (probably not in Rust: I don't know of any platform that uses mixed-endian values and is compatible with Rust).

I do not see how that applies to what I said. The byte ordering of the values stored in the MMIO registers is very much either little- or big- endian. Network bytes, for instance, are often big-endian, whereas the byte ordering of the value I am reading from the MMIO register is little-endian.

@jendrikw that is definitely one solution. I was hoping I could stick with zerocopy's endian-aware types (U64<LE>, U32<LE>, etc.), but it seems like that isn't an option here, since they are transparent wrappers around byte arrays.

Yes, but that explains how the bytes are put on the data bus. And then the CPU interprets them one way or another depending on how it's wired. The bus itself is just 64 wires[1].

Yes, unfortunately that's where leaky abstractions get you. Anyone who has worked with planar graphics would tell you that reading (or writing) a byte two times and reading (or writing) two bytes are very different operations.

Modern software developers are insulated from that difference by layers of abstractions… but MMIO is where these break down.

Rust can't hide that difference, because if it did… what would you use when you really need to do a byte access? Like when you program VGA? Even the latest and greatest GPU includes that beast, for compatibility…


  1. Well, there are sometimes pull resistors involved, but let's ignore that for now.

Aha, I see where you're coming from, now. Thank you for all the explanations. And yes, I hadn't considered that the endian-aware types from zerocopy would be considered leaky abstractions; I had never really interfaced with that concept.

I think, for this, I will go ahead and write up some types that are backed by u64/u32/u16, and use swap_bytes() if the platform target's target_endian is different from the endianness of the wrapping type, as @jendrikw suggested.
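A rough sketch of what such a type could look like, under some assumptions: the name `LeU64` and its `read` method are made up for illustration, the register is assumed to be 8-byte aligned, and only the u64 case is shown (u32 and u16 would follow the same pattern).

```rust
use core::ptr;

/// Hypothetical wrapper for a 64-bit register whose contents are
/// documented as little-endian. It is backed by a u64 rather than a byte
/// array, so the volatile access stays a single integer load.
#[repr(transparent)]
pub struct LeU64(u64);

impl LeU64 {
    /// Performs one u64-typed volatile load, then converts the value to
    /// the target's native byte order.
    ///
    /// # Safety
    /// `reg` must be valid for reads and 8-byte aligned.
    pub unsafe fn read(reg: *const LeU64) -> u64 {
        // Cast to *const u64 so the access is typed as an integer.
        let raw = unsafe { ptr::read_volatile(reg.cast::<u64>()) };
        // Swap only when the target's endianness differs from the
        // register's (little-endian) byte order.
        if cfg!(target_endian = "little") {
            raw
        } else {
            raw.swap_bytes()
        }
    }
}
```

On a little-endian target such as x86_64 the swap branch is compiled out, so this should cost no more than the plain u64 read.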

I appreciate all the help!

C's behavior indeed differs from Rust's in this case:

I'm not sure whether this is intentional or just not "optimized" well, but at least to me, it feels very much like a bug in the rustc code generator.

look at the emitted LLVM IR of foo(): a single load instruction is issued for the type [8 x i8], but somehow the following instructions extract individual bytes from the loaded value, just to re-combine them into a scalar again:

define i64 @foo(ptr noalias noundef readonly align 1 captures(address, read_provenance) dereferenceable(8) %x) unnamed_addr {
start:
  %0 = load volatile [8 x i8], ptr %x, align 1
  %.fca.0.extract = extractvalue [8 x i8] %0, 0
  %.fca.1.extract = extractvalue [8 x i8] %0, 1
  %.fca.2.extract = extractvalue [8 x i8] %0, 2
  %.fca.3.extract = extractvalue [8 x i8] %0, 3
  %.fca.4.extract = extractvalue [8 x i8] %0, 4
  %.fca.5.extract = extractvalue [8 x i8] %0, 5
  %.fca.6.extract = extractvalue [8 x i8] %0, 6
  %.fca.7.extract = extractvalue [8 x i8] %0, 7
  %_0.sroa.8.0.insert.ext = zext i8 %.fca.7.extract to i64
  %_0.sroa.8.0.insert.shift = shl nuw i64 %_0.sroa.8.0.insert.ext, 56
  %_0.sroa.7.0.insert.ext = zext i8 %.fca.6.extract to i64
  %_0.sroa.7.0.insert.shift = shl nuw nsw i64 %_0.sroa.7.0.insert.ext, 48
  %_0.sroa.7.0.insert.insert = or disjoint i64 %_0.sroa.8.0.insert.shift, %_0.sroa.7.0.insert.shift
  %_0.sroa.6.0.insert.ext = zext i8 %.fca.5.extract to i64
  %_0.sroa.6.0.insert.shift = shl nuw nsw i64 %_0.sroa.6.0.insert.ext, 40
  %_0.sroa.6.0.insert.insert = or disjoint i64 %_0.sroa.7.0.insert.insert, %_0.sroa.6.0.insert.shift
  %_0.sroa.5.0.insert.ext = zext i8 %.fca.4.extract to i64
  %_0.sroa.5.0.insert.shift = shl nuw nsw i64 %_0.sroa.5.0.insert.ext, 32
  %_0.sroa.5.0.insert.insert = or disjoint i64 %_0.sroa.6.0.insert.insert, %_0.sroa.5.0.insert.shift
  %_0.sroa.4.0.insert.ext = zext i8 %.fca.3.extract to i64
  %_0.sroa.4.0.insert.shift = shl nuw nsw i64 %_0.sroa.4.0.insert.ext, 24
  %_0.sroa.4.0.insert.insert = or disjoint i64 %_0.sroa.5.0.insert.insert, %_0.sroa.4.0.insert.shift
  %_0.sroa.3.0.insert.ext = zext i8 %.fca.2.extract to i64
  %_0.sroa.3.0.insert.shift = shl nuw nsw i64 %_0.sroa.3.0.insert.ext, 16
  %_0.sroa.2.0.insert.ext = zext i8 %.fca.1.extract to i64
  %_0.sroa.2.0.insert.shift = shl nuw nsw i64 %_0.sroa.2.0.insert.ext, 8
  %_0.sroa.2.0.insert.mask = or disjoint i64 %_0.sroa.4.0.insert.insert, %_0.sroa.3.0.insert.shift
  %_0.sroa.0.0.insert.ext = zext i8 %.fca.0.extract to i64
  %_0.sroa.0.0.insert.mask = or disjoint i64 %_0.sroa.2.0.insert.mask, %_0.sroa.2.0.insert.shift
  %_0.sroa.0.0.insert.insert = or i64 %_0.sroa.0.0.insert.mask, %_0.sroa.0.0.insert.ext
  ret i64 %_0.sroa.0.0.insert.insert
}

btw, it's not because of Rust's u8 vs LLVM's i8: changing the source code to [i8; 8] results in the same LLVM IR.

You may be looking for the {from, to}_{le,be} methods? eg
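For illustration only (this is not the snippet from the post, and the function names `read_le_u64` and `read_be_u64` are hypothetical), a sketch of how `u64::from_le` and `u64::from_be` could be combined with a u64 volatile read, assuming an 8-byte aligned register:

```rust
use core::ptr;

/// Hypothetical helper: reads a register documented as little-endian.
///
/// # Safety
/// `reg` must be valid for reads and 8-byte aligned.
pub unsafe fn read_le_u64(reg: *const u64) -> u64 {
    // from_le is a no-op on little-endian targets and a byte swap on
    // big-endian ones.
    u64::from_le(unsafe { ptr::read_volatile(reg) })
}

/// Hypothetical helper: reads a register documented as big-endian.
///
/// # Safety
/// `reg` must be valid for reads and 8-byte aligned.
pub unsafe fn read_be_u64(reg: *const u64) -> u64 {
    // from_be is a byte swap on little-endian targets and a no-op on
    // big-endian ones.
    u64::from_be(unsafe { ptr::read_volatile(reg) })
}
```

These operate on the integer value directly, so no round trip through a byte array is needed.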

It's probably because [u8; 8] has an alignment of one, so it's not necessarily valid to perform an 8-byte load on a memory address that is not properly aligned.

I don't think that's the cause.

the first IR instruction is a single volatile load, and the alignment is part of that instruction. because the alignment requirement is architecture-dependent, it is not a factor in what IR is emitted.

to demonstrate it is not because of the alignment, but because of the volatile:

For which T is it legal in Rust to do a volatile read of *const T? It seems weird that you can do this for types besides (transparent wrappers of) primitive integers and pointers.

With current volatile semantics, for any T [if the specified address contains a valid value of that type]. Hopefully it will be revisited.

That would not be particularly ergonomic, because I would then have to convert the u64 that I read to bytes, and convert back. With swap_bytes(), I can change the byte ordering in-place based on the source and target endianness. EDIT: I misread which methods were meant.

I will look into those methods! Thank you.

Hi,
does this imply that data loaded from an MMIO-mapped device which internally stores it in big-endian order cannot be expected to simply appear (or be interpreted as) byte-swapped on a little-endian CPU? I did not get the part about "And then CPU interprets them [...] depending on how it's wired". Isn't this only relevant at the point where someone tries to interpret the data numerically (i.e. add, mul, sub, printing, etc.)? What is meant by "depending on how it's wired"?