Excessive ASM instructions for typed ptr copies

I've been working on a serializer (ser_raw) which does a lot of memory copies, and have been investigating how fast different methods of copying are. One thing has me confused.

I would have expected that all of the following would produce the same assembly:

use std::ptr;

#[repr(C)]
pub struct Foo { x: u32, y: u16, z: u16 }

#[repr(C)]
pub struct Bar { x: u32, y: u16, z: u16 }

pub unsafe fn write_foo(foo: &Foo, out: *mut u8) {
    ptr::write(out as *mut Foo, Foo { ..*foo });
}

pub unsafe fn write_foo_as_bar(foo: &Foo, out: *mut u8) {
    ptr::write(out as *mut Bar, Bar { x: foo.x, y: foo.y, z: foo.z });
}

pub unsafe fn copy_foo(foo: &Foo, out: *mut u8) {
    ptr::copy_nonoverlapping(foo as *const Foo, out as *mut Foo, 1);
}

However, Godbolt says otherwise: Compiler Explorer
(and Rust playground confirms: Rust Playground)

The first 2 functions are compiled to:

mov   eax, dword ptr [rdi]
mov   dword ptr [rsi], eax
mov   eax, dword ptr [rdi + 4]
mov   dword ptr [rsi + 4], eax

But copy_foo is only 2 instructions:

mov   rax, qword ptr [rdi]
mov   qword ptr [rsi], rax

What has me really confused is that the compiler does combine the two u16 read+writes into a single u32 read+write, but it stops there rather than combining again to end up with a single 8-byte read+write, as in copy_foo.

I suspect it has something to do with alignment, because if Foo's fields are (u16, u8, u8), the copy does get reduced to a single 4-byte read+write. But still, why? The same compiler quite happily produces unaligned read/write instructions for other code.
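
For reference, this is roughly the smaller variant I mean (the struct and function names here are just illustrative, and it reuses the use std::ptr; import from above); this one does collapse into a single 4-byte read+write:

#[repr(C)]
pub struct SmallFoo { x: u16, y: u8, z: u8 }

// Same field-by-field construction as write_foo, just with a 4-byte struct.
pub unsafe fn write_small_foo(foo: &SmallFoo, out: *mut u8) {
    ptr::write(out as *mut SmallFoo, SmallFoo { ..*foo });
}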

Hope someone can explain these mysteries!

It looks like Foo { ..*foo } is inefficient.

In MIR, *foo is a single operation, but ..*foo copies field by field. For some reason LLVM can't/won't put the separate field copies back together.
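
If that's the culprit, then a version that moves the whole value at once, rather than rebuilding it field by field, should presumably compile like copy_foo. An untested sketch:

// Sketch: read the whole Foo as one value and write it back out as one value.
pub unsafe fn write_foo_whole(foo: &Foo, out: *mut u8) {
    ptr::write(out as *mut Foo, ptr::read(foo as *const Foo));
}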

Yes, it does seem so. That leads to 3 more questions:

  • Should Foo { ..*foo } be more efficient?
  • If it should, whose fault is it that it isn't? LLVM's? Or Rust's, for not giving LLVM the info required to make the optimization?
  • Any idea why LLVM does put the separate field copies back together when Foo is (u16, u8, u8) instead of (u32, u16, u16)? (Godbolt)

It might be a pass ordering thing (e.g. does it merge (u32, u32) okay but not (u32, u16, u16)? almost certainly pass ordering if so). If LLVM checks for merge-i32x2-as-i64 before merge-i8x2-as-i16 and merge-i16x2-as-i32, that would explain the behavior: by the time the two i16 stores have been merged into an i32 store, the pass that could merge two i32 stores has already run and doesn't get another chance. (Pass ordering inefficiencies are rarely that trivial, but that illustrates the idea.)

It might also be an alignment thing (e.g. does sticking a stricter #[repr(align)] on the struct change it? almost certainly alignment if so). It doesn't seem likely to be intentional that LLVM would merge an aligned [i16 x 2] store into an unaligned i32 store but not an aligned [i32 x 2] into an unaligned i64, but I know basically nothing about peephole optimization of AMD64.
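
Something along these lines would be the quick check (AlignedFoo is just a throwaway name for the experiment):

// Sketch: same layout as Foo, but with a stricter alignment requirement.
// Note that out must then actually be 8-byte aligned for the write to be sound.
#[repr(C, align(8))]
pub struct AlignedFoo { x: u32, y: u16, z: u16 }

pub unsafe fn write_aligned_foo(foo: &AlignedFoo, out: *mut u8) {
    ptr::write(out as *mut AlignedFoo, AlignedFoo { ..*foo });
}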

If you're really curious, throw the unoptimized LLVM IR into the Godbolt LLVM view and open the optimization pass view; that will most likely give you more information than you'll know how to interpret.

@CAD97 Thanks for coming back.

It does merge (u32, u32) (Godbolt).
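
The test looks roughly like this (illustrative names):

#[repr(C)]
pub struct Pair { x: u32, y: u32 }

// This variant does come out as a single 8-byte load/store, like copy_foo.
pub unsafe fn write_pair(pair: &Pair, out: *mut u8) {
    ptr::write(out as *mut Pair, Pair { ..*pair });
}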

#[repr(align(8))] (or 16) does not make any difference (Godbolt).

So it sounds like maybe this is a pass ordering thing. If so, would that be considered a bug in LLVM?

On your last point, I am really curious. But I'm afraid I'm pretty new to Rust, and even newer to assembly and LLVM, so have no idea how to get the unoptimized LLIR. If it's quick to explain, would you mind giving me a pointer?

To get the unoptimized LLVM IR you can use -Cno-prepopulate-passes (which disables the passes that always run) in combination with not passing any -Copt-level argument (which would enable optimizations).
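
For example, something like this (the file path is just a placeholder for your crate):

rustc --emit=llvm-ir --crate-type=lib -Cno-prepopulate-passes src/lib.rs

The resulting .ll file is what you'd paste into the Godbolt LLVM view mentioned above.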

Thanks @bjorn3. I'll give that a go. Will report back (but give me some days, work is crazy at the moment).
