Why doesn't the compiler optimize to the best assembly possible?

I know that the Rust compiler will reorder struct fields automatically. Consider this C++ and Rust code.

C++ does not optimize the struct field order; as you can see, it writes 4 bytes, then 8 bytes, then 4 bytes, which is not good.

Zig is the same.

The Rust version does some optimization: it only writes to memory 2 times. Memory is slow, so avoiding it makes things faster.

But that is not the best possible. Consider the next pictures.

Now C++ has better output than Rust: it does not write to memory at all, and works entirely in registers.

Zig is the same.

Rust can achieve the same, but it requires a hack that almost no Rust user uses, judging by the code I see on GitHub, whether in libraries or personal projects.

My question is: why doesn't the Rust compiler optimize this to the best version?

  1. Can you prove that that's the best one? "Less assembly" is not at all "best" for speed -- see Converting a BGRA &[u8] to RGB [u8;N] (for images)? - #13 by CAD97 for some incredible measurement work that demonstrates that nicely.
  2. If you think "memory is slow, so avoiding it makes things faster", then the one that reads from a constant and then writes must be slower than the one that writes an immediate, no?
  3. https://github.com/llvm/llvm-project/ is happy to accept PRs for good optimizations; if there's something missing feel free to work on it.
  4. It's unclear whether "best" is even knowable for an optimizer.
  5. Even if it is knowable, there will always be engineering tradeoffs -- if nothing else, because you want compilation to finish in less than a week.
  6. Do you have any indication of this assembly difference having a measurable impact on a real workload?
7 Likes

I think the ABI is incompatible: multiple scalars cannot be passed in a single register, even if the total size would fit in one.

to be honest, this is a contrived example and only matters for functions with external linkage.

normally, such functions are perfect candidates for inlining, and the whole function would then probably be completely optimized away.

for example:
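(The example that followed seems to have been lost in this thread; here is a minimal sketch of the inlining point, with names of my own invention:)

```rust
// Sketch: `make_point` constructs a struct, but once it is inlined
// into `sum`, the struct never materializes in memory at all.
#[allow(dead_code)]
struct Point {
    x: u32,
    y: u64,
    z: u32,
}

#[inline]
fn make_point() -> Point {
    Point { x: 1, y: 2, z: 3 }
}

pub fn sum() -> u64 {
    let p = make_point();
    // After inlining and constant folding, this whole function
    // typically compiles down to returning the constant 6.
    p.x as u64 + p.y + p.z as u64
}

fn main() {
    assert_eq!(sum(), 6);
}
```

Only with external linkage (`extern "C"` / `#[no_mangle]`), as in the screenshots, is the compiler forced to emit the full struct-building code.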

2 Likes

I think this is related to how different languages handle struct field reordering.

Rust optimizes for memory density by default, reordering fields to minimize padding. The default ordering for this struct, for example, uses 24 bytes:

struct Record {                                                                    
    active: bool,                                                                  
    id: u64,                                                                       
    flags: u8,                                                                     
    timestamp: u64,                                                                
    kind: u8,                                                                      
}                                                                                  
                                                                                    
#[unsafe(no_mangle)]                                                               
pub fn get_size() -> usize {                                                       
    std::mem::size_of::<Record>()                                                  
}

Assembly:
get_size:
  mov eax, 24
  ret

C++ uses 40 bytes:

#include <cstdint>                                                                 
#include <cstddef>                                                                 
                                                                                    
struct Record {                                                                    
    bool active;                                                                   
    uint64_t id;                                                                   
    uint8_t flags;                                                                 
    uint64_t timestamp;                                                            
    uint8_t kind;                                                                  
};                                                                                 
                                                                                    
extern "C" size_t get_size() {                                                     
    return sizeof(Record);                                                         
}                                                                                  
                                                                                                                                       
Assembly:
get_size:
  mov eax, 40
  ret

However, you're right that this can hurt ABI efficiency for small struct returns. But in the example I provided, 1 million records would use 24 MB of memory in Rust versus 40 MB in C++.

Rust chose memory efficiency as the default because:

  1. It benefits arrays and cache behavior (the common case)
  2. When you need ABI-optimal layout, #[repr(C)] gives you explicit control (it's not really that uncommon)
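As a sketch of that explicit control (sizes assume a typical 64-bit target), `#[repr(C)]` on the same struct keeps the declaration order and reproduces the C++ layout, padding included:

```rust
// With #[repr(C)], Rust lays fields out in declaration order,
// matching the C++ struct above byte for byte.
#[allow(dead_code)]
#[repr(C)]
struct RecordC {
    active: bool,   // 1 byte + 7 bytes padding
    id: u64,        // 8 bytes
    flags: u8,      // 1 byte + 7 bytes padding
    timestamp: u64, // 8 bytes
    kind: u8,       // 1 byte + 7 bytes trailing padding
}

fn main() {
    assert_eq!(std::mem::size_of::<RecordC>(), 40);
}
```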

EDIT: The point I am trying to make is that the converse argument can be made for C++ too. I am not familiar with Zig, so maybe it optimizes for both small and large structs by default? I am sure someone more experienced can provide a more nuanced answer here.

1 Like

Another interesting point: the compiler does not do any reordering optimization for the following struct, whose fields are declared in a random order. Is this a bug?


#[repr(C)]
struct DataC {
    a0: u64,
    a1: u64,
    a2: u64,
    a3: u64,
    a4: u64,
    a5: u64,
    a6: u64,
    a7: u64,

    c0: u64,
    c1: u64,
    c2: u64,
    c3: u64,
    c4: u64,
    c5: u64,
    c6: u64,
    c7: u64,

    b0: f64,
    b1: f64,
    b2: f64,
    b3: f64,
    b4: f64,
    b5: f64,
    b6: f64,
    b7: f64,
}

struct DataRust {
    a0: u64,
    b0: f64,
    c0: u64,
    b1: f64,
    a1: u64,
    b2: f64,
    c1: u64,
    a2: u64,

    b3: f64,
    c2: u64,
    b4: f64,
    a3: u64,
    c3: u64,
    b5: f64,
    a4: u64,
    b6: f64,

    c4: u64,
    a5: u64,
    b7: f64,
    c5: u64,
    a6: u64,
    c6: u64,
    a7: u64,
    c7: u64,
}

impl DataC {
    pub fn new() -> Self {
        Self {
            a0: 1,
            a1: 2,
            a2: 3,
            a3: 4,
            a4: 5,
            a5: 6,
            a6: 7,
            a7: 8,

            c0: 9,
            c1: 10,
            c2: 11,
            c3: 12,
            c4: 13,
            c5: 14,
            c6: 15,
            c7: 16,

            b0: 1.0,
            b1: 2.0,
            b2: 3.0,
            b3: 4.0,
            b4: 5.0,
            b5: 6.0,
            b6: 7.0,
            b7: 8.0,
        }
    }
}

impl DataRust {
    pub fn new() -> Self {
        Self {
            a0: 1,
            a1: 2,
            a2: 3,
            a3: 4,
            a4: 5,
            a5: 6,
            a6: 7,
            a7: 8,

            b0: 1.0,
            b1: 2.0,
            b2: 3.0,
            b3: 4.0,
            b4: 5.0,
            b5: 6.0,
            b6: 7.0,
            b7: 8.0,

            c0: 9,
            c1: 10,
            c2: 11,
            c3: 12,
            c4: 13,
            c5: 14,
            c6: 15,
            c7: 16,
        }
    }
}

#[unsafe(no_mangle)]
extern "C" fn task_c() -> DataC {
    DataC::new()
}

#[unsafe(no_mangle)]
fn task_rust() -> DataRust {
    DataRust::new()
}

So I have to order the fields manually. The effect is that LLVM is able to auto-vectorize (auto SIMD) the manually ordered version, but cannot auto-vectorize the default version.

.LCPI0_0:
        .quad   0x3ff0000000000000
        .quad   0x4000000000000000
.LCPI0_1:
        .quad   0x4008000000000000
        .quad   0x4010000000000000
.LCPI0_2:
        .quad   0x4014000000000000
        .quad   0x4018000000000000
.LCPI0_3:
        .quad   0x401c000000000000
        .quad   0x4020000000000000
task_c:
        mov     rax, rdi
        mov     qword ptr [rdi], 1
        mov     qword ptr [rdi + 8], 2
        mov     qword ptr [rdi + 16], 3
        mov     qword ptr [rdi + 24], 4
        mov     qword ptr [rdi + 32], 5
        mov     qword ptr [rdi + 40], 6
        mov     qword ptr [rdi + 48], 7
        mov     qword ptr [rdi + 56], 8
        mov     qword ptr [rdi + 64], 9
        mov     qword ptr [rdi + 72], 10
        mov     qword ptr [rdi + 80], 11
        mov     qword ptr [rdi + 88], 12
        mov     qword ptr [rdi + 96], 13
        mov     qword ptr [rdi + 104], 14
        mov     qword ptr [rdi + 112], 15
        mov     qword ptr [rdi + 120], 16
        movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
        movups  xmmword ptr [rdi + 128], xmm0
        movaps  xmm0, xmmword ptr [rip + .LCPI0_1]
        movups  xmmword ptr [rdi + 144], xmm0
        movaps  xmm0, xmmword ptr [rip + .LCPI0_2]
        movups  xmmword ptr [rdi + 160], xmm0
        movaps  xmm0, xmmword ptr [rip + .LCPI0_3]
        movups  xmmword ptr [rdi + 176], xmm0
        ret

task_rust:
        mov     rax, rdi
        mov     qword ptr [rdi], 1
        movabs  rcx, 4607182418800017408
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi + 16], 9
        movabs  rcx, 4611686018427387904
        mov     qword ptr [rdi + 24], rcx
        mov     qword ptr [rdi + 32], 2
        movabs  rcx, 4613937818241073152
        mov     qword ptr [rdi + 40], rcx
        mov     qword ptr [rdi + 48], 10
        mov     qword ptr [rdi + 56], 3
        movabs  rcx, 4616189618054758400
        mov     qword ptr [rdi + 64], rcx
        mov     qword ptr [rdi + 72], 11
        movabs  rcx, 4617315517961601024
        mov     qword ptr [rdi + 80], rcx
        mov     qword ptr [rdi + 88], 4
        mov     qword ptr [rdi + 96], 12
        movabs  rcx, 4618441417868443648
        mov     qword ptr [rdi + 104], rcx
        mov     qword ptr [rdi + 112], 5
        movabs  rcx, 4619567317775286272
        mov     qword ptr [rdi + 120], rcx
        mov     qword ptr [rdi + 128], 13
        mov     qword ptr [rdi + 136], 6
        movabs  rcx, 4620693217682128896
        mov     qword ptr [rdi + 144], rcx
        mov     qword ptr [rdi + 152], 14
        mov     qword ptr [rdi + 160], 7
        mov     qword ptr [rdi + 168], 15
        mov     qword ptr [rdi + 176], 8
        mov     qword ptr [rdi + 184], 16
        ret
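
(Side note on reading that output: the `movabs` immediates are just the IEEE-754 bit patterns of the `f64` field values, which can be checked directly:)

```rust
fn main() {
    // The movabs constants in task_rust are raw f64 bit patterns.
    assert_eq!(1.0_f64.to_bits(), 4607182418800017408); // stored at [rdi + 8]
    assert_eq!(2.0_f64.to_bits(), 4611686018427387904); // stored at [rdi + 24]
    assert_eq!(8.0_f64.to_bits(), 4620693217682128896); // stored at [rdi + 144]
}
```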

The layout optimization it does is purely based on the field types. There's no "POGO for layout". I wouldn't call it a bug.

1 Like

I've only skimmed it, but this article from a year and a half ago looks like it goes into detail on the issue you're seeing. It seems that for calling conventions rustc just punts to LLVM, which has some surprisingly not-so-great defaults when compared to C.

Did you also check the GNU Rust compiler?

It does not. Rustc does compute the calling convention on its own except for which exact register to use and the stack layout. For the Rust ABI we do cast values that fit in a single register to be passed in one:

but we don't do this for values that need two registers anymore since

2 Likes