Why can't this code be further optimized?

Consider the following code:

/// minimum bytes to represent a signed i32,
/// not considering the sign bit
pub fn mbsi32(x: i32) -> u8 {
    // To fit in n bytes, we require that
    // everything but the leading sign bits fits in n*8
    // bits.
    let n_sign_bits = if x.is_negative() {
        x.leading_ones() as u8
    } else {
        x.leading_zeros() as u8
    };

    (32 - n_sign_bits + 7) / 8
}
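
For reference, a few sample outputs (worked out by hand from the formula above):

assert_eq!(mbsi32(0), 0);        // 32 sign bits -> (0 + 7) / 8 = 0
assert_eq!(mbsi32(-1), 0);       // 32 leading ones -> 0
assert_eq!(mbsi32(255), 1);      // 8 value bits -> 1 byte
assert_eq!(mbsi32(256), 2);      // 9 value bits -> 2 bytes
assert_eq!(mbsi32(i32::MIN), 4); // 31 value bits -> 4 bytes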

In release mode with the default target, it compiles to the following:

playground::mbsi32:
	mov	eax, edi
	sar	eax, 31
	xor	eax, edi
	je	.LBB1_1
	bsr	ecx, eax
	xor	ecx, 31
	mov	al, 39
	sub	al, cl
	shr	al, 3
	ret

.LBB1_1:
	mov	ecx, 32
	mov	al, 39
	sub	al, cl
	shr	al, 3
	ret

I'm wondering why LBB1_1 can't be optimized to just ret. It can only be reached if eax is 0, in which case, shouldn't LLVM be able to do constant folding to figure out that it always returns 0?


This is an artifact of the middle-end and back-end separation in LLVM.

If you compile for a newer version of x64 that has lzcnt, you'll get this instead:

mbsi32:
        mov     eax, edi
        not     eax
        lzcnt   eax, eax
        lzcnt   ecx, edi
        test    edi, edi
        cmovs   ecx, eax
        mov     al, 39
        sub     al, cl
        shr     al, 3
        ret

https://rust.godbolt.org/z/jbsEWGGK4

The generic optimizer in LLVM optimizes it to this:

define noundef i8 @mbsi32(i32 noundef %x) unnamed_addr {
start:
  %x.lobit = ashr i32 %x, 31
  %self.sink = xor i32 %x.lobit, %x
  %0 = tail call i32 @llvm.ctlz.i32(i32 %self.sink, i1 false), !range !3
  %n_sign_bits.0 = trunc i32 %0 to i8
  %_6 = sub nuw nsw i8 39, %n_sign_bits.0
  %_02 = lshr i8 %_6, 3
  ret i8 %_02
}

But that i1 false in the call to ctlz means that it has to handle zero as an input.
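
That flag is there because leading_zeros and leading_ones are defined for every input on the Rust side, including the all-sign-bits case, so the front end can't use the zero-is-poison form of the intrinsic. A quick check:

assert_eq!(0u32.leading_zeros(), 32);
assert_eq!(0i32.leading_zeros(), 32);
assert_eq!((-1i32).leading_ones(), 32);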

However, the bsr instruction doesn't support an input of zero. So on old targets, the target-specific codegen adds a branch. Whereas on new targets it can use lzcnt instead, and doesn't need the extra branch.

And the branch added in codegenprepare isn't optimized out because LLVM's backend doesn't optimize as aggressively as its middle-end does.

NonZeroU32::leading_zeros tells LLVM it doesn't have to worry about zero, and thus avoids the extra branch, but since it sounds like you do care about zero here (after all, the function doesn't take NonZeroI32), I don't think that will help you.
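
For illustration, here's roughly what that route looks like (my own sketch; mbs_folded is a made-up name, and it only applies when the caller already knows x ^ (x >> 31) is non-zero, i.e. x is neither 0 nor -1):

use std::num::NonZeroU32;

// Takes the already sign-folded value; because the type rules out zero,
// leading_zeros lowers to ctlz with the zero-is-poison flag set and the
// guard branch disappears.
pub fn mbs_folded(folded: NonZeroU32) -> u8 {
    (32 - folded.leading_zeros() as u8 + 7) / 8
}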


Why does the IR have a branchless computation (leadz(x ^ (x asr 31))) that gets turned back into something branchy on x86 (if x >= 0 { leadz(x) } else { leadz(!x) })?

As @scottmcm already commented, it's because the code generator can't use the lzcnt instruction and has to emulate the same functionality with the bsr instruction, since lzcnt doesn't exist on the older hardware covered by the default x86_64 target.

You can pass the command-line flag -C target-feature=+lzcnt to the compiler so the backend knows the lzcnt instruction can be used; then you'll get the desired output.

If you are building for your local machine, you can simply use -C target-cpu=native to generate code that can use all the hardware features of your CPU.

The default target architecture choice is over-conservative in my opinion. You can search for keywords like rust and x86_64-v2 for more discussion.

As the OP described, that's not what the branch does -- the branch is if x ^ (x >> 31) == 0. It's only there to keep the bsr from getting a zero input.
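
Written back out as Rust (my transcription of the assembly, not compiler output), the bsr version is doing something like:

fn mbsi32_lowered(x: i32) -> u8 {
    // sar/xor: turn the leading sign bits into leading zeros
    let folded = (x ^ (x >> 31)) as u32;
    let lz = if folded == 0 {
        32                      // the .LBB1_1 path: mov ecx, 32
    } else {
        folded.leading_zeros()  // bsr ecx, eax / xor ecx, 31
    };
    (39 - lz as u8) >> 3        // mov al, 39 / sub al, cl / shr al, 3
}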

I must be missing something. The original assembly with bsr has a branch to test if the input is zero. But the assembly with lzcnt also has a different branch to test if edi < 0. It computes two leading-zero counts. What I'm confused about is why, if you have lzcnt, you don't just get

mov eax, edi
sar eax, 31
xor eax, edi
lzcnt ecx, eax
mov al, 39
sub al, cl
shr al, 3
ret

Is it really faster to do both and cmov?

You'd have to test, like with any optimization, but I wouldn't be surprised: every instruction in your example depends on the previous one, while in @scottmcm's listing the eax and edi paths are independent and can run in parallel:

mbsi32:
        mov     eax, edi
        not     eax
        lzcnt   eax, eax ; negative result 

        ; doesn't depend on previous instructions
        lzcnt   ecx, edi
        test    edi, edi ; positive result + test

        cmovs   ecx, eax ; combine results here

It has a conditional move but not a branch: every instruction in the lzcnt version (other than the ret, of course) always passes control straight to the next instruction in the stream.

They're very different things, especially around out-of-order and speculative execution. Branches can cause problems if they're predicted wrong and execution needs to backtrack. Conditional moves can cause problems if they unnecessarily bring together data dependencies of different latencies.

If you want to go further down this rabbit hole, you can use https://llvm.org/docs/CommandGuide/llvm-mca.html#bottleneck-analysis to compare them. But for big fat desktop chips like x86-64-v3, which is what LLVM was asked to compile for in that example, you have to consider IPC to know what's really faster.


To help illustrate how messy this can be, note that if you ask to compile for a znver2 specifically, you get slightly different assembly: https://rust.godbolt.org/z/9fKP5cesn

mbsi32:
        mov     eax, edi
        lzcnt   ecx, edi
        not     eax
        lzcnt   eax, eax
        test    edi, edi
        cmovs   ecx, eax
        mov     al, 39
        sub     al, cl
        shr     al, 3
        ret

Why? No idea. You'd probably have to read the architecture manuals.


Thank you for the correction.