Is `next_power_of_two` inefficient, or am I missing something?

Most of what I do with Rust at the moment happens on a slightly weird embedded platform (the Game Boy Advance), and I noticed something about `next_power_of_two`.

Godbolt link

fn next_u32(input: u32) -> u32 {
    input.next_power_of_two()
}
example::next_u32:
    push {r4, r6, r7, lr}
    add r7, sp, #8
    mov r4, r0
    sub r0, #1
    bl __clzsi2
    cmp r4, #1
    bhi .LBB0_2
    mov r0, #1
    pop {r4, r6, r7}
    pop {r1}
    bx r1
.LBB0_2:
    mov r1, #0
    mvn r1, r1
    lsr r1, r0
    add r0, r1, #1
    pop {r4, r6, r7}
    pop {r1}
    bx r1

A subroutine call to `__clzsi2` and a branch. On a machine with a hardware instruction for counting leading zeroes this would presumably be nice and fast, but CLZ didn't show up until ARMv5, and the GBA's ARM7TDMI is ARMv4T. Compare that to this algorithm from Bit Twiddling Hacks.

Godbolt link

pub fn next_u32(mut input: u32) -> u32 {
    input -= 1;
    input |= input >> 1;
    input |= input >> 2;
    input |= input >> 4;
    input |= input >> 8;
    input |= input >> 16;
    input + 1
}
example::next_u32:
    sub r0, #1
    lsr r1, r0, #1
    orr r1, r0
    lsr r0, r1, #2
    orr r0, r1
    lsr r1, r0, #4
    orr r1, r0
    lsr r0, r1, #8
    orr r0, r1
    lsr r1, r0, #16
    orr r1, r0
    add r0, r1, #1
    bx lr

Every one of these instructions takes one cycle to execute because it operates entirely on data held in registers. My initial instinct was to take this to the Rust issue tracker, but I'd like confirmation that that's a good idea first, in case there are considerations I'm not thinking of or aware of.
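For what it's worth, the two versions can be cross-checked on a host machine. This is a quick sketch (the function is the one from the post, renamed `bit_twiddle_next` here) comparing it against the standard library over a range of inputs:

```rust
// Sketch: cross-check the bit-twiddling version against the standard
// library's `next_power_of_two` for inputs where both are defined.
pub fn bit_twiddle_next(mut input: u32) -> u32 {
    input -= 1;
    input |= input >> 1;
    input |= input >> 2;
    input |= input >> 4;
    input |= input >> 8;
    input |= input >> 16;
    input + 1
}

fn main() {
    // 0 is deliberately excluded: `0 - 1` wraps, so the two differ there.
    for n in 1..=(1u32 << 20) {
        assert_eq!(bit_twiddle_next(n), n.next_power_of_two());
    }
    // Spot-check the top of the valid range (inputs above 2^31 overflow).
    assert_eq!(bit_twiddle_next((1 << 31) - 1), 1 << 31);
    assert_eq!(bit_twiddle_next(1 << 31), 1 << 31);
    println!("all checks passed");
}
```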

In general, the Rust standard library is optimized for mainstream platforms, which means performance may be suboptimal on some "slightly weird" platforms. There's a trade-off to be made here.


Incidentally, `0_u32.next_power_of_two() == 1`, whereas your version panics in debug mode (the `input -= 1` underflows) and returns 0 in release mode. You need to check for `input <= 1` (unless you take `next_power_of_two(n)` to mean 2^ceil(log2(n)), in which case n = 0 would be undefined).
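A guarded variant of the bit-twiddling version that matches the standard library for 0 and 1 would look roughly like this (a sketch; whether the extra branch costs anything measurable on the ARM7TDMI is worth checking):

```rust
// Sketch: bit-twiddling next-power-of-two with an `input <= 1` guard,
// so it agrees with `u32::next_power_of_two` for 0 and 1.
pub fn next_u32_checked(mut input: u32) -> u32 {
    if input <= 1 {
        return 1; // `0_u32.next_power_of_two() == 1`, likewise for 1
    }
    input -= 1;
    input |= input >> 1;
    input |= input >> 2;
    input |= input >> 4;
    input |= input >> 8;
    input |= input >> 16;
    input + 1
}
```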


Just out of curiosity, what does the equivalent C code look like when compiled with GCC and Clang?

It looks like Rust uses an intrinsic to figure out how many leading zeros there are, then does some twiddling to get the number 1 less than the next power of two. I wouldn't be surprised if LLVM couldn't optimise `ctlz_nonzero` well for your particular architecture, and if that's the case Clang would also generate suboptimal machine code.

(implementation from core for reference, copy-pasted by a macro for each integer type, hence the `$selfTy`)

const fn one_less_than_next_power_of_two(self) -> Self {
    if self <= 1 {
        return 0;
    }

    let p = self - 1;
    // SAFETY: Because `p > 0`, it cannot consist entirely of leading zeros.
    // That means the shift is always in-bounds, and some processors
    // (such as intel pre-haswell) have more efficient ctlz
    // intrinsics when the argument is non-zero.
    let z = unsafe { intrinsics::ctlz_nonzero(p) };
    <$selfTy>::MAX >> z
}

pub const fn next_power_of_two(self) -> Self {
    self.one_less_than_next_power_of_two() + 1
}

Looks like GCC does optimise slightly more. (The Bit Twiddling Hacks algorithm is identical in all three modulo register allocation.)

Fair point. That's still only one short-circuit branch, though.

The implementation in core is written that way because it can use BSR (or BSF?) on x86, which was added for the 80386 back in 1985.

You could certainly send a PR to #[cfg] a different implementation for architectures that don't have such a thing -- it's then up to libs-impl to decide whether it's worth keeping. (I don't know if bors tests on ARMv4 to find out if it breaks, for example.)
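A rough shape such a PR might take (a sketch only; the cfg predicate here is hypothetical, since reliably detecting "no hardware CLZ" would need an appropriate target feature or target-specific cfg, which is exactly the sort of detail libs-impl would have to weigh in on):

```rust
// Sketch of a cfg-gated implementation. The `v5te` feature test is a
// hypothetical stand-in for "this ARM target lacks the CLZ instruction".
#[cfg(all(target_arch = "arm", not(target_feature = "v5te")))]
fn next_power_of_two_impl(input: u32) -> u32 {
    // Shift-or fold: avoids the `__clzsi2` libcall on pre-ARMv5 cores.
    if input <= 1 {
        return 1;
    }
    let mut p = input - 1;
    p |= p >> 1;
    p |= p >> 2;
    p |= p >> 4;
    p |= p >> 8;
    p |= p >> 16;
    p + 1
}

#[cfg(not(all(target_arch = "arm", not(target_feature = "v5te"))))]
fn next_power_of_two_impl(input: u32) -> u32 {
    // Everywhere else, the existing ctlz-based version is fine.
    input.next_power_of_two()
}
```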
