Setting a bit without branching

Someone asked a question on Stack Overflow about setting a bit of a byte. My question is: given a boolean and an offset, can we set the bit at that offset to the value of the boolean without branching?

Secondary question: is there anything to be gained by avoiding branching (assuming a simple embedded chip rather than one with pipelining and branch prediction)?

Just writing

pub fn xx(mut a: u8, b: bool) -> u8 {
    if b {
        a |= 8
    }
    a
}

is enough.

You can test such code on Godbolt.

Rust would optimize the if clause into a cmove:

xx:
        mov     eax, edi
        or      al, 8
        movzx   eax, al
        test    esi, esi
        cmove   eax, edi
        ret
1 Like

I thought the goal was to always set the bit to the boolean's value, and in that case the provided code does not work. I would write the following:

pub fn xx(a: u8, b: bool) -> u8 {
    if b {
        a | 0b1000
    } else {
        a & !0b1000
    }
}

It compiles to the following assembly (I don't know whether this is more or less efficient):

playground::xx:
	and	dil, -9
	lea	eax, [8*rsi]
	or	al, dil
	ret

Rust Playground

2 Likes

Not sure about the assembly generated and whether this is faster or not, but you can write it entirely without if:

fn set_bit_cond(byte: &mut u8, pos: u8, bit: bool) {
    *byte |= (bit as u8) << pos
}

fn clear_bit_cond(byte: &mut u8, pos: u8, bit: bool) {
    *byte &= !((bit as u8) << pos)
}

fn store_bit(byte: &mut u8, pos: u8, bit: bool) {
    clear_bit_cond(byte, pos, !bit);
    set_bit_cond(byte, pos, bit);
}

fn main() {
    let mut x: u8 = 0;
    store_bit(&mut x, 0, true);
    assert_eq!(x, 0x01);
    store_bit(&mut x, 7, true);
    assert_eq!(x, 0x81);
    store_bit(&mut x, 0, false);
    assert_eq!(x, 0x80);
}

(Playground)


I just checked what the assembler output looks like when using store_bit:

pub fn foo(x: &mut u8, bit: bool) {
    store_bit(x, 3, bit);
}

using the previous store_bit method gives (with opt-level=3):

example::foo:
        shl     sil, 3
        lea     eax, [rsi - 9]
        and     al, byte ptr [rdi]
        or      al, sil
        mov     byte ptr [rdi], al
        ret

Note that "without branching" is normally a goal of security-sensitive cryptographic code that is trying to avoid leaking secrets through differences in execution time, rather than a performance goal (in fact, branchless programming techniques will normally run slower; that's the whole point, after all).

In these contexts, even not having an explicit branch in the assembly isn't enough, but I'm not nearly confident enough on the subject to give any sort of exhaustive list. (I'm not sure anyone is, actually.)

So, uh, be careful that you're actually doing whatever you're trying to do, I guess?
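For illustration, one common constant-time idiom (a sketch only, not audited crypto code; the name `ct_store_bit` is mine) expands the boolean into an all-ones or all-zeros mask and selects with it instead of branching:

```rust
/// Store `bit` at position `pos` of `byte` without branching.
/// Sketch of the classic mask-select idiom; not audited crypto code.
fn ct_store_bit(byte: u8, pos: u8, bit: bool) -> u8 {
    // true -> 0xFF, false -> 0x00
    let mask = (bit as u8).wrapping_neg();
    // Unconditionally clear the target bit, then OR in the masked bit.
    (byte & !(1 << pos)) | (mask & (1 << pos))
}
```

For example, `ct_store_bit(0b1111_1111, 3, false)` yields `0b1111_0111`, and `ct_store_bit(0, 3, true)` yields `0b0000_1000`. Whether this actually executes in constant time on a given chip is exactly the kind of question the caveat above is about.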

2 Likes

Avoiding branches (in particular: branches that are not predictable by the CPU) is also a performance optimization technique. A mis-predicted branch has a performance penalty (10s of cycles) because it stalls the instruction pipeline. The CPU will have started executing the wrong sequence of instructions (speculative execution) and then throw the results away.

1 Like

The assembler here is pretty similar to the code by @leob, just with extra instructions to load and store the value. The compiler transforms both alternatives to something like

x = (x & ~8) | (bit << 3);

I.e., unconditionally clear bit 3, then set it to the value of bit.
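In Rust, that transformation corresponds to something like this sketch (the function name is mine):

```rust
/// Unconditionally clear bit 3, then set it to the value of `bit`.
fn store_bit3(x: u8, bit: bool) -> u8 {
    (x & !0b1000) | ((bit as u8) << 3)
}
```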

1 Like

If you don't even have pipelining, there's probably no point in avoiding branching. The branch-free code is more instructions in order to avoid the branch, so is only worth it if you have a chip where the expected penalty of the branch is higher than the cost of those instructions.

BTW, you probably meant Superscalar processor - Wikipedia or Speculative execution - Wikipedia -- pipelining is a very old technique (by processor standards) that requires minimal extra die space, so tends to be found even on small chips because it's such a huge performance win for very modest cost.

3 Likes

By replacing or with addition, you can save an additional instruction (since lea can compute x + 8*y). I'm a little disappointed that llvm doesn't recognize this.
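Concretely, the addition variant is equivalent here because the AND has already cleared bit 3, so the add can never carry into higher bits (a sketch; the function name is mine):

```rust
/// Same result as (x & !0b1000) | ((bit as u8) << 3):
/// bit 3 is zero after the AND, so `+` and `|` agree,
/// and x + 8*y can map onto a single lea on x86.
fn store_bit3_add(x: u8, bit: bool) -> u8 {
    (x & !0b1000) + ((bit as u8) << 3)
}
```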

Sounds like a great bug to report! They're on github now, so it's much easier than it used to be: https://github.com/llvm/llvm-project/issues/new.

(LLVM will probably turn addition into bit_or because it sees they don't overlap, and canonicalization wants the simplest form in the IR -- it's easier for other optimizations to deal with bit-parallel things instead of worrying about carries. But I agree this sounds like a great thing for the LLVM assembly generator to do.)

Note that executing fewer instructions isn't necessarily always a benefit. Different instructions have different data dependencies and latencies, so the specific choice of instruction can have an outsized impact on processor time. (On the order of cycles, of course.) The processor manual may not even provide a fully accurate picture for modern chips; there are instructions that carry "false" data dependencies due to chip design, which can cause formally unnecessary pipeline stalls, and documentation of these is often very sparse.

In general, fewer instructions is better, and using lea is generally a better choice than other instructions with the same effect, but just be aware that exceptions exist at every corner. x86 and even ARM are complicated beasts when it comes to instruction level optimization; we're long past the days where processor performance is trivially predictable from the assembly.

While true (and why I added some weasel words), most of the time the compiler is pretty good at figuring out whether or not a branch is needed for performance, as seen in the godbolt links. There are always exceptions, of course!

But if you're looking up "how do I branchless" you'll find a lot of deliberately slower code, so it's still relevant.

1 Like