Someone asked a question about setting a bit of a byte on SO. My question is: given a boolean and an offset, can we set the bit at offset to the boolean without branching?
Secondary question: is there anything to be gained by avoiding branching (assuming a simple embedded chip rather than one with pipelining and branch prediction)?
Just wrote:

```rust
pub fn xx(mut a: u8, b: bool) -> u8 {
    if b {
        a |= 8;
    }
    a
}
```

is enough.
You can test such code on Godbolt. Rust optimizes the if clause into a cmove:

```asm
xx:
        mov     eax, edi
        or      al, 8
        movzx   eax, al
        test    esi, esi
        cmove   eax, edi
        ret
```
I thought the goal was to always change the value to the boolean, and in that case the provided code does not work. I would write the following:
```rust
pub fn xx(a: u8, b: bool) -> u8 {
    if b {
        a | 0b1000
    } else {
        a & !0b1000
    }
}
```
It compiles to the following assembly (I don't know whether this is more or less efficient):
```asm
playground::xx:
        and     dil, -9
        lea     eax, [8*rsi]
        or      al, dil
        ret
```
Not sure about the assembly generated or whether this is faster, but you can write it entirely without if:
```rust
fn set_bit_cond(byte: &mut u8, pos: u8, bit: bool) {
    *byte |= (bit as u8) << pos
}

fn clear_bit_cond(byte: &mut u8, pos: u8, bit: bool) {
    *byte &= !((bit as u8) << pos)
}

fn store_bit(byte: &mut u8, pos: u8, bit: bool) {
    clear_bit_cond(byte, pos, !bit);
    set_bit_cond(byte, pos, bit);
}
```
```rust
fn main() {
    let mut x: u8 = 0;
    store_bit(&mut x, 0, true);
    assert_eq!(x, 0x01);
    store_bit(&mut x, 7, true);
    assert_eq!(x, 0x81);
    store_bit(&mut x, 0, false);
    assert_eq!(x, 0x80);
}
```
I just checked what the assembler output is when using store_bit:
```rust
pub fn foo(x: &mut u8, bit: bool) {
    store_bit(x, 3, bit);
}
```
Using the previous store_bit method gives (with opt-level=3):
```asm
example::foo:
        shl     sil, 3
        lea     eax, [rsi - 9]
        and     al, byte ptr [rdi]
        or      al, sil
        mov     byte ptr [rdi], al
        ret
```
Note that "without branching" is normally a goal of cryptographic code that is trying to avoid leaking secrets through timing differences, rather than a performance goal (in fact, branchless programming techniques will often run slower; that's the whole point, after all).
In these contexts, even having no explicit branch in the assembly isn't enough, but I'm not nearly confident enough on the subject to give any sort of exhaustive list (I'm not sure anyone is, actually).
So, uh, be careful that you're actually doing whatever you're trying to do, I guess?
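As a sketch of the branchless pattern such code tends to use (this is my illustration, not vetted constant-time code, and `ct_store_bit` is a made-up name): expand the boolean into an all-ones/all-zeros mask and select with it.

```rust
// Expand `bit` into 0xFF (true) or 0x00 (false), then use the mask to
// select between the set and cleared versions of the target bit,
// without any conditional control flow.
pub fn ct_store_bit(byte: u8, pos: u8, bit: bool) -> u8 {
    let mask = (bit as u8).wrapping_neg(); // 0xFF if true, 0x00 if false
    (byte & !(1u8 << pos)) | (mask & (1u8 << pos))
}
```

Whether this actually stays branch-free after optimization still needs checking in the generated assembly, as discussed above.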
Avoiding branches (in particular: branches that are not predictable by the CPU) is also a performance optimization technique. A mispredicted branch has a performance penalty (tens of cycles) because it stalls the instruction pipeline: the CPU will have started executing the wrong sequence of instructions (speculative execution) and then has to throw the results away.
The assembler here is pretty similar to the code by @leob, just with extra instructions to load and store the value. The compiler transforms both alternatives to something like

```c
x = (x & ~8) | (bit << 3);
```

i.e., unconditionally clear bit 3, then set it to the value of bit.
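That transformed form can also be written directly in Rust (a sketch; `store_bit3` is my name for it, not from the thread):

```rust
// Branchless store of a boolean into bit 3: unconditionally clear the
// bit, then OR in the boolean shifted into position.
pub fn store_bit3(x: u8, bit: bool) -> u8 {
    (x & !(1u8 << 3)) | ((bit as u8) << 3)
}
```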
If you don't even have pipelining, there's probably no point in avoiding branching. The branch-free code uses more instructions in order to avoid the branch, so it's only worth it on a chip where the expected penalty of the branch is higher than the cost of those extra instructions.
BTW, you probably meant Superscalar processor - Wikipedia or Speculative execution - Wikipedia -- pipelining is a very old technique (by processor standards) that requires minimal extra die space, so tends to be found even on small chips because it's such a huge performance win for very modest cost.
By replacing or with addition, you can save an additional instruction (since lea can compute x + 8*y). I'm a little disappointed that LLVM doesn't recognize this.
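For illustration, the addition-based variant would look like this (a sketch; `store_bit3_add` is a hypothetical name). Because the & has already cleared bit 3 of the left operand, the addition can never carry, so it is equivalent to the OR:

```rust
// Same as the OR version, but using +: after masking, bit 3 of the left
// operand is zero, so x + (bit << 3) equals x | (bit << 3), and the
// add-with-shift could in principle fold into a single x86 `lea`.
pub fn store_bit3_add(x: u8, bit: bool) -> u8 {
    (x & !(1u8 << 3)) + ((bit as u8) << 3)
}
```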
Sounds like a great bug to report! They're on github now, so it's much easier than it used to be: https://github.com/llvm/llvm-project/issues/new.
(LLVM will probably turn the addition into a bit_or because it sees the bits don't overlap, and canonicalization wants the simplest form in the IR -- it's easier for other optimizations to deal with bit-parallel things instead of worrying about carries. But I agree this sounds like a great thing for the LLVM assembly generator to do.)
Note that executing fewer instructions isn't necessarily always a benefit. Different instructions have different data dependencies and latencies, so the specific choice of instruction can have an outsized impact on processor time. (On the order of cycles, of course.) The processor manual may not even provide a fully accurate picture for modern chips; there are instructions that carry "false" data dependencies due to chip design, which can cause formally unnecessary pipeline stalls, and documentation of these is often very sparse.
In general, fewer instructions is better, and using lea is generally a better choice than other instructions with the same effect, but just be aware that exceptions exist at every corner. x86 and even ARM are complicated beasts when it comes to instruction-level optimization; we're long past the days when processor performance was trivially predictable from the assembly.
While true (and why I added some weasel words), most of the time the compiler is pretty good at figuring out whether or not a branch is needed for performance, as seen in the Godbolt links. There are always exceptions, of course!
But if you're looking up "how do I branchless" you'll find a lot of deliberately slower code, so it's still relevant.