Bit twiddling hack

Let x:u64 be viewed as a vector of length 64, where each elem is a bit 0 or 1.

Let k be the number of 1 bits in x.

Let the indices of the 1's be i_0, i_1, ... i_{k-1}.

What is the fastest way, given (c, x) to compute i_c ? Notation here a bit bad, as i is depend on x.

Basically I want to be able to rapidly compute: where does the 1st 1 appear? Where does the 2nd 1 appear? Where does the 3rd 1 appear? ...

For the first and last you could use leading_zeros and trailing_zeros and do some math to get the actual index. To calculate e.g. the second you could mask off the first 1, and then the second would become the first (which you can efficiently get the index of) and so on.

1 Like

Do you know how many ops this is? I'm looking for something similar to Bit Twiddling Hacks , but instead of counting bits, calculates the index.

Regarding the linked loop, that loop would work if you stop when reaching the count you want. The iteration counter of the loop is your index.

A solution using leading_zeros or trailing_zeros would let you skip several zeros in one iteration (and these are usually single assembly instructions, so they are cheap). Still the running time depends on which bit you want; the 10th bit takes 10 times as many ops as the first bit.

The linked "hack" might be suboptimal. The count_ones function is actually implemented by a single instruction on some architectures (x86). Same with leading_zeros and trailing_zeros.

The operation of getting the index of the first bit set is just a single instruction (at least on x64), so that's quite fast. However this only works if there is a bit set (i.e. the number is not 0). If it's 0 then the result of the instruction is undefined, so those two methods need to check for that case too, which bloats the assembly a bit. This could however be optimized out if the optimizer sees it is an impossible case, but this depends on your code.

Using the count_ones method, it is possible to determine if a given answer is too small or large. This makes it possible to use binary search:

fn find_pos(x: u64, c: u32) -> u64 {
    let mut pos = 0;
    
    let mut pos_bit = 1 << 5;
    while pos_bit != 0 {
        let pos_with_bit = pos | pos_bit;
        let mask = (1 << (pos_with_bit - 1)) - 1;
        let count = (x & mask).count_ones();
        if count <= c {
            pos = pos_with_bit;
        }
        
        pos_bit = pos_bit >> 1;
    }
    pos - 1
}

The loop will iterate six times so you can unroll it, and the if count <= c can probably be made branchless.

5 Likes

The bsf and bsr have bad behavior when the input is zero, but tzcnt and lzcnt return the width of the type in bits.

Can you provide more context? For problems like this there is no one "fastest" solution. It really depends on how many problems you have to do at once and how the output is formatted. Are you willing to put a big table in L1? Can you use vector instructions? Is just using trailz in a loop fast enough for your application?

Note that this is avoidable by working in NonZeroU64 instead, as that tells LLVM it doesn't need to worry about those cases: https://rust.godbolt.org/z/8Wbh9hdxr

1 Like

Alternatively you may decide that you don't want to worry about 10+ years old CPUs (Haswell or Piledrier and everything after that support lzcnt/tzcnt and LLVM knows how to use them).

1 Like

Rereading your original post you might want to use pdep and pext. If you deposit 1 << c in the mask x it will go to the position of the cth set bit in x which you can then read with tzcnt. If you want to decode bit sets you may find Daniel Lemire's series of articles interesting; here's one post. Of course you need a CPU with pdep or you can implement it by hand with the methods Hank Warren writes about in Hacker's Delight (buy this book!).

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.