However, BitSet::iter() does not provide an O(1) implementation of nth(), so this is somewhat slow. I have evidence from profiling that this operation is a bottleneck for my program. My bitset almost always fits into a single usize (around 98% of the time, to be specific). I'm using the fastrand crate as the random number generator, and it looks like number generation is not the bottleneck here.
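Roughly, the operation in question looks like this (a simplified sketch using bit-set and fastrand; the real code differs in the details):

```rust
use bit_set::BitSet;

/// Pick a uniformly random set bit by walking the iterator up to a random rank.
/// `nth` has to scan every set bit before the target, so this is O(n) per draw.
fn random_set_bit_naive(set: &BitSet, rng: &mut fastrand::Rng) -> Option<usize> {
    let ones = set.len(); // number of set bits
    if ones == 0 {
        return None;
    }
    set.iter().nth(rng.usize(..ones))
}
```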
I'm wondering if there's a crate or approach out there that I could use for a faster implementation of this. I'm targeting x86_64 and aarch64, and I'm particularly interested in crates that use tools like the BMI2 instructions on newer Intel chips to implement this kind of operation very efficiently, or that use something like the __popcnt intrinsic to narrow down which byte to search.
But how big a fraction of the length is that? I.e., if it's commonly, say, more than 1/4 full, you can just generate random indices and do rejection sampling, with $O(1)$ expected cost.
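Something like this (untested sketch; `universe` is the upper bound you draw indices from, i.e. the largest possible index plus one):

```rust
use bit_set::BitSet;

/// Rejection sampling: draw indices uniformly from [0, universe) and keep the
/// first one whose bit is set. The expected number of draws is
/// universe / set.len(), so this is effectively O(1) when the set is dense.
fn random_set_bit_rejection(
    set: &BitSet,
    universe: usize,
    rng: &mut fastrand::Rng,
) -> Option<usize> {
    if set.is_empty() {
        return None;
    }
    loop {
        let i = rng.usize(..universe);
        if set.contains(i) {
            return Some(i);
        }
    }
}
```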
On the other hand, if it's typically very sparse, then you could indeed go through it block by block (using the underlying BitVec API, or maybe the intersection() method if it has a fast implementation), skip the blocks that equal 0, and weight the rest by popcount (i.e. count_ones() in Rust).
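An untested sketch of that sparse path, written over raw 64-bit words since how you get at the backing storage depends on the crate:

```rust
/// Pick a uniformly random set bit from a bitset given as raw 64-bit words.
/// Empty words are skipped cheaply; non-empty words are weighted by their
/// popcount, so the result is uniform over all set bits.
fn random_set_bit_sparse(words: &[u64], rng: &mut fastrand::Rng) -> Option<usize> {
    let total: u32 = words.iter().map(|w| w.count_ones()).sum();
    if total == 0 {
        return None;
    }
    // Choose a target rank among all set bits, then scan for the word holding it.
    let mut rank = rng.u32(..total);
    for (i, &word) in words.iter().enumerate() {
        let ones = word.count_ones();
        if rank < ones {
            // Select the rank-th set bit inside this word.
            let mut w = word;
            for _ in 0..rank {
                w &= w - 1; // clear the lowest set bit
            }
            return Some(i * 64 + w.trailing_zeros() as usize);
        }
        rank -= ones;
    }
    unreachable!("rank was drawn below the total popcount")
}
```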
The expectation is that you will find an element within two iterations most of the time. Unrolling three checks should almost entirely eliminate the (small) remaining overhead.
Do the bit count with the branchless parallel (SWAR) method, keeping the intermediate results of the log2(64) = 6 pairwise accumulation steps. Then pick the random rank and do a branchless binary search over those counts in another 6 steps. This is explained on this website.
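Here is a portable, untested sketch for a single 64-bit word; the constants are the usual SWAR popcount masks, `r` is the 0-based rank of the bit you want, and keeping it below count_ones() is the caller's responsibility. The comparisons typically compile to setcc/cmov rather than branches:

```rust
/// Index of the r-th set bit of `v` (r counted from 0, bits from the LSB).
/// Requires r < v.count_ones(). No branches, no lookup tables.
fn select_bit_u64(v: u64, r: u32) -> u32 {
    // One level of the binary search: if the `width`-bit window at `pos`
    // holds at most `r` set bits, skip past it and lower the rank.
    #[inline]
    fn step(counts: u64, width: u32, mask: u64, r: &mut u64, pos: &mut u32) {
        let low = (counts >> *pos) & mask; // set bits in [pos, pos + width)
        let go = (*r >= low) as u64;       // 1 if the target lies above the window
        *r -= go * low;
        *pos += go as u32 * width;
    }

    // Parallel (SWAR) popcount, keeping every intermediate level:
    // `a` holds 2-bit counts, `b` 4-bit, `c` 8-bit, `d` 16-bit, `e` 32-bit.
    let a = v - ((v >> 1) & 0x5555_5555_5555_5555);
    let b = (a & 0x3333_3333_3333_3333) + ((a >> 2) & 0x3333_3333_3333_3333);
    let c = (b + (b >> 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    let d = (c + (c >> 8)) & 0x00FF_00FF_00FF_00FF;
    let e = (d + (d >> 16)) & 0x0000_FFFF_0000_FFFF;

    // Branchless binary search over the stored counts.
    let mut r = r as u64;
    let mut pos = 0u32;
    step(e, 32, 0xFFFF_FFFF, &mut r, &mut pos);
    step(d, 16, 0xFFFF, &mut r, &mut pos);
    step(c, 8, 0xFF, &mut r, &mut pos);
    step(b, 4, 0xF, &mut r, &mut pos);
    step(a, 2, 0x3, &mut r, &mut pos);
    step(v, 1, 0x1, &mut r, &mut pos);
    pos
}
```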
With BMI2 instructions on x86-64, the nth set bit can be selected using the pdep instruction: deposit the single bit 1 << n into the set-bit positions of the word (_pdep_u64(1 << n, bitset)), and then trailing_zeros() gives its index.
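For example (untested; the fallback just clears the lowest set bit n times, and the runtime check keeps the same binary working on aarch64 and on pre-BMI2 x86-64 parts):

```rust
/// Select the index of the n-th set bit (0-based) of `bitset` using PDEP.
/// Safety: the caller must have verified that the CPU supports BMI2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn select_bit_pdep(bitset: u64, n: u32) -> u32 {
    use core::arch::x86_64::_pdep_u64;
    // Deposit the single bit 1 << n into the positions of the set bits of
    // `bitset`; exactly one bit survives, at the n-th set position.
    _pdep_u64(1u64 << n, bitset).trailing_zeros()
}

/// Dispatching wrapper with a portable fallback.
fn select_bit(bitset: u64, n: u32) -> u32 {
    debug_assert!(n < bitset.count_ones());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi2") {
            return unsafe { select_bit_pdep(bitset, n) };
        }
    }
    // Fallback: clear the lowest set bit n times, then take the next one.
    let mut w = bitset;
    for _ in 0..n {
        w &= w - 1;
    }
    w.trailing_zeros()
}
```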