Here's a lock-free allocator that hands out single bits from a large bit vector. The application is allocating slots for Vulkan bindless graphics descriptors.
What the compiler optimized this into is awesome. The inner loop in alloc_bit, which searches a 64-bit word for a zero bit, compiles down to a single instruction:
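The allocator code itself isn't shown here, but the shape is roughly this. A minimal Rust sketch, not the author's actual code: the names (BitAllocator, alloc_bit, free_bit) and the vector size are assumptions, and the claim loop uses a standard compare-exchange retry.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const WORDS: usize = 1024; // hypothetical capacity: 65536 bits

/// Lock-free single-bit allocator over an array of atomic words (sketch).
struct BitAllocator {
    words: Vec<AtomicU64>,
}

impl BitAllocator {
    fn new() -> Self {
        Self {
            words: (0..WORDS).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Find and claim a zero bit; returns its global index, or None if full.
    fn alloc_bit(&self) -> Option<usize> {
        for (i, word) in self.words.iter().enumerate() {
            // Plain load; aligned u64 loads are already atomic on x86-64.
            let mut current = word.load(Ordering::Relaxed);
            loop {
                // trailing_ones() == 64 exactly when the word is all ones,
                // so this one test also covers the "word full?" check.
                let bit = current.trailing_ones();
                if bit == 64 {
                    break; // word full, move on to the next one
                }
                let new = current | (1u64 << bit); // set the claimed bit
                match word.compare_exchange_weak(
                    current, new, Ordering::AcqRel, Ordering::Relaxed,
                ) {
                    Ok(_) => return Some(i * 64 + bit as usize),
                    Err(actual) => current = actual, // lost a race; retry
                }
            }
        }
        None
    }

    /// Release a previously allocated bit.
    fn free_bit(&self, index: usize) {
        let (i, bit) = (index / 64, index % 64);
        self.words[i].fetch_and(!(1u64 << bit), Ordering::Release);
    }
}

fn main() {
    let a = BitAllocator::new();
    assert_eq!(a.alloc_bit(), Some(0));
    assert_eq!(a.alloc_bit(), Some(1));
    a.free_bit(0);
    assert_eq!(a.alloc_bit(), Some(0)); // freed slot is reused first
    println!("ok");
}
```

The `current.trailing_ones()` call is the part the compiler turns into the single instruction discussed below: it lowers to a NOT followed by a trailing-zero count.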
rep bsf rbp, rcx
That's impressive. What the compiler seems to have done is:
- The code fetches a word with an atomic operation. Aligned u64 loads are already atomic on x86-64, so that takes no special code.
- The loop tests whether the word is all ones, then counts the low-order zero bits. The compiler recognizes that the count subsumes the all-ones test (the count is 64 exactly when the word is full), so it drops the first test.
- The count itself is the rep bsf above. Despite appearances, the rep prefix doesn't repeat anything here: rep bsf is the encoding of tzcnt (count trailing zeros, from the BMI1 extension), and on CPUs without BMI1 the prefix is simply ignored and it executes as plain bsf. So the loop over the array's words remains, but the bit search within each word is a single instruction.
- Now the found bit needs to be set. The source code left-shifts and ORs a 1 into the correct position, which the compiler optimizes into a single bit-set instruction (bts).
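The two scalar identities the compiler exploits can be checked directly. A short Rust illustration (the specific literal values are just examples):

```rust
fn main() {
    // Identity 1: a full word has no zero bit, and trailing_ones() is 64
    // exactly in that case, so no separate "== u64::MAX" test is needed.
    let full = u64::MAX;
    assert_eq!(full.trailing_ones(), 64);

    // Identity 2: for a non-full word, trailing_ones() is the index of the
    // lowest zero bit; this lowers to not + tzcnt (encoded as rep bsf).
    let word: u64 = 0b0111; // bits 0..=2 taken, bit 3 free
    let bit = word.trailing_ones();
    assert_eq!(bit, 3);

    // Shift-and-or claims the found bit; on x86 this can become one bts.
    assert_eq!(word | (1u64 << bit), 0b1111);
    println!("ok");
}
```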