For the first and last you could use trailing_zeros and leading_zeros and do some math to get the actual index (for a u64, the index of the highest set bit is 63 - leading_zeros). To calculate e.g. the second you could mask off the first 1 (x & (x - 1) clears the lowest set bit), and then the second would become the first (which you can efficiently get the index of) and so on.
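A minimal sketch of that idea, assuming a u64 input and counting set bits from the least significant end with n = 0 meaning the lowest one. The function names here are made up for illustration:

```rust
/// Index of the lowest set bit, or None if `x` is 0.
fn first_set_bit(x: u64) -> Option<u32> {
    if x == 0 { None } else { Some(x.trailing_zeros()) }
}

/// Index of the highest set bit, or None if `x` is 0.
fn last_set_bit(x: u64) -> Option<u32> {
    if x == 0 { None } else { Some(63 - x.leading_zeros()) }
}

/// Index of the `n`th set bit (0-based, counted from the least
/// significant bit), or None if `x` has fewer than `n + 1` set bits.
fn nth_set_bit(mut x: u64, n: u32) -> Option<u32> {
    for _ in 0..n {
        if x == 0 {
            return None;
        }
        // Clear the lowest set bit, so the next one becomes the "first".
        x &= x - 1;
    }
    first_set_bit(x)
}
```

This does one cheap step per set bit skipped, rather than one step per bit position.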
Regarding the linked loop, that loop would work if you stop when reaching the count you want. The iteration counter of the loop is your index.
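I don't have the linked loop in front of me, so this is only a guess at its shape: a plain bit-by-bit scan that stops once it has seen enough set bits. The name nth_set_bit_naive is made up, and n is assumed to be 0-based:

```rust
/// Bit-by-bit scan: returns the index of the `n`th set bit (0-based,
/// counted from the least significant bit), or None if there are
/// fewer than `n + 1` set bits.
fn nth_set_bit_naive(x: u64, n: u32) -> Option<u32> {
    let mut seen = 0;
    for i in 0..64 {
        if (x >> i) & 1 == 1 {
            if seen == n {
                // The loop counter is the index we want.
                return Some(i);
            }
            seen += 1;
        }
    }
    None
}
```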
A solution using leading_zeros or trailing_zeros would let you skip several zeros in one iteration (and these are usually single assembly instructions, so they are cheap). Still the running time depends on which bit you want; the 10th bit takes 10 times as many ops as the first bit.
The linked "hack" might be suboptimal. The count_ones function is actually implemented by a single instruction on some architectures (x86). Same with leading_zeros and trailing_zeros.
The operation of getting the index of the first bit set is just a single instruction (at least on x64), so that's quite fast. However this only works if there is a bit set (i.e. the number is not 0). If it's 0 then the result of the instruction is undefined, while leading_zeros and trailing_zeros are defined to return the bit width for 0, so both methods need to check for that case too, which bloats the assembly a bit. This could however be optimized out if the optimizer sees it is an impossible case, but this depends on your code.
Can you provide more context? For problems like this there is no one "fastest" solution. It really depends on how many problems you have to do at once and how the output is formatted. Are you willing to put a big table in L1 cache? Can you use vector instructions? Is just using trailing_zeros in a loop fast enough for your application?
Note that this is avoidable by working in NonZeroU64 instead, as that tells LLVM it doesn't need to worry about those cases: https://rust.godbolt.org/z/8Wbh9hdxr
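Roughly what that looks like, as a sketch rather than the exact code behind the godbolt link. The helper names are made up; NonZeroU64::trailing_zeros and NonZeroU64::leading_zeros are the standard library methods:

```rust
use std::num::NonZeroU64;

/// Index of the lowest set bit. The type guarantees the input is not 0,
/// so no zero check is needed here.
fn lowest_set_bit(x: NonZeroU64) -> u32 {
    x.trailing_zeros()
}

/// Index of the highest set bit.
fn highest_set_bit(x: NonZeroU64) -> u32 {
    63 - x.leading_zeros()
}
```

The zero check moves to wherever you construct the value: NonZeroU64::new returns None for 0, so you handle that case once up front.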
Alternatively you may decide that you don't want to worry about 10+ year old CPUs (Haswell or Piledriver and everything after that support lzcnt/tzcnt, and LLVM knows how to use them).
Rereading your original post you might want to use pdep and pext. If you deposit 1 << c in the mask x it will go to the position of the c-th set bit in x (counting from 0 at the least significant end), which you can then read with tzcnt. If you want to decode bit sets you may find Daniel Lemire's series of articles interesting; here's one post. Of course you need a CPU with pdep, or you can implement it by hand with the methods Hank Warren writes about in Hacker's Delight (buy this book!).
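Here is a sketch of the pdep trick with a simple software stand-in, so it runs on any target. pdep_soft is a naive loop I wrote for illustration, not one of the Hacker's Delight methods. On x86_64 with BMI2 you would swap it for the core::arch::x86_64::_pdep_u64 intrinsic. The function names are made up:

```rust
/// Software stand-in for PDEP: takes the low bits of `src` and deposits
/// them, in order, at the positions of the set bits of `mask`.
fn pdep_soft(src: u64, mut mask: u64) -> u64 {
    let mut result = 0u64;
    let mut bit = 1u64; // selects the next source bit to deposit
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if src & bit != 0 {
            result |= lowest;
        }
        mask &= mask - 1; // clear that mask bit
        bit <<= 1;
    }
    result
}

/// Index of the `c`th set bit of `x` (0-based, counted from the least
/// significant bit), or None if `x` has fewer than `c + 1` set bits.
fn nth_set_bit_pdep(x: u64, c: u32) -> Option<u32> {
    if c >= x.count_ones() {
        return None;
    }
    // 1 << c lands on the c-th set bit of x; trailing_zeros reads its index.
    Some(pdep_soft(1u64 << c, x).trailing_zeros())
}
```

With the hardware instruction the whole lookup is a shift, a pdep and a tzcnt, so the cost no longer depends on which bit you ask for.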