It looks like __builtin_ia32_addcarryx_u64 returns the carry of the addition of two u64 numbers as a u8. While I could implement this in Rust, it would be nice to have the fast version of it.

Some ISAs support this (e.g., x86 and its semi-clones), some do not (e.g., RISC-V). The underlying mathematical operation was trivial to realize in the sequential, unpipelined ALUs of the early 1960s, where it was used to realize multi-precision add/subtract on implementations with very narrow (4-bit, 8-bit, etc.) adders.

However, materializing this operation in modern superscalar ALU implementations has a significant gate/energy/delay/die-size cost.

Addendum: It's not the lowest-bit carry-in circuit that is the problem; it's that the instruction presumes there is a Carry flag that recorded the carry-out of a prior instruction, which is then used as the carry-in to the following add-with-carry instruction. The extra logic, delay, pipeline interlocks, etc. required to make the output of the first instruction available immediately as input to the following instruction complicate the implementation considerably. That's why RISC-V has no such flags, which keeps the minimum number of gate delays required to realize an Add instruction quite small.
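To make the carry chain concrete, here is a minimal sketch of a multi-precision add in plain Rust, with the carry passed along as an ordinary value instead of a flag register. The names adc and add_limbs and the 4-limb width are my own choices for illustration, not anything from a library; only u64::overflowing_add is a real std method.

```rust
/// Hypothetical helper: add with carry-in, returning (sum, carry_out).
/// At most one of the two partial additions can overflow, so OR-ing
/// the two overflow bits gives the carry-out.
fn adc(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry_in as u64);
    (s2, c1 | c2)
}

/// Hypothetical helper: add two 256-bit numbers stored as four
/// little-endian u64 limbs. The carry-out of each limb feeds the
/// next limb, which is the dependency chain described above.
fn add_limbs(a: &[u64; 4], b: &[u64; 4]) -> ([u64; 4], bool) {
    let mut out = [0u64; 4];
    let mut carry = false;
    for i in 0..4 {
        let (sum, c) = adc(a[i], b[i], carry);
        out[i] = sum;
        carry = c;
    }
    (out, carry)
}
```

Each limb cannot start until the previous limb's carry is known. Whether a given backend turns this into a hardware add-with-carry chain depends on the compiler and target, so treat it as a statement of the semantics rather than a performance claim.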

This will definitely be the right answer eventually. That'll (eventually) give iadd_carry in Cranelift, for example, and similarly the appropriate chain of things in LLVM or whatever.

You can copy its implementation for now, or just take the easy route and do the work in a wider type (e.g., u128 for u64 operands).
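For the wider-type route, something like the following should do. This is a rough sketch, not the implementation referred to above: adc_wide is a name I made up, and it assumes the carry-in is always 0 or 1.

```rust
/// Hypothetical helper: add two u64 values plus a carry-in by doing
/// the arithmetic in u128, then splitting the result.
/// Assumes carry_in is 0 or 1, so the u128 sum cannot overflow.
fn adc_wide(a: u64, b: u64, carry_in: u8) -> (u64, u8) {
    let wide = a as u128 + b as u128 + carry_in as u128;
    // Low 64 bits are the sum; bit 64 is the carry-out.
    (wide as u64, (wide >> 64) as u8)
}
```

The largest possible value here is 2 * (2^64 - 1) + 1 = 2^65 - 1, so the carry-out is always 0 or 1 and fits the u8 convention from the original question.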