I want to do exactly what the x86-64 DIV instruction does: calculate (a / b, a % b), where a is u128, b is u64, and I know that the quotient a / b fits in u64:
pub fn div_mod(a: u128, b: u64) -> (u64, u64) {
    let b = u128::from(b);
    // Preconditions: divisor is nonzero and the quotient fits in u64
    // (a < b << 64 is equivalent to (a >> 64) < b).
    assert!(b != 0 && a < b << 64);
    ((a / b) as u64, (a % b) as u64)
}
So this could compile to a single DIV instruction that computes both, but it doesn't: instead it calls a software u128-by-u128 division routine (__udivti3) to compute q = a / b and then derives the remainder as a - q * b (see Compiler Explorer), which is much more work.
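For reference, here is roughly what I'm hoping for, written as inline asm (a sketch; the function name and precondition formulation are mine). div r64 divides the 128-bit value in RDX:RAX by a 64-bit operand, leaving the quotient in RAX and the remainder in RDX, and it raises #DE if the quotient doesn't fit in 64 bits, so the precondition is required for correctness, not just as an optimization hint:

use std::arch::asm;

#[cfg(target_arch = "x86_64")]
pub fn div_mod_single_div(a: u128, b: u64) -> (u64, u64) {
    let hi = (a >> 64) as u64;
    // Quotient fits in u64 iff hi < b (which also rules out b == 0);
    // without this check, `div` would fault with #DE.
    assert!(hi < b);
    let q: u64;
    let r: u64;
    unsafe {
        asm!(
            "div {d}",
            d = in(reg) b,
            inout("rax") a as u64 => q, // low half of dividend in, quotient out
            inout("rdx") hi => r,       // high half of dividend in, remainder out
            options(nomem, nostack),
        );
    }
    (q, r)
}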
Is there a reason this doesn't optimize to DIV?
Even if it doesn't realize it can do DIV, shouldn't there be a software u128 / u64 function which, while still more work than DIV (because the quotient is u128 rather than u64), is a lot simpler than calculating u128 / u128?
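To illustrate: a full u128 / u64 division, where the quotient may need all 128 bits, decomposes into exactly two of these "quotient fits in u64" steps, i.e. two DIVs. A sketch of the decomposition (the inner step still leans on Rust's u128 arithmetic here; a hardware DIV, or the inline-asm version above, would replace it):

fn div_rem_u128_u64(a: u128, b: u64) -> (u128, u64) {
    assert!(b != 0);
    let (hi, lo) = ((a >> 64) as u64, a as u64);
    // Step 1: divide the high half; the dividend is < 2^64, so the
    // quotient trivially fits in u64.
    let (q_hi, r_hi) = (hi / b, hi % b);
    // Step 2: divide (r_hi:lo); since r_hi < b, this quotient fits in
    // u64 too -- exactly the case a single DIV can handle.
    let rest = ((r_hi as u128) << 64) | lo as u128;
    let (q_lo, r) = ((rest / b as u128) as u64, (rest % b as u128) as u64);
    (((q_hi as u128) << 64) | q_lo as u128, r)
}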
Googling, I found a few references to Intel's u64 div being slow; for example:
On Intel CPUs, div/idiv is microcoded as many uops. About the same number of uops for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much much slower. (Skylake div r64 is 36 uops, Skylake idiv r64 is 57 uops). See Agner Fog's instruction tables: Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X
div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.
So it might be worth playing with CPU tuning parameters, e.g. setting -C target-cpu=native, to see if the compiler notices you have a better-performing DIV.
Side note: the generated code does an interesting optimization. It first checks at runtime whether a and b both fit in u32, and in that case it switches to the 32-bit version of DIV.
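Written back as Rust for the analogous u64 / u64 case (my reconstruction, not the actual emitted code), the branch looks roughly like this:

fn div_u64(a: u64, b: u64) -> u64 {
    // If neither operand has any of its high 32 bits set, the much
    // cheaper 32-bit DIV suffices.
    if (a | b) >> 32 == 0 {
        ((a as u32) / (b as u32)) as u64 // 32-bit DIV path
    } else {
        a / b // full 64-bit DIV path
    }
}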
Why do you think so? According to the DIV docs it can perform 64-bit by 64-bit division, but I don't see anything there about 128-bit by 64-bit division.
__udivti3 (software u128 / u128) calls into __udivmodti4, which computes both a / b and a % b, then throws the a % b away. The compiled code of my div_mod function then recomputes a % b as a - q * b.
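In other words, the generated code behaves like this sketch, with __udivti3 shown as plain u128 division:

fn div_mod_as_compiled(a: u128, b: u64) -> (u64, u64) {
    let b = u128::from(b);
    // Calls __udivti3, which calls __udivmodti4, computing both the
    // quotient and the remainder -- and then discards the remainder.
    let q = a / b;
    // The remainder is then recomputed from scratch inline.
    let r = a - q * b;
    (q as u64, r as u64)
}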
So I think if you want the asm to be something different, your best bet is to file a bug on LLVM's x86 codegen.
(There's probably no intrinsic because LLVM represents things like full-width multiplication with ordinary mul on a wider type, so even if there were an intrinsic for this, it'd probably just emit a div internally.)
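For comparison, full-width 64 x 64 -> 128 multiplication works the same way: there's no intrinsic; the backend just pattern-matches the widening arithmetic. A minimal example:

fn mul_wide(a: u64, b: u64) -> u128 {
    // LLVM recognizes this widening multiply and emits a single MUL
    // on x86-64 (no __multi3 call).
    (a as u128) * (b as u128)
}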