How to generate the `DIV` assembly instruction?

I want to do exactly what the x86-64 DIV instruction does: calculate (a / b, a % b) where a is u128, b is u64 and I know that the quotient a / b fits in u64:

pub fn div_mod(a: u128, b: u64) -> (u64, u64) {
    let b = u128::from(b);
    assert!(b != 0 && a < b << 64);
    ((a / b) as u64, (a % b) as u64)
}

So this could generate the DIV instruction that calculates both, but it doesn't : it instead calls into a software u128 by u128 division function (__udivti3) to calculate q = a / b and then calculates the modulo by a - q * b (compiler explorer), which is much more work.

  1. Is there a reason this doesn't optimize to DIV?
  2. Even if it doesn't realize it can do DIV, shouldn't there be a software u128 / u64 function which, while still more work than DIV (because the quotient is u128 rather than u64), is a lot simpler than calculating u128 / u128?
  3. Is there an intrinsic I can use to call DIV?
4 Likes

Googling I find a few references to Intel u64 div being awful, for example:

On Intel CPUs, div/idiv is microcoded as many uops. About the same number of uops for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much much slower. (Skylake div r64 is 36 uops, Skylake idiv r64 is 57 uops). See Agner Fog's instruction tables: Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X

div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.

So it might be worth playing with CPU tuning parameters eg setting -C target-cpu=native, to see if it notices you have a better performing DIV?

It seems unlikely it's using the software implementation for performance: if I simply do u64 / u64, it does use DIV:

pub fn div_mod(a: u64, b: u64) -> (u64, u64) {
    assert!(b != 0);
    (a / b, a % b)
}

(compiler explorer)

Side note: it does an interesting optimization. It first checks at runtime whether a and b both fit in u32, and in that case it switches to a 32-bit version of DIV.

Why do you think so? According to the div docs, it can perform 64-bit by 64-bit division, but it says nothing about 128-bit by 64-bit division support.

The last variant is 128-bit by 64-bit:

Unsigned divide RDX:RAX by r/m64, with result stored in RAX := Quotient, RDX := Remainder.

Also it says it in table 3-15:

Doublequadword/quadword

The different variants of DIV do: u16 / u8, u32 / u16, u64 / u32 and u128 / u64.

2 Likes

Another bit that seems suboptimal:

__udivti3 (software u128 / u128) calls into __udivmodti4 which computes both a / b and a % b, and ignores the a % b. Then the compiled code of my div_mod function recomputes a % b again using a - q * b.

LLVM's middle-end leaves the udiv i128 there:

define { i64, i64 } @_ZN7example9div_mod_u17hbcb810e1e6670fc5E(i128 noundef %a, i64 noundef %b) unnamed_addr #0 !dbg !7 {
  %b1 = zext i64 %b to i128, !dbg !12
  %_5 = shl nuw i128 %b1, 64
  %_4 = icmp ugt i128 %_5, %a
  br i1 %_4, label %bb5, label %bb4, !dbg !20, !prof !22

bb4:                                              ; preds = %start
  tail call void @_ZN4core9panicking5panic17h941160ead03e2d54E(ptr noalias noundef nonnull readonly align 1 @alloc_1851edc8e5d62106900e48e85e19edb3, i64 noundef 39, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) @alloc_bb351a9b0beb7ffcca1a0e7ab52c5b45) #2, !dbg !23
  unreachable, !dbg !23

bb5:                                              ; preds = %start
  %_8 = udiv i128 %a, %b1, !dbg !24
  %_7 = trunc i128 %_8 to i64, !dbg !24
  %0 = mul i128 %_8, %b1, !dbg !25
  %_11.decomposed = sub i128 %a, %0, !dbg !25
  %_10 = trunc nuw i128 %_11.decomposed to i64, !dbg !25
  %1 = insertvalue { i64, i64 } poison, i64 %_7, 0, !dbg !26
  %2 = insertvalue { i64, i64 } %1, i64 %_10, 1, !dbg !26
  ret { i64, i64 } %2, !dbg !26
}

So I think if you want the asm to be something different, your best bet is to file a bug on LLVM's x86 codegen.

(There's probably not an intrinsic, because LLVM represents things like full-width multiplication using mul, so even if there was an intrinsic for this it'd probably just emit a div internally.)

2 Likes

Would inline assembly work for you?