Efficient implementation of a * b % n for 64-bit values

I'm trying to implement a function (a * b) % n for 64-bit values. x64 architecture has support for this computation, it's basically two instructions: mul and div and some register shuffling. I expected u128 data type to support it with proper optimizations. Rust compiler utilizes mul for 64*64 -> 128-bit multiplication, but it calls some function for division instead of using a single div instruction (I guess this function does 128-bit%128bit->128bit, but I really only need 128-bit%64-bit->64-bit). Is it possible to rewrite this code better? I understand that division is a very slow instruction and single function call won't hurt (if it uses div inside, of course), but may be I could write it even better anyway. Compiler could be smart enough to understand that n128 is not really 128-bit value and I casted it only to satisfy type checker, so it doesn't have to perform a real 128-bit division. I'm using https://rust.godbolt.org/ with -O flag to check assembly output.

pub fn mulmod(a: u64, b: u64, n: u64) -> u64 {
    let a128 = a as u128;
    let b128 = b as u128;
    let n128 = n as u128;
    let c128 = a128 * b128 % n128;
    return c128 as u64;

Here's generated assembly for reference:

  pushq %rax
  testq %rdx, %rdx
  je .LBB0_2
  movq %rdx, %r8
  movq %rsi, %rax
  mulq %rdi
  xorl %ecx, %ecx
  movq %rax, %rdi
  movq %rdx, %rsi
  movq %r8, %rdx
  callq __umodti3@PLT
  popq %rcx
  leaq .Lpanic_loc.2(%rip), %rdi
  callq core::panicking::panic@PLT

and I'm talking about __umodti3 call.

The LLVM being generated is correct:

  %1 = zext i64 %n to i128
  %2 = zext i64 %a to i128
  %3 = zext i64 %b to i128
  %4 = mul nuw i128 %3, %2
  %5 = urem i128 %4, %1
  %6 = trunc i128 %5 to i64
  ret i64 %6

So if this is a bug, it's not a rustc bug.

Looking into the assembly more, though, I think this behaviour is actually correct. Per https://www.felixcloutier.com/x86/DIV.html, it looks like the (u128,u64)->(u64,u64) instruction requires that the quotient fit in a u64, and if you were to call mulmod(MAX, MAX, 1), that wouldn't be the case.