Optimizing division by a constant at runtime

newpavlov · December 25, 2024, 4:28pm

It's not a question about Rust per se, but I hope someone will be able to help me.

Division instructions are relatively slow, so compilers commonly replace integer division by a constant with multiplication and shifts as can be seen here, i.e. it replaces val % 13111 with val - 1311 * (val * 1341788273) / 2^44. It works because 2^44/13111 is equal to 1341788272.78.. and both expressions produce the same result accounting for rounding down integer division. But it's not that trivial, since the alternative expression should produce the same result for all possible vals, e.g. division by 11111 looks like this.

In my code I have roughly this situation:

struct Buf {
    // Length of `buf` does not change and it's smaller than 2^32
    buf: Vec<Entry>,
}

fn scan_buf(buf: &Buf, mut pos: u64) {
    loop {
        let idx = (pos as usize) % buf.buf.len();
        if process_entry(&buf.buf[idx]) {
            break;
        }
        pos += 1;
    }
}

I would like to replace the division with a sequence of multiplications and shifts similarly to what compiler does for constants. Since the buf field does not change after struct initialization and I loop a lot over it, it's worth to pay additional runtime cost during Buf initialization to improve performance of the hot loop.

Maybe there is a Rust crate which implements this optimization? If not, could you point me toward a full description of the optimization which accounts for all potential corner cases so I cuold implement it myself?

undefined.behavior · December 25, 2024, 4:44pm

https://crates.io/crates/fastdivide

StefanSalewski · December 25, 2024, 6:36pm

Note, that in your scan_buf() function you can fully avoid the modulo division, and use a plain comparison instead. Because, whenever you increase pos by one , idx increases as well, until it wraps around. So you might set idx outside of the loop to the correct start value, then increase it in the loop body by one, and check if it is already equal buf.buf.len, in which case you set it back to zero. Most AIs will give you the full code. (The "if" should compile to a cmov assembly instruction, which is fast.)

RustyYato · December 25, 2024, 6:44pm

There is also crates.io: Rust Package Registry which has a more complete api. Including all integer types and remainders

user16251 · December 25, 2024, 7:20pm

There are a few useful references for variants of this optimization:

Granlund and Montgomery; there is also a follow-on paper
Arch Robison's patented method
Hacker's Delight by Henry Warren.

newpavlov · December 25, 2024, 7:50pm

In the real code I have several threads and use fetch_add(1), I did not include it to simplify the example. I could do something similar, but it would introduce branches to the hot loop. The optimized division also has branches, but they can be moved outside of the loop.

user16251 · December 25, 2024, 9:45pm

Also, it's worth noting that the crates typically can't make any assumptions on the bounds of the input, but if you have a tighter upper bounds on the numerator of an invariant division then you can typically avoid the 65-bit or 33-bit operations that are sometimes necessary. But it's really unusual for these microoptimizations to make a difference in runtime.

Topic		Replies	Views
High performance operations	6	4402	August 22, 2018
How can i speed up the div?	23	876	January 6, 2023
A performance problem with u64 mod [Solved]	6	1714	August 29, 2016
Efficient implementation of a * b % n for 64-bit values help	1	778	October 15, 2018
C++ vs Rust ( This is not a fight area :) ) help	19	1073	December 8, 2025

Optimizing division by a constant at runtime

Related topics