Optimizing division by a constant at runtime

It's not a question about Rust per se, but I hope someone will be able to help me.

Division instructions are relatively slow, so compilers commonly replace integer division by a constant with multiplication and shifts as can be seen here, i.e. it replaces val % 13111 with val - 1311 * (val * 1341788273) / 2^44. It works because 2^44/13111 is equal to 1341788272.78.. and both expressions produce the same result accounting for rounding down integer division. But it's not that trivial, since the alternative expression should produce the same result for all possible vals, e.g. division by 11111 looks like this.

In my code I have roughly this situation:

struct Buf {
    // Length of `buf` does not change and it's smaller than 2^32
    buf: Vec<Entry>,
}

fn scan_buf(buf: &Buf, mut pos: u64) {
    loop {
        let idx = (pos as usize) % buf.buf.len();
        if process_entry(&buf.buf[idx]) {
            break;
        }
        pos += 1;
    }
}

I would like to replace the division with a sequence of multiplications and shifts similarly to what compiler does for constants. Since the buf field does not change after struct initialization and I loop a lot over it, it's worth to pay additional runtime cost during Buf initialization to improve performance of the hot loop.

Maybe there is a Rust crate which implements this optimization? If not, could you point me toward a full description of the optimization which accounts for all potential corner cases so I cuold implement it myself?

https://crates.io/crates/fastdivide

4 Likes

Note, that in your scan_buf() function you can fully avoid the modulo division, and use a plain comparison instead. Because, whenever you increase pos by one , idx increases as well, until it wraps around. So you might set idx outside of the loop to the correct start value, then increase it in the loop body by one, and check if it is already equal buf.buf.len, in which case you set it back to zero. Most AIs will give you the full code. (The "if" should compile to a cmov assembly instruction, which is fast.)

There is also crates.io: Rust Package Registry which has a more complete api. Including all integer types and remainders

1 Like

There are a few useful references for variants of this optimization:

1 Like

In the real code I have several threads and use fetch_add(1), I did not include it to simplify the example. I could do something similar, but it would introduce branches to the hot loop. The optimized division also has branches, but they can be moved outside of the loop.

Also, it's worth noting that the crates typically can't make any assumptions on the bounds of the input, but if you have a tighter upper bounds on the numerator of an invariant division then you can typically avoid the 65-bit or 33-bit operations that are sometimes necessary. But it's really unusual for these microoptimizations to make a difference in runtime.