Unexpected performance from array bound tests and more


#1

I’m implementing many solutions in Rust for Euler Problems, and I’m seeing many interesting phenomena.

All the code of this post is compiled with:

-C opt-level=3 -C prefer-dynamic -C target-cpu=native -Z orbit

(None of the code in this post was originally mine, I’ve converted it to Rust from other languages, and I’ve tried to improve it a little in various ways. Generally the resulting Rust code is hugely more clear).

This is a solution for Problem 154 ( https://projecteuler.net/problem=154 ):

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    for n in 1 .. MAX + 1 {
        f5[n as usize] = f5[n as usize - 1] + rep_div(n, 5);
        f2[n as usize] = f2[n as usize - 1] + rep_div(n, 2);
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;
    let mut answer = 0;

    for a in 0 .. MAX / 3 {
        let mut b = a;
        while a + 2 * b <= MAX {
            let c = MAX - a - b;
            let factors5 = f5[a as usize] + f5[b as usize] + f5[c as usize];
            let factors2 = f2[a as usize] + f2[b as usize] + f2[c as usize];

            if factors5 < fives && factors2 < twos {
                if a == c {
                    answer += 1;
                } else if a < b && b < c {
                    answer += 6;
                } else {
                    answer += 3;
                }
            }
            b += 1;
        }
    }

    answer
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

This is the same code, but the answer is accumulated with an intermediate step, before the final sum():

#![feature(iter_arith)]

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    for n in 1 .. MAX + 1 {
        f5[n as usize] = f5[n as usize - 1] + rep_div(n, 5);
        f2[n as usize] = f2[n as usize - 1] + rep_div(n, 2);
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;

    (0 .. MAX / 3)
    .map(|a| {
        let mut answer = 0;
        let mut b = a;
        while a + 2 * b <= MAX {
            let c = MAX - a - b;
            let factors5 = f5[a as usize] + f5[b as usize] + f5[c as usize];
            let factors2 = f2[a as usize] + f2[b as usize] + f2[c as usize];

            if factors5 < fives && factors2 < twos {
                if a == c {
                    answer += 1;
                } else if a < b && b < c {
                    answer += 6;
                } else {
                    answer += 3;
                }
            }
            b += 1;
        }
        answer
    })
    .sum()
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

For me it’s a surprise to see this second version is quite faster than the first (If you have multi-thread code, having a local accumulator is a significant advantage, but that’s single thread code). Do you have an idea of the causes?


This is a faster solution (it’s a jungle of casts, but if I use usize types to remove most of them, the code slows down):

fn e154() -> u32 {
    const L: u32 = 200_000;
    const POW10: u32 = 12;
    const L3: u32 = L / 3;

    let mut e2 = vec![0u32; L as usize + 1];

    let mut pow: u32 = 2;
    while pow <= L {
        let mut mult: u32 = pow;
        while mult <= L {
            e2[mult as usize] += 1;
            mult += pow;
        }
        pow *= 2;
    }

    let mut e5 = [0u32; L as usize + 1];

    pow = 5;
    while pow <= L {
        let mut mult = pow;
        while mult <= L {
            e5[mult as usize] += 1;
            mult += pow;
        }
        pow *= 5;
    }

    for i in 1u32 .. L + 1 {
        e2[i as usize] = e2[i as usize] + e2[i as usize - 1];
        e5[i as usize] = e5[i as usize] + e5[i as usize - 1];
    }

    let mut d5 = vec![0u32; L as usize + 1];

    for i in 0u32 .. L + 1 {
        d5[i as usize] = i - 4 * e5[i as usize];
    }

    // Find max val of d5.
    let mut maxd: u32 = 0;
    for i in 0u32 .. L + 1 {
        if d5[i as usize] > maxd {
            maxd = d5[i as usize];
        }
    }

    // nid[d] is the number of i for which d[i] = d.
    let mut nid = vec![0u32; maxd as usize + 1];

    for i in 0u32 .. L {
        nid[d5[i as usize] as usize] += 1;
    }

    // li[v] is the sorted list of i for which d5[i] = v.
    let mut li = (0 .. maxd as usize + 1)
                 .map(|i| vec![0u32; nid[i] as usize + 2])
                 .collect::<Vec<_>>();
    let mut starts = vec![0usize; maxd as usize + 1];

    for x in nid.iter_mut() { *x = 0; }

    for i in 0 .. L + 1 {
        let d5i = d5[i as usize] as usize;
        li[d5i][nid[d5i] as usize] = i;
        nid[d5i] += 1;
    }

    for i in 0 .. maxd + 1 {
        li[i as usize][nid[i as usize] as usize] = 1_000_000; // Sentinel.
    }

    let mut n_sols: u32 = 0;
    let mut l_minus_2i: u32 = L + 2;

    for i in 0 .. L3 + 1 {
        let dif: u32 = 4 * POW10 + d5[L as usize] - d5[i as usize];
        let mind: u32 = (dif + 1) / 2;
        l_minus_2i -= 2;
        for vald in mind .. maxd + 1 {
            let mut start = starts[vald as usize];
            while li[vald as usize][start] < i {
                start += 1;
            }
            starts[vald as usize] = start;

            for &j in li[vald as usize][start ..].iter() {
                if j > l_minus_2i {
                    break;
                }
                let dif_minus_vald: u32 = dif - vald;
                let k: u32 = L - i - j;

                if d5[k as usize] >= dif_minus_vald &&
                   e2[L as usize] - e2[i as usize] - e2[j as usize] - e2[k as usize] >= POW10 {
                    if j == k {
                        n_sols += 6;
                    } else {
                        if i == j || i == k {
                            if d5[k as usize] >= mind && vald >= dif_minus_vald {
                                n_sols += 3;
                            } else {
                                n_sols += 6;
                            }
                        } else {
                            if d5[k as usize] >= mind && vald >= dif_minus_vald {
                                n_sols += 6;
                            } else {
                                n_sols += 12;
                            }
                        }
                    }
                }
            }
        }
    }

    n_sols / 2
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

To experiment I’ve replaced the array accesses in the main part of the code with .get_unchecked()/.get_unchecked_mut():

fn e154() -> u32 {
    const L: u32 = 200_000;
    const POW10: u32 = 12;
    const L3: u32 = L / 3;

    let mut e2 = vec![0u32; L as usize + 1];

    let mut pow: u32 = 2;
    while pow <= L {
        let mut mult: u32 = pow;
        while mult <= L {
            e2[mult as usize] += 1;
            mult += pow;
        }
        pow *= 2;
    }

    let mut e5 = [0u32; L as usize + 1];

    pow = 5;
    while pow <= L {
        let mut mult = pow;
        while mult <= L {
            e5[mult as usize] += 1;
            mult += pow;
        }
        pow *= 5;
    }

    for i in 1u32 .. L + 1 {
        e2[i as usize] = e2[i as usize] + e2[i as usize - 1];
        e5[i as usize] = e5[i as usize] + e5[i as usize - 1];
    }

    let mut d5 = vec![0u32; L as usize + 1];

    for i in 0u32 .. L + 1 {
        d5[i as usize] = i - 4 * e5[i as usize];
    }

    // Find max val of d5.
    let mut maxd: u32 = 0;
    for i in 0u32 .. L + 1 {
        if d5[i as usize] > maxd {
            maxd = d5[i as usize];
        }
    }

    // nid[d] is the number of i for which d[i] = d.
    let mut nid = vec![0u32; maxd as usize + 1];

    for i in 0u32 .. L {
        nid[d5[i as usize] as usize] += 1;
    }

    // li[v] is the sorted list of i for which d5[i] = v.
    let mut li = (0 .. maxd as usize + 1)
                 .map(|i| vec![0u32; nid[i] as usize + 2])
                 .collect::<Vec<_>>();
    let mut starts = vec![0usize; maxd as usize + 1];

    for x in nid.iter_mut() { *x = 0; }

    for i in 0 .. L + 1 {
        let d5i = d5[i as usize] as usize;
        li[d5i][nid[d5i] as usize] = i;
        nid[d5i] += 1;
    }

    for i in 0 .. maxd + 1 {
        li[i as usize][nid[i as usize] as usize] = 1_000_000; // Sentinel.
    }

    let mut n_sols: u32 = 0;
    let mut l_minus_2i: u32 = L + 2;

    unsafe {
        for i in 0 .. L3 + 1 {
            let dif: u32 = 4 * POW10 + *d5.get_unchecked(L as usize) - *d5.get_unchecked(i as usize);
            let mind: u32 = (dif + 1) / 2;
            l_minus_2i -= 2;
            for vald in mind .. maxd + 1 {
                let mut start = *starts.get_unchecked(vald as usize);
                while *li.get_unchecked(vald as usize).get_unchecked(start) < i {
                    start += 1;
                }
                *starts.get_unchecked_mut(vald as usize) = start;

                for &j in li.get_unchecked(vald as usize)[start ..].iter() {
                    if j > l_minus_2i {
                        break;
                    }
                    let dif_minus_vald: u32 = dif - vald;
                    let k: u32 = L - i - j;

                    if *d5.get_unchecked(k as usize) >= dif_minus_vald &&
                       *e2.get_unchecked(L as usize) - *e2.get_unchecked(i as usize) -
                       *e2.get_unchecked(j as usize) - e2.get_unchecked(k as usize) >= POW10 {
                        if j == k {
                            n_sols += 6;
                        } else {
                            if i == j || i == k {
                                if *d5.get_unchecked(k as usize) >= mind && vald >= dif_minus_vald {
                                    n_sols += 3;
                                } else {
                                    n_sols += 6;
                                }
                            } else {
                                if *d5.get_unchecked(k as usize) >= mind && vald >= dif_minus_vald {
                                    n_sols += 6;
                                } else {
                                    n_sols += 12;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    n_sols / 2
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

This code with .get_unchecked() is slower than the regular checked code, do you know why? I’ve seen a similar outcome in many other situations.


#2

In your second example, I can’t be sure but I suspect the speed up might be CPU pipelining, the more operations you do that are “unrelated” the better the CPU can process them in paralllel while still being single threaded.


#3

Writing those Euler problem solutions in Rust (or converting them from other languages) is surely lot of fun.

This is a little Java function (from https://github.com/nayuki/Project-Euler-solutions/blob/master/java/p160.java ):

long factorialCoprime(long n) {
    n %= 100000;
    long product = 1;
    for (int i = 1; i <= n; i++) {
        if (i % 2 != 0 && i % 5 != 0)
            product = i * product % 100000;
    }
    return product;
}

At first sight you want to translate it to the straightforward Rust code:

fn factorial_coprime(mut n: i64) -> i64 {
    n %= 100_000;
    let mut product = 1;
    for i in 1 .. n + 1 {
        if i % 2 != 0 && i % 5 != 0 {
            product = i * product % 100_000;
        }
    }
    product
}

But it turns out the % among i64 is much slower even on a modern CPU (despite compilers implement %2 and %5 in a smart way), so the whole Rust program comes about twice slower. So in Rust I have to write a weird mixed-type while loop (I’ve also switched to unsigned values because there are no negative numbers in this program, but this doesn’t change the code performance much):

fn factorial_coprime(mut n: u64) -> u64 {
    n %= 100_000;
    let mut product = 1;
    let mut i: u32 = 1;
    while u64::from(i) <= n {
        if i % 2 != 0 && i % 5 != 0 {
            product = u64::from(i) * product % 100_000;
        }
        i += 1;
    }
    product
}

For this specific usage it’s even faster to compare u32 numbers:

fn factorial_coprime(mut n: u64) -> u64 {
    n %= 100_000;
    let mut product = 1;
    let mut i: u32 = 1;
    while i <= n as u32 {
        if i % 2 != 0 && i % 5 != 0 {
            product = u64::from(i) * product % 100_000;
        }
        i += 1;
    }
    product
}

This final version is just a bit slower than equivalent D code compiled with the LLVM back-end.

I program in system languages since lot of time, but the very strict type discipline of Rust is allowing me to better see the interactions between types and performance.


#4

Not an answer to first questions but hopefully informative;
Optimization and “nice” code in safe rust can be gained iterating over the vectors.

(I have only run basic timing --release so no idea what further compiler optimization offers.)
Changed from second solution;

#![feature(iter_arith)]

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    let mut prior = f5[0];
    for (n, v) in f5.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 5);
        *v = prior;
    }

    prior = f2[0];
    for (n, v) in f2.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 2);
        *v = prior;
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;

    f5.iter()
        .zip(f2.iter())
        .enumerate()
        .take(UMAX / 3)
        .map(|(a, (a5, a2))| {
            let mut answer = 0;
            let mut b = a;

            while a + 2 * b <= UMAX {
                let c = UMAX - a - b;
                let factors5 = a5 + f5[b] + f5[c];
                let factors2 = a2 + f2[b] + f2[c];

                if factors5 < fives && factors2 < twos {
                    if a == c {
                        answer += 1;
                    } else if a < b && b < c {
                        answer += 6;
                    } else {
                        answer += 3;
                    }
                }
                b += 1;
            }
            answer
        })
        .sum()
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

A simple cleaner change but runs slower than the while and b+=1

for b in (a..UMAX).take_while(|b| a + 2 * b <= UMAX) {

Finally wrote iterator avoiding second index access, runs much slower.

#![feature(iter_arith)]

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    let mut prior = f5[0];
    for (n, v) in f5.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 5);
        *v = prior;
    }

    prior = f2[0];
    for (n, v) in f2.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 2);
        *v = prior;
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;

    f5.iter()
        .zip(f2.iter())
        .enumerate()
        .take(UMAX / 3)
        .map(|(a, (a5, a2))| {
            let mut answer = 0;

            // b = a+d
            // doing skip after enumerate is slower
            // f5[a..].iter() is slower than f5.iter().skip(a)
            for (b, c, factors5, factors2) in f5.iter()
                .skip(a)
                .zip(f2.iter().skip(a))
                .enumerate()
                .take_while(|&(d, _)| a + 2 * (a + d) <= UMAX)
                // just to be different. map doesn't add much of value over keeping in block
                .map(|(d, (b5, b2))| {
                    let c = UMAX - a - (a + d);
                    // still with index access for c
                    (a + d, c, a5 + b5 + f5[c], a2 + b2 + f2[c])
                }) {

                if factors5 < fives && factors2 < twos {
                    if a == c {
                        answer += 1;
                    } else if a < b && b < c {
                        answers += 6;
                    } else {
                        answer += 3;
                    }
                }
            }
            answer
        })
        .sum()
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

#5

Doing the Euler problems is a goldmine of performance troubles. Now I’ve seen that the pow() function (used as 3u32.pow(5) or 5u64.pow(2)) is quite slow, and I guess it will need to be fixed. Currently I’m avoiding it, and this is bad. And I’d also like such pow() function to run at compile-time too, because I’d often like to use it to compute compile-time constants (D language avoids both problems).


This shows another performance problem, I am not sure if it can be avoided/solved. This solves the Euler problem 198 (again the original code is not mine). To me it looks like a perfect usage of f32::saturating_mul() because its result is used only if it’s smaller than SIZE (a 32 bit value), but I wasn’t able to use the saturating_mul() because (on 64 bit Windows, using the last nightly) it was “significantly slower” than using a 64 bit multiplication followed by a cast (in the “ambig_d as u32” line):

fn main() {
    const SIZE: u64 = 100_000_000;
    const BOUND_N: u32 = 1;
    const BOUND_D: u32 = 100;

    let mut stack = vec![];
    let mut left: (u32, u32) = (0, 1);
    let mut right: (u32, u32) = (1, 1);
    let mut count: u32 = 0;

    loop {
        // let ambig_d = left.1.saturating_mul(right.1 * 2); // Too slow.
        let ambig_d = u64::from(left.1) * u64::from(right.1 * 2);
        if ambig_d <= SIZE {
            let ambig_n = left.1 * right.0 + left.0 * right.1;
            if ambig_n * BOUND_D < BOUND_N * ambig_d as u32 {
                count += 1;
            }
            let med = (left.0 + right.0, left.1 + right.1);
            if med.0 * BOUND_D <= BOUND_N * med.1 {
                stack.push(left);
                left = med;
            } else {
                right = med;
            }
        } else {
            if let Some(aux) = stack.pop() {
                right = left;
                left = aux;
            } else {
                break;
            }
        }
    }

    println!("{}", count == 52_374_425);
}

By the way, I am finding the Rustc trivial_casts and trivial_numeric_casts warnings quite handy to keep my code clean, I keep them always active. Why are they disabled on default?


#6

Integer pow is slow? That’s strange. I took a quick look, and it’s probably worth filing a bug for that. It looks LLVM isn’t unrolling the loop correctly for some reason.


The current asm from your example with saturating_mul:

.LBB1_2:
	leal	(%r13,%r13), %ecx
	movl	%ebx, %eax
	mull	%ecx
	movl	%eax, %ecx
	cmovol	%r9d, %ecx
	cmpl	$100000001, %ecx
	movl	%r13d, %r14d
	movl	%r12d, %r15d
	jb	.LBB1_3

I’m not surprised this is worse than imulq+cmpq. Probably worth filing a bug. (Might need some benchmarking on other targets to figure out exactly how to tweak Rust and/or LLVM. I would guess this is specifically an x86-64 problem to some extent because IMULQ is ridiculously fast.)


trivial_casts and trivial_numeric_casts are off by default because the false positive rate was too high, especially without type ascription. See also https://github.com/rust-lang/rust/issues/23742 and https://github.com/rust-lang/rust/issues/23770 and https://github.com/rust-lang/rust/pull/23776 .


#7

Who is going to report this bug?


#8

I generally prefer to leave it for the person who discovered the issue to file a bug… but I’ll file it otherwise.


#9

I have not yet started to file Rust bugs. If I don’t file it “soon enough” (a day) feel free to submit it yourself.


#10

Please file all bugs you find! Thanks!


#11

Filed https://github.com/rust-lang/rust/issues/34947 , https://github.com/rust-lang/rust/issues/34948 .


#12

Also filed https://github.com/rust-lang/rust/issues/34949 .


#13

Thank you, very much! :slight_smile: