Unexpected performance from array bound tests and more

leonardo · June 30, 2016, 9:39am

I'm implementing many solutions in Rust for Euler Problems, and I'm seeing many interesting phenomena.

All the code of this post is compiled with:

-C opt-level=3 -C prefer-dynamic -C target-cpu=native -Z orbit

(None of the code in this post was originally mine, I've converted it to Rust from other languages, and I've tried to improve it a little in various ways. Generally the resulting Rust code is hugely more clear).

This is a solution for Problem 154 ( #154 Exploring Pascal's pyramid - Project Euler ):

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    for n in 1 .. MAX + 1 {
        f5[n as usize] = f5[n as usize - 1] + rep_div(n, 5);
        f2[n as usize] = f2[n as usize - 1] + rep_div(n, 2);
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;
    let mut answer = 0;

    for a in 0 .. MAX / 3 {
        let mut b = a;
        while a + 2 * b <= MAX {
            let c = MAX - a - b;
            let factors5 = f5[a as usize] + f5[b as usize] + f5[c as usize];
            let factors2 = f2[a as usize] + f2[b as usize] + f2[c as usize];

            if factors5 < fives && factors2 < twos {
                if a == c {
                    answer += 1;
                } else if a < b && b < c {
                    answer += 6;
                } else {
                    answer += 3;
                }
            }
            b += 1;
        }
    }

    answer
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

This is the same code, but the answer is accumulated with an intermediate step, before the final sum():

#![feature(iter_arith)]

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    for n in 1 .. MAX + 1 {
        f5[n as usize] = f5[n as usize - 1] + rep_div(n, 5);
        f2[n as usize] = f2[n as usize - 1] + rep_div(n, 2);
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;

    (0 .. MAX / 3)
    .map(|a| {
        let mut answer = 0;
        let mut b = a;
        while a + 2 * b <= MAX {
            let c = MAX - a - b;
            let factors5 = f5[a as usize] + f5[b as usize] + f5[c as usize];
            let factors2 = f2[a as usize] + f2[b as usize] + f2[c as usize];

            if factors5 < fives && factors2 < twos {
                if a == c {
                    answer += 1;
                } else if a < b && b < c {
                    answer += 6;
                } else {
                    answer += 3;
                }
            }
            b += 1;
        }
        answer
    })
    .sum()
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

For me it's a surprise to see this second version is quite faster than the first (If you have multi-thread code, having a local accumulator is a significant advantage, but that's single thread code). Do you have an idea of the causes?

This is a faster solution (it's a jungle of casts, but if I use usize types to remove most of them, the code slows down):

fn e154() -> u32 {
    const L: u32 = 200_000;
    const POW10: u32 = 12;
    const L3: u32 = L / 3;

    let mut e2 = vec![0u32; L as usize + 1];

    let mut pow: u32 = 2;
    while pow <= L {
        let mut mult: u32 = pow;
        while mult <= L {
            e2[mult as usize] += 1;
            mult += pow;
        }
        pow *= 2;
    }

    let mut e5 = [0u32; L as usize + 1];

    pow = 5;
    while pow <= L {
        let mut mult = pow;
        while mult <= L {
            e5[mult as usize] += 1;
            mult += pow;
        }
        pow *= 5;
    }

    for i in 1u32 .. L + 1 {
        e2[i as usize] = e2[i as usize] + e2[i as usize - 1];
        e5[i as usize] = e5[i as usize] + e5[i as usize - 1];
    }

    let mut d5 = vec![0u32; L as usize + 1];

    for i in 0u32 .. L + 1 {
        d5[i as usize] = i - 4 * e5[i as usize];
    }

    // Find max val of d5.
    let mut maxd: u32 = 0;
    for i in 0u32 .. L + 1 {
        if d5[i as usize] > maxd {
            maxd = d5[i as usize];
        }
    }

    // nid[d] is the number of i for which d[i] = d.
    let mut nid = vec![0u32; maxd as usize + 1];

    for i in 0u32 .. L {
        nid[d5[i as usize] as usize] += 1;
    }

    // li[v] is the sorted list of i for which d5[i] = v.
    let mut li = (0 .. maxd as usize + 1)
                 .map(|i| vec![0u32; nid[i] as usize + 2])
                 .collect::<Vec<_>>();
    let mut starts = vec![0usize; maxd as usize + 1];

    for x in nid.iter_mut() { *x = 0; }

    for i in 0 .. L + 1 {
        let d5i = d5[i as usize] as usize;
        li[d5i][nid[d5i] as usize] = i;
        nid[d5i] += 1;
    }

    for i in 0 .. maxd + 1 {
        li[i as usize][nid[i as usize] as usize] = 1_000_000; // Sentinel.
    }

    let mut n_sols: u32 = 0;
    let mut l_minus_2i: u32 = L + 2;

    for i in 0 .. L3 + 1 {
        let dif: u32 = 4 * POW10 + d5[L as usize] - d5[i as usize];
        let mind: u32 = (dif + 1) / 2;
        l_minus_2i -= 2;
        for vald in mind .. maxd + 1 {
            let mut start = starts[vald as usize];
            while li[vald as usize][start] < i {
                start += 1;
            }
            starts[vald as usize] = start;

            for &j in li[vald as usize][start ..].iter() {
                if j > l_minus_2i {
                    break;
                }
                let dif_minus_vald: u32 = dif - vald;
                let k: u32 = L - i - j;

                if d5[k as usize] >= dif_minus_vald &&
                   e2[L as usize] - e2[i as usize] - e2[j as usize] - e2[k as usize] >= POW10 {
                    if j == k {
                        n_sols += 6;
                    } else {
                        if i == j || i == k {
                            if d5[k as usize] >= mind && vald >= dif_minus_vald {
                                n_sols += 3;
                            } else {
                                n_sols += 6;
                            }
                        } else {
                            if d5[k as usize] >= mind && vald >= dif_minus_vald {
                                n_sols += 6;
                            } else {
                                n_sols += 12;
                            }
                        }
                    }
                }
            }
        }
    }

    n_sols / 2
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

To experiment I've replaced the array accesses in the main part of the code with .get_unchecked()/.get_unchecked_mut():

fn e154() -> u32 {
    const L: u32 = 200_000;
    const POW10: u32 = 12;
    const L3: u32 = L / 3;

    let mut e2 = vec![0u32; L as usize + 1];

    let mut pow: u32 = 2;
    while pow <= L {
        let mut mult: u32 = pow;
        while mult <= L {
            e2[mult as usize] += 1;
            mult += pow;
        }
        pow *= 2;
    }

    let mut e5 = [0u32; L as usize + 1];

    pow = 5;
    while pow <= L {
        let mut mult = pow;
        while mult <= L {
            e5[mult as usize] += 1;
            mult += pow;
        }
        pow *= 5;
    }

    for i in 1u32 .. L + 1 {
        e2[i as usize] = e2[i as usize] + e2[i as usize - 1];
        e5[i as usize] = e5[i as usize] + e5[i as usize - 1];
    }

    let mut d5 = vec![0u32; L as usize + 1];

    for i in 0u32 .. L + 1 {
        d5[i as usize] = i - 4 * e5[i as usize];
    }

    // Find max val of d5.
    let mut maxd: u32 = 0;
    for i in 0u32 .. L + 1 {
        if d5[i as usize] > maxd {
            maxd = d5[i as usize];
        }
    }

    // nid[d] is the number of i for which d[i] = d.
    let mut nid = vec![0u32; maxd as usize + 1];

    for i in 0u32 .. L {
        nid[d5[i as usize] as usize] += 1;
    }

    // li[v] is the sorted list of i for which d5[i] = v.
    let mut li = (0 .. maxd as usize + 1)
                 .map(|i| vec![0u32; nid[i] as usize + 2])
                 .collect::<Vec<_>>();
    let mut starts = vec![0usize; maxd as usize + 1];

    for x in nid.iter_mut() { *x = 0; }

    for i in 0 .. L + 1 {
        let d5i = d5[i as usize] as usize;
        li[d5i][nid[d5i] as usize] = i;
        nid[d5i] += 1;
    }

    for i in 0 .. maxd + 1 {
        li[i as usize][nid[i as usize] as usize] = 1_000_000; // Sentinel.
    }

    let mut n_sols: u32 = 0;
    let mut l_minus_2i: u32 = L + 2;

    unsafe {
        for i in 0 .. L3 + 1 {
            let dif: u32 = 4 * POW10 + *d5.get_unchecked(L as usize) - *d5.get_unchecked(i as usize);
            let mind: u32 = (dif + 1) / 2;
            l_minus_2i -= 2;
            for vald in mind .. maxd + 1 {
                let mut start = *starts.get_unchecked(vald as usize);
                while *li.get_unchecked(vald as usize).get_unchecked(start) < i {
                    start += 1;
                }
                *starts.get_unchecked_mut(vald as usize) = start;

                for &j in li.get_unchecked(vald as usize)[start ..].iter() {
                    if j > l_minus_2i {
                        break;
                    }
                    let dif_minus_vald: u32 = dif - vald;
                    let k: u32 = L - i - j;

                    if *d5.get_unchecked(k as usize) >= dif_minus_vald &&
                       *e2.get_unchecked(L as usize) - *e2.get_unchecked(i as usize) -
                       *e2.get_unchecked(j as usize) - e2.get_unchecked(k as usize) >= POW10 {
                        if j == k {
                            n_sols += 6;
                        } else {
                            if i == j || i == k {
                                if *d5.get_unchecked(k as usize) >= mind && vald >= dif_minus_vald {
                                    n_sols += 3;
                                } else {
                                    n_sols += 6;
                                }
                            } else {
                                if *d5.get_unchecked(k as usize) >= mind && vald >= dif_minus_vald {
                                    n_sols += 6;
                                } else {
                                    n_sols += 12;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    n_sols / 2
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

This code with .get_unchecked() is slower than the regular checked code, do you know why? I've seen a similar outcome in many other situations.

pixel · June 30, 2016, 11:50am

In your second example, I can't be sure but I suspect the speed up might be CPU pipelining, the more operations you do that are "unrelated" the better the CPU can process them in paralllel while still being single threaded.

leonardo · July 1, 2016, 4:06pm

Writing those Euler problem solutions in Rust (or converting them from other languages) is surely lot of fun.

This is a little Java function (from Project-Euler-solutions/p160.java at master · nayuki/Project-Euler-solutions · GitHub ):

long factorialCoprime(long n) {
    n %= 100000;
    long product = 1;
    for (int i = 1; i <= n; i++) {
        if (i % 2 != 0 && i % 5 != 0)
            product = i * product % 100000;
    }
    return product;
}

At first sight you want to translate it to the straightforward Rust code:

fn factorial_coprime(mut n: i64) -> i64 {
    n %= 100_000;
    let mut product = 1;
    for i in 1 .. n + 1 {
        if i % 2 != 0 && i % 5 != 0 {
            product = i * product % 100_000;
        }
    }
    product
}

But it turns out the % among i64 is much slower even on a modern CPU (despite compilers implement %2 and %5 in a smart way), so the whole Rust program comes about twice slower. So in Rust I have to write a weird mixed-type while loop (I've also switched to unsigned values because there are no negative numbers in this program, but this doesn't change the code performance much):

fn factorial_coprime(mut n: u64) -> u64 {
    n %= 100_000;
    let mut product = 1;
    let mut i: u32 = 1;
    while u64::from(i) <= n {
        if i % 2 != 0 && i % 5 != 0 {
            product = u64::from(i) * product % 100_000;
        }
        i += 1;
    }
    product
}

For this specific usage it's even faster to compare u32 numbers:

fn factorial_coprime(mut n: u64) -> u64 {
    n %= 100_000;
    let mut product = 1;
    let mut i: u32 = 1;
    while i <= n as u32 {
        if i % 2 != 0 && i % 5 != 0 {
            product = u64::from(i) * product % 100_000;
        }
        i += 1;
    }
    product
}

This final version is just a bit slower than equivalent D code compiled with the LLVM back-end.

I program in system languages since lot of time, but the very strict type discipline of Rust is allowing me to better see the interactions between types and performance.

jonh · July 2, 2016, 4:18pm

Not an answer to first questions but hopefully informative;
Optimization and "nice" code in safe rust can be gained iterating over the vectors.

(I have only run basic timing --release so no idea what further compiler optimization offers.)
Changed from second solution;

#![feature(iter_arith)]

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    let mut prior = f5[0];
    for (n, v) in f5.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 5);
        *v = prior;
    }

    prior = f2[0];
    for (n, v) in f2.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 2);
        *v = prior;
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;

    f5.iter()
        .zip(f2.iter())
        .enumerate()
        .take(UMAX / 3)
        .map(|(a, (a5, a2))| {
            let mut answer = 0;
            let mut b = a;

            while a + 2 * b <= UMAX {
                let c = UMAX - a - b;
                let factors5 = a5 + f5[b] + f5[c];
                let factors2 = a2 + f2[b] + f2[c];

                if factors5 < fives && factors2 < twos {
                    if a == c {
                        answer += 1;
                    } else if a < b && b < c {
                        answer += 6;
                    } else {
                        answer += 3;
                    }
                }
                b += 1;
            }
            answer
        })
        .sum()
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

A simple cleaner change but runs slower than the while and b+=1

for b in (a..UMAX).take_while(|b| a + 2 * b <= UMAX) {

Finally wrote iterator avoiding second index access, runs much slower.

#![feature(iter_arith)]

fn e154() -> u32 {
    fn rep_div(mut n: u32, i: u32) -> u32 {
        let mut result = 0;
        while n % i == 0 {
            result += 1;
            n /= i;
        }
        return result;
    }

    const MAX: u32 = 200_000;
    const UMAX: usize = MAX as usize;
    let mut f5 = vec![0u32; UMAX + 1];
    let mut f2 = vec![0u32; UMAX + 1];

    let mut prior = f5[0];
    for (n, v) in f5.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 5);
        *v = prior;
    }

    prior = f2[0];
    for (n, v) in f2.iter_mut().enumerate().skip(1) {
        prior = prior + rep_div(n as u32, 2);
        *v = prior;
    }

    let fives = f5[UMAX] - 11;
    let twos = f2[UMAX] - 11;

    f5.iter()
        .zip(f2.iter())
        .enumerate()
        .take(UMAX / 3)
        .map(|(a, (a5, a2))| {
            let mut answer = 0;

            // b = a+d
            // doing skip after enumerate is slower
            // f5[a..].iter() is slower than f5.iter().skip(a)
            for (b, c, factors5, factors2) in f5.iter()
                .skip(a)
                .zip(f2.iter().skip(a))
                .enumerate()
                .take_while(|&(d, _)| a + 2 * (a + d) <= UMAX)
                // just to be different. map doesn't add much of value over keeping in block
                .map(|(d, (b5, b2))| {
                    let c = UMAX - a - (a + d);
                    // still with index access for c
                    (a + d, c, a5 + b5 + f5[c], a2 + b2 + f2[c])
                }) {

                if factors5 < fives && factors2 < twos {
                    if a == c {
                        answer += 1;
                    } else if a < b && b < c {
                        answers += 6;
                    } else {
                        answer += 3;
                    }
                }
            }
            answer
        })
        .sum()
}

fn main() {
    assert_eq!(e154(), 479_742_450);
}

leonardo · July 17, 2016, 5:38pm

Doing the Euler problems is a goldmine of performance troubles. Now I've seen that the pow() function (used as 3u32.pow(5) or 5u64.pow(2)) is quite slow, and I guess it will need to be fixed. Currently I'm avoiding it, and this is bad. And I'd also like such pow() function to run at compile-time too, because I'd often like to use it to compute compile-time constants (D language avoids both problems).

This shows another performance problem, I am not sure if it can be avoided/solved. This solves the Euler problem 198 (again the original code is not mine). To me it looks like a perfect usage of f32::saturating_mul() because its result is used only if it's smaller than SIZE (a 32 bit value), but I wasn't able to use the saturating_mul() because (on 64 bit Windows, using the last nightly) it was "significantly slower" than using a 64 bit multiplication followed by a cast (in the "ambig_d as u32" line):

fn main() {
    const SIZE: u64 = 100_000_000;
    const BOUND_N: u32 = 1;
    const BOUND_D: u32 = 100;

    let mut stack = vec![];
    let mut left: (u32, u32) = (0, 1);
    let mut right: (u32, u32) = (1, 1);
    let mut count: u32 = 0;

    loop {
        // let ambig_d = left.1.saturating_mul(right.1 * 2); // Too slow.
        let ambig_d = u64::from(left.1) * u64::from(right.1 * 2);
        if ambig_d <= SIZE {
            let ambig_n = left.1 * right.0 + left.0 * right.1;
            if ambig_n * BOUND_D < BOUND_N * ambig_d as u32 {
                count += 1;
            }
            let med = (left.0 + right.0, left.1 + right.1);
            if med.0 * BOUND_D <= BOUND_N * med.1 {
                stack.push(left);
                left = med;
            } else {
                right = med;
            }
        } else {
            if let Some(aux) = stack.pop() {
                right = left;
                left = aux;
            } else {
                break;
            }
        }
    }

    println!("{}", count == 52_374_425);
}

By the way, I am finding the Rustc trivial_casts and trivial_numeric_casts warnings quite handy to keep my code clean, I keep them always active. Why are they disabled on default?

eefriedman · July 17, 2016, 7:17pm

Integer pow is slow? That's strange. I took a quick look, and it's probably worth filing a bug for that. It looks LLVM isn't unrolling the loop correctly for some reason.

The current asm from your example with saturating_mul:

.LBB1_2:
	leal	(%r13,%r13), %ecx
	movl	%ebx, %eax
	mull	%ecx
	movl	%eax, %ecx
	cmovol	%r9d, %ecx
	cmpl	$100000001, %ecx
	movl	%r13d, %r14d
	movl	%r12d, %r15d
	jb	.LBB1_3

I'm not surprised this is worse than imulq+cmpq. Probably worth filing a bug. (Might need some benchmarking on other targets to figure out exactly how to tweak Rust and/or LLVM. I would guess this is specifically an x86-64 problem to some extent because IMULQ is ridiculously fast.)

trivial_casts and trivial_numeric_casts are off by default because the false positive rate was too high, especially without type ascription. See also https://github.com/rust-lang/rust/issues/23742 and https://github.com/rust-lang/rust/issues/23770 and https://github.com/rust-lang/rust/pull/23776 .

StefanoD · July 17, 2016, 9:38pm

Who is going to report this bug?

eefriedman · July 17, 2016, 9:47pm

I generally prefer to leave it for the person who discovered the issue to file a bug... but I'll file it otherwise.

leonardo · July 17, 2016, 10:12pm

I have not yet started to file Rust bugs. If I don't file it "soon enough" (a day) feel free to submit it yourself.

StefanoD · July 17, 2016, 10:42pm

Please file all bugs you find! Thanks!

eefriedman · July 21, 2016, 3:04am

Filed https://github.com/rust-lang/rust/issues/34947 , https://github.com/rust-lang/rust/issues/34948 .

eefriedman · July 21, 2016, 4:02am

Also filed https://github.com/rust-lang/rust/issues/34949 .

StefanoD · July 21, 2016, 5:29am

Thank you, very much!

Topic		Replies	Views
Rust newbie: Algorithm performance question code review	11	683	November 15, 2020
Looking for help understanding Rust's performance vs C++ community	28	6968	November 1, 2019
A performance problem compared with Julia	22	3073	February 23, 2021
Why my Rust multithreaded solution is slow as compared to the same c++ solution? code review	7	805	September 15, 2023
Fast Euler Solutions in Rust	1	1971	January 12, 2023

Unexpected performance from array bound tests and more

Related Topics