Performance of iterators vs. for-loops without bounds checks

I'm working on a project in which I need to find the minimum fractional period of a string. I have two naive algorithms that should run at similar speeds, but in my tests the second one, which uses slice comparison, is almost twice as fast. These are the two algorithms:

pub fn period_naive1(s: &[u8]) -> usize {
    let n = s.len();

    'outer: for i in 1..n {
        // i is a period iff s[j] == s[j + i] for all valid j.
        for j in 0..n - i {
            unsafe {
                // get_unchecked skips the bounds check on each access.
                if s.get_unchecked(j) != s.get_unchecked(j + i) {
                    continue 'outer;
                }
            }
        }
        return i;
    }

    n
}

pub fn period_naive2(s: &[u8]) -> usize {
    let n = s.len();
    for i in 1..n {
        // Compare the two overlapping slices in one shot.
        if s[..n - i] == s[i..] {
            return i;
        }
    }
    n
}
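For concreteness, here is a quick sanity check of what "minimum fractional period" means on a few inputs (my own sketch with example strings I chose, not part of the original post):

```rust
// Sanity check: the minimal period of "abcabcab" is 3, of "aaaa" is 1,
// and a string with no smaller period returns its own length.

pub fn period_naive2(s: &[u8]) -> usize {
    let n = s.len();
    for i in 1..n {
        // i is a period iff the string equals itself shifted by i.
        if s[..n - i] == s[i..] {
            return i;
        }
    }
    n
}

fn main() {
    assert_eq!(period_naive2(b"abcabcab"), 3);
    assert_eq!(period_naive2(b"aaaa"), 1);
    assert_eq!(period_naive2(b"abcd"), 4);
    println!("ok");
}
```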

The main difference is that the second algorithm compares whole slices (via iterators) instead of using an explicit indexed loop. I searched the net for possible causes, and the bounds checks on slice indexing seemed to be the culprit, but removing them (with `get_unchecked`, as above) doesn't produce any improvement.

Why do the two algorithms differ so much in speed?

Almost certainly because the second eventually calls memcmp, which is likely hand-optimised assembly making full use of platform-specific intrinsics.
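To illustrate the idea (this is my sketch of the mechanism, not the actual standard-library code path), a byte-slice `==` amounts to a length check followed by one bulk memcmp call, rather than a Rust-level loop:

```rust
// Sketch of what `s[..n - i] == s[i..]` roughly amounts to: a length
// check plus a single bulk comparison via the C library's memcmp,
// which is typically hand-vectorised for the target platform.

extern "C" {
    // The standard C memcmp: compares n bytes, returns 0 on equality.
    fn memcmp(s1: *const u8, s2: *const u8, n: usize) -> i32;
}

fn slices_equal(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len()
        && unsafe { memcmp(a.as_ptr(), b.as_ptr(), a.len()) == 0 }
}

fn main() {
    assert!(slices_equal(b"hello", b"hello"));
    assert!(!slices_equal(b"hello", b"hellp"));
    assert!(!slices_equal(b"hello", b"hell"));
    println!("ok");
}
```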

Edit: Specifically, I think the relevant bit of code is this particular implementation in the standard library.


Rustc is very good at optimizing idiomatic code. As to why exactly the second is faster: to me it looks like the slice comparison in the second is far less expensive (there's no inner loop in the assembly at all, AFAICT), though I'm not good at reading assembly. Here's the godbolt of your snippet, if you'd like to explore it yourself.


Thanks for your replies, @jofas and @DanielKeep; the reason for the performance difference is much clearer now.


As a meta-answer, LLVM is generally not as good at optimizing loops that have multiple exit conditions, like the one above with the continue in it.

So the first algorithm ends up as a very basic byte-per-iteration inner loop, while the second dispatches to a heavily optimized library routine.

For example, compare these two functions: https://rust.godbolt.org/z/6En5bY4Gh. The former ends up in a byte-at-a-time loop, whereas the latter is rather more complicated but does a 16-bytes-at-once loop.
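The kind of pair being compared looks roughly like this (my reconstruction for illustration, not necessarily the exact snippet behind the godbolt link):

```rust
// Two semantically identical equality checks. The first has multiple
// exit conditions (length mismatch, per-byte mismatch), which tends to
// keep it as a byte-at-a-time loop; the second lowers to a memcmp-style
// bulk comparison that can examine many bytes per iteration.

pub fn eq_loop(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    for i in 0..a.len() {
        if a[i] != b[i] {
            return false; // early exit on the first mismatch
        }
    }
    true
}

pub fn eq_bulk(a: &[u8], b: &[u8]) -> bool {
    a == b
}

fn main() {
    // Both formulations agree on equal and unequal inputs.
    assert!(eq_loop(b"abcabc", b"abcabc") && eq_bulk(b"abcabc", b"abcabc"));
    assert!(!eq_loop(b"abcabc", b"abcabd") && !eq_bulk(b"abcabc", b"abcabd"));
    assert!(!eq_loop(b"abc", b"ab") && !eq_bulk(b"abc", b"ab"));
    println!("ok");
}
```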

