Removing panic generates more complex asm

I was trying to optimize my program when I found something I can't explain.

According to the Compiler Explorer (nightly, -C opt-level=3), f_unsafe generates many more instructions than f_safe:

pub fn f_safe(a: &mut [f64], b: &[f64]) {
    for i in 0..a.len() {
        a[i] += b[i];
    }
}

// assumes a and b have the same size
pub fn f_unsafe(a: &mut [f64], b: &[f64]) {
    for i in 0..a.len() {
        let a = unsafe{a.get_unchecked_mut(i)};
        *a += unsafe{b.get_unchecked(i)};
    }
}

I thought that f_unsafe would be exactly the same than f_safe, only without the length test and panic instructions. Why not? And why does this simple code take so many instructions?

I guess the compiler knows that a[i] is valid in the loop, so it should not introduce any panic. Replacing only b[i] by unsafe{b.get_unchecked(i)}, and keeping safe a[i] also generates longer ASM.

1 Like

It looks the optimizer was able to apply loop unrolling to f_unsafe, which makes it run faster, at the cost of more instructions in the generated code.

You'll probably see an even bigger speed boost if you use a -C target-cpu option that allows vectorization. (Example)

4 Likes

You might want to use opt-level=s (optimize for size) sometimes to render optimized assembly that is relatively human-readable. I use it frequently. You might give up a little performance, but then again, you might not -- the conventional wisdom is that -O2, -O3, and -Os (that is, -C opt-level=2, etc.) are all more or less equally good in different scenarios. I can't say I've tested it much myself, but I have seen cases where -O3 was slower than -O2, and I've heard of people using -Os to reduce instruction cache misses. Personally I usually start with opt-level=2 and fiddle with it if I find it's slow.

2 Likes

If I understand well, it generates several loops, each loop having a different number of iterations unrolled, and we choose the right loop depending on the slice's length?

If the slice is more likely to have a certain length (i.e. most of the time it will be 2 or 3), is there a way to tell the compiler what the "more optimized" length is, or do I have to implement it myself with if statements? (in that case I guess -Os may be the best option)

I believe it effectively generates just two loops, one that does N elements at a time, and then a second loop that does any remaining items at the end of the slice. Something like this:

pub fn f_unrolled(a: &mut [f64], b: &[f64]) {
    let mut i = 0;
    while i + 3 < a.len() {
        a[i + 0] += b[i + 0];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
        i += 4;
    }
    while i < a.len() {
        a[i] += b[i];
        i += 1;
    }
}
2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.