Match affects unrolling?


#1

I’ve got two functions looping over a buffer, counting non-whitespace bytes. One uses an iterator; the other uses indexed access (unchecked!):

fn iter(buffer: &[u8], len: usize) -> usize {
    let mut count = 0;
    for &byte in &buffer[0..len] {
        if !is_whitespace(byte) {
            count += 1;
        }
    }
    count
}

fn index_uncheck(buffer: &[u8], len: usize) -> usize {
    let mut count = 0;
    for index in 0..len {
        unsafe {
            if !is_whitespace(*buffer.get_unchecked(index)) {
                count += 1;
            }
        }
    }
    count
}

In theory they should have almost identical performance, but in practice iter() is ~13% faster on my test data. As @bluss on the #rust channel suggested, the reason is that index_uncheck() doesn’t get unrolled by the optimizer for some reason. And indeed, looking at the assembly output, iter() has these two similar-looking chunks, while index_uncheck() has only one:

.LBB0_7:
    mov    cl, byte ptr [r8]
    add    cl, -9
    movzx    edx, cl
    cmp    edx, 24
    jae    .LBB0_8
    mov    edx, 8388627
    shr    edx, cl
    not    edx
    and    edx, 1
    add    rax, rdx
    jmp    .LBB0_9
    .align    16, 0x90
.LBB0_8:
    inc    rax
.LBB0_9:
    mov    cl, byte ptr [r8 + 1]
    add    cl, -9
    movzx    edx, cl
    cmp    edx, 24
    jae    .LBB0_10
    mov    edx, 8388627
    shr    edx, cl
    not    edx
    and    edx, 1
    add    rax, rdx
    jmp    .LBB0_13
    .align    16, 0x90
.LBB0_10:
    inc    rax
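
Incidentally, that 8388627 constant is 0x80_0013: after subtracting 9, its set bits 0, 1, 4 and 23 map back to the bytes 9, 10, 13 and 32. A rough Rust rendering of the range-check-plus-bitmask trick LLVM emitted (a sketch for illustration, not the compiler’s actual output):

```rust
// Sketch of the test in the assembly above: shift the value into
// the range 0..24, then test one bit of 8388627 (0x80_0013).
// Set bits 0, 1, 4, 23 correspond to bytes 9, 10, 13, 32.
fn is_whitespace_bitmask(value: u8) -> bool {
    let shifted = value.wrapping_sub(9);
    shifted < 24 && (8388627u32 >> shifted) & 1 == 1
}
```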

And now for a really interesting twist :slight_smile:

Here’s the definition of is_whitespace():

#[inline(always)]
fn is_whitespace(value: u8) -> bool {
    match value {
        9 | 10 | 13 | 32 => true,
        _ => false,
    }
}

If I change it from using match to OR-ed condition:

#[inline(always)]
fn is_whitespace(value: u8) -> bool {
    value == 9 || value == 10 || value == 13 || value == 32
}

… then suddenly both loops get unrolled and start performing pretty much identically. Also, the assembly output looks drastically different: it’s now full of SIMD instructions on XMM registers. That makes no difference to the final performance, though…

Questions:

  • Why is this happening? Some uncontrollable LLVM magic?
  • Hypothetically speaking, where should one start looking to fix this in the compiler? Or is MIR going to fix it automagically, right after ending world hunger and curing cancer? :slight_smile:

P.S. Forgot to mention, sorry: this is on yesterday’s nightly, built with cargo --release.


#2

Almost certainly, yes. LLVM has heuristics for loop unrolling etc. that are apparently not triggering in the best way for this piece of code. They probably incorporate things like instruction counts, so small tweaks that push the code above/below the thresholds will result in different output.

You could try narrowing down the issue by, e.g., reducing the number of possibilities in the match and the ||s, or tweaking the values, to find the simplest reproduction.
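For instance, a hypothetical first reduction step might shrink the match to two arms and check whether the unrolling behaviour flips (is_whitespace_two is an illustrative name, not anything from the original code):

```rust
// Hypothetical reduced version for bisecting the heuristic: the
// same shape as is_whitespace(), but with only two arms.
#[inline(always)]
fn is_whitespace_two(value: u8) -> bool {
    match value {
        9 | 32 => true,
        _ => false,
    }
}
```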

The Rust compiler itself performs essentially no optimisation; it just relies on LLVM.

This may be an artifact of exactly how we decide to lower a match versus a chain of ||s to LLVM IR, so tweaking that lowering may address it.

The difference between || and match is likely LLVM finding a local maximum for the match that causes it to miss a better global optimum: it optimises the match version in isolation, before inlining, into a form that can’t be SIMD’d after it is inlined, while the || version stays in a form that can be.

(Also, the autovectorised SIMD looks… ridiculously inefficient: it seems to process 4 bytes at a time with a pile of unnecessary data movement, when it could be doing 16 bytes with essentially no data movement.)
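To illustrate that last point, a hand-rolled 16-bytes-at-a-time version might look like the following sketch. It assumes the std::arch SSE2 intrinsics (baseline on x86_64) and is only meant to show the shape of “16 bytes with essentially no data movement”, not to be tuned code:

```rust
#[cfg(target_arch = "x86_64")]
fn count_non_whitespace_sse2(buffer: &[u8]) -> usize {
    use std::arch::x86_64::*;

    fn is_whitespace(value: u8) -> bool {
        match value {
            9 | 10 | 13 | 32 => true,
            _ => false,
        }
    }

    let mut count = 0usize;
    let mut chunks = buffer.chunks_exact(16);
    for chunk in &mut chunks {
        // Safety: SSE2 is always available on x86_64, and each chunk
        // is exactly 16 bytes, so the unaligned load is in bounds.
        unsafe {
            let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
            // Byte-wise equality against each whitespace value, OR-ed
            // together: a lane is 0xFF iff the byte is whitespace.
            let ws = _mm_or_si128(
                _mm_or_si128(
                    _mm_cmpeq_epi8(v, _mm_set1_epi8(9)),
                    _mm_cmpeq_epi8(v, _mm_set1_epi8(10)),
                ),
                _mm_or_si128(
                    _mm_cmpeq_epi8(v, _mm_set1_epi8(13)),
                    _mm_cmpeq_epi8(v, _mm_set1_epi8(32)),
                ),
            );
            // One mask bit per lane; non-whitespace lanes are zero bits.
            let mask = _mm_movemask_epi8(ws) as u32;
            count += 16 - mask.count_ones() as usize;
        }
    }
    // Scalar tail for the remaining < 16 bytes.
    count += chunks.remainder().iter().filter(|&&b| !is_whitespace(b)).count();
    count
}
```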


#3

Thanks! Couldn’t wish for a more comprehensive explanation.

In practical terms everything boils down to 1) profiling bottlenecks and 2) fiddling with different approaches. But it’s nice to know what can affect the heuristics.