How does Rust get this better loop vectorization?

As a newcomer to Rust, I wanted to try out some of the higher-level abstractions to see if it fits my daily work as an algorithm developer. I started with something simple: allocate a 100x100 image of u32, draw a white circle in the middle, and then benchmark counting all the white pixels. This is the Rust code for the loop:

pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    
    count
}
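
For context, the test image described above could be built roughly like this. This is only a sketch: the function name `make_circle_image`, the `radius` parameter, and the choice of a filled disk (rather than an outline) are assumptions, not the code that was actually benchmarked.

```rust
/// Side length of the square test image (100x100, as in the post).
const SIZE: usize = 100;
/// Pixel value treated as "white".
const WHITE: u32 = 0xffff_ffff;

/// Builds a flat, row-major SIZE x SIZE image that is all zeros except for
/// a filled white disk centred in the image.
/// `radius` is a hypothetical parameter; the post does not give one.
pub fn make_circle_image(radius: u32) -> Vec<u32> {
    let mut image = vec![0u32; SIZE * SIZE];
    let center = (SIZE / 2) as i64;
    let r2 = (radius as i64) * (radius as i64);
    for y in 0..SIZE {
        for x in 0..SIZE {
            let dx = x as i64 - center;
            let dy = y as i64 - center;
            // Inside (or on) the circle: squared distance <= squared radius.
            if dx * dx + dy * dy <= r2 {
                image[y * SIZE + x] = WHITE;
            }
        }
    }
    image
}
```

The resulting `Vec<u32>` can be passed to `count_whites` as `&image`.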

I found the Rust version to be 2x faster than my C++ implementation:

unsigned int count_whites(const std::vector<unsigned int>& flat_image) {
    unsigned int count = 0;

    for (const unsigned int& elem : flat_image)
        if (elem == 0xffffffff)
            count++;

    return count;
}

(Measured with cargo bench for Rust and Google Benchmark for the C++ version, compiled with gcc.)

Digging through the assembly, it looks like Rust (through LLVM?) uses more xmm registers for the loop unrolling. On the other hand, it's possible that something was optimized away just for the benchmark. I used criterion::black_box in several places, but couldn't get the Rust version to slow down.
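
To illustrate the kind of guard meant here, a minimal sketch follows. It assumes `std::hint::black_box` from the standard library in place of `criterion::black_box`, and a hand-rolled timing loop in place of the real criterion harness; `time_count_whites` and `iters` are hypothetical names.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}

/// Calls `count_whites` `iters` times and returns the last result together
/// with the total elapsed time. Hypothetical helper, not the criterion setup.
pub fn time_count_whites(flat_image: &[u32], iters: u32) -> (u32, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        // black_box on the input asks the optimizer not to assume anything
        // about the slice; black_box on the output asks it not to treat the
        // result as unused. Both are best-effort hints, not guarantees.
        last = black_box(count_whites(black_box(flat_image)));
    }
    (last, start.elapsed())
}
```

If the timings stay the same with and without these guards, that suggests the loop is not simply being optimized away.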

So, here's my question: is it a known thing that loop auto-vectorization is faster in LLVM compared to gcc? Or maybe just for a certain class of cases? Or is there any other reason for this surprisingly nice behaviour? I found some material on auto-vectorization in LLVM, but no comparison with gcc. I would also appreciate homework in the form of links.

Update: I could get the same performance from gcc by using -mavx2, but that can't be the answer, since the Rust version doesn't use ymm registers.

Have you tried compiling the C code with clang?

Thanks for the suggestion. I tried it now, and it seems to produce assembly code very similar to what Rust produces (both with -O3) and gets the same performance. So the answer seems to be that LLVM has a better auto-vectorizer, at least for this case.

Yeah, this is typically the answer to this kind of question.

Everything about auto-vectorization happens in LLVM, so clang and Rust tend to produce the same thing (modulo bounds checking, wrapping arithmetic, and such).

For fun, note that you can get the good codegen with iterators too:

pub fn count_whites_2(flat_image: &[u32]) -> usize {
    flat_image
        .iter()
        .filter(|pix| **pix == 0xffffffff)
        .count()
}
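
As a quick sanity check that the two versions compute the same thing, something like the sketch below could be used. Note that the loop version returns u32 while the iterator version returns usize, so one side has to be converted; `versions_agree` is a hypothetical helper added only for illustration.

```rust
pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}

pub fn count_whites_2(flat_image: &[u32]) -> usize {
    flat_image
        .iter()
        .filter(|pix| **pix == 0xffffffff)
        .count()
}

/// Hypothetical helper: true when the loop and iterator versions return
/// the same count for the given image.
pub fn versions_agree(flat_image: &[u32]) -> bool {
    // Widen the u32 result to usize so the two counts can be compared.
    count_whites(flat_image) as usize == count_whites_2(flat_image)
}
```

This only checks the results, not the generated code; comparing the assembly is still the way to confirm both vectorize the same way.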

I think the answer, though not a very satisfying one, is that LLVM does some things better than GCC and vice versa. You might’ve found a case where it’s the former. I wouldn’t draw any general conclusions, though.
