How does Rust get this better loop vectorization?

yosefm · September 14, 2019, 8:21am

As a newcomer to Rust, I wanted to try out some of the higher-level abstractions to see whether the language fits my daily work as an algorithm developer. I started with something simple: allocate a 100x100 image of u32, draw a white circle in the middle, and then benchmark counting all the white pixels. This is the Rust code for the loop:

pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    
    count
}
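
For readers who want to reproduce the setup, here is a minimal sketch of the image described above. The helper name make_circle_image, the radius of 30, and the black background value 0 are assumptions for illustration; the post only specifies a 100x100 u32 image with a white circle in the middle.

```rust
/// The counting loop from the post: counts pixels equal to 0xffffffff.
pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}

/// Hypothetical helper (not from the post): builds a `side` x `side` image,
/// stored row by row, that is black (0) except for a filled white circle
/// centred in the image.
pub fn make_circle_image(side: usize, radius: usize) -> Vec<u32> {
    let mut image = vec![0u32; side * side];
    let centre = (side / 2) as i64;
    let r2 = (radius * radius) as i64;
    for y in 0..side {
        for x in 0..side {
            let dx = x as i64 - centre;
            let dy = y as i64 - centre;
            // A pixel is white if it lies inside or on the circle.
            if dx * dx + dy * dy <= r2 {
                image[y * side + x] = 0xffffffff;
            }
        }
    }
    image
}
```

Calling count_whites(&make_circle_image(100, 30)) then exercises the same loop on a flat slice of 10,000 pixels.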

I found the Rust version to be 2x faster than my C++ implementation:

unsigned int count_whites(const std::vector<unsigned int>& flat_image) {
    unsigned int count = 0;

    for (const unsigned int& elem : flat_image)
        if (elem == 0xffffffff)
            count++;

    return count;
}

(Benchmarked with cargo bench for Rust and Google Benchmark for the gcc build.)

Digging through the assembly, it looks like Rust (through LLVM?) uses more xmm registers for the loop unrolling. On the other hand, it's possible that something was optimized away just for the benchmark. I used criterion::black_box in several places and couldn't get the Rust version to slow down.
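
As an illustration of guarding a measurement against being optimized away, here is a rough sketch that uses std::hint::black_box from the standard library in place of criterion::black_box. The helper name time_count_whites and the hand-rolled timing loop are assumptions for illustration, not the benchmark from the post, and black_box is only a best-effort hint to the optimizer.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// The counting loop from the post.
pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}

/// Hypothetical helper: calls `count_whites` `iters` times and returns the
/// last result together with the total elapsed time.
pub fn time_count_whites(flat_image: &[u32], iters: u32) -> (u32, Duration) {
    let mut last = 0;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box on the input discourages the compiler from hoisting or
        // constant-folding the call; black_box on the output marks the
        // result as used so the call is not discarded.
        last = black_box(count_whites(black_box(flat_image)));
    }
    (last, start.elapsed())
}
```

A proper harness such as criterion adds warm-up and statistics on top of this; the sketch only shows where the black_box calls sit.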

So, here's my question: is it a known thing that loop auto-vectorization is faster in LLVM compared to gcc? Or maybe just for a certain class of cases? Or is there any other reason for this surprisingly nice behaviour? I found some material on auto vectorization in LLVM, but no comparison with gcc. I would also appreciate homework in the form of links.

Update: I could get the same performance from gcc by using -mavx2, but that can't be the answer, since the Rust version doesn't use ymm registers.
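
For comparison on the Rust side: the default x86_64 targets only assume SSE2, which is consistent with the xmm-only code seen here. The sketch below shows one way to opt a single function into AVX2 with run-time detection, roughly analogous to passing -mavx2 for that function only. The names count_whites_generic and count_whites_avx2 are hypothetical, and whether the AVX2 path is actually faster would have to be measured.

```rust
/// The counting loop from the post, marked #[inline] so it can be inlined
/// into the AVX2-enabled wrapper below and compiled with those features.
#[inline]
fn count_whites_generic(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}

/// Hypothetical wrapper: the same loop, compiled with AVX2 enabled.
/// It must only be called on CPUs that support AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn count_whites_avx2(flat_image: &[u32]) -> u32 {
    count_whites_generic(flat_image)
}

/// Picks the AVX2 version when the CPU supports it, else the generic one.
pub fn count_whites(flat_image: &[u32]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at run time just above.
            return unsafe { count_whites_avx2(flat_image) };
        }
    }
    count_whites_generic(flat_image)
}
```

Both paths return the same count; only the instructions the compiler is allowed to use differ.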

alice · September 14, 2019, 1:06pm

Have you tried compiling the C code with clang?

yosefm · September 14, 2019, 1:15pm

Thanks for the suggestion. I tried it now, and it seems to produce assembly code very similar to what Rust produces (both with -O3) and gets the same performance. So it seems the answer is that LLVM has a better auto-vectorizer, at least for this case.

alice · September 14, 2019, 1:16pm

Yeah, this is typically the answer to this kind of question.

scottmcm · September 15, 2019, 12:48am

Auto-vectorization is all done by LLVM, so clang and Rust tend to produce the same thing (modulo bounds checking, wrapping arithmetic, and such).

For fun, note that you can get the good codegen with iterators too:

pub fn count_whites_2(flat_image: &[u32]) -> usize {
    flat_image
        .iter()
        .filter(|pix| **pix == 0xffffffff)
        .count()
}
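
Note that count_whites_2 returns usize, while the original loop returns u32. As a sketch of a variant that keeps the u32 return type, the comparison result can be converted to 0 or 1 and summed. The name count_whites_3 is hypothetical, and no claim is made here that it compiles to the same assembly; that would need to be checked.

```rust
/// The iterator version from the post above.
pub fn count_whites_2(flat_image: &[u32]) -> usize {
    flat_image
        .iter()
        .filter(|pix| **pix == 0xffffffff)
        .count()
}

/// Hypothetical variant: turns each comparison into 0 or 1 and sums the
/// results, which keeps the u32 return type of the original loop.
pub fn count_whites_3(flat_image: &[u32]) -> u32 {
    flat_image
        .iter()
        .map(|&pix| u32::from(pix == u32::MAX))
        .sum()
}
```

For any input with fewer than 2^32 white pixels, both functions return the same number, just with different integer types.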

vitalyd · September 15, 2019, 1:53am

I think the answer, not a very satisfying one, is that LLVM does some things better than GCC and vice versa. You might’ve found a case where it’s the former; I wouldn’t draw any general conclusions though.
