As a newcomer to Rust, I wanted to try out some of the higher-level abstractions to see whether the language fits my daily work as an algorithm developer. I started with something simple: allocate a 100x100 image of u32 pixels, draw a white circle in the middle, and then benchmark counting all the white pixels. This is the Rust code for the counting loop:
pub fn count_whites(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}
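For completeness, here is the same computation written with iterator adapters (count_whites_iter is just my name for it); I would expect LLVM to generate essentially the same vectorized loop for it, but I haven't compared the assembly for this variant:

// Same computation as count_whites, expressed with iterator adapters.
pub fn count_whites_iter(flat_image: &[u32]) -> u32 {
    flat_image
        .iter()
        .filter(|&&pix| pix == 0xffffffff)
        .count() as u32
}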
I found the Rust version to be 2x faster than my C++ implementation:
#include <vector>

unsigned int count_whites(const std::vector<unsigned int>& flat_image) {
    unsigned int count = 0;
    for (const unsigned int& elem : flat_image)
        if (elem == 0xffffffff)
            count++;
    return count;
}
(cargo bench with Criterion for Rust, Google Benchmark for GCC)
Digging through the assembly, it looks like Rust (through LLVM?) uses more xmm registers when unrolling the loop. On the other hand, it's possible that something was optimized away purely because of how the benchmark is written. I used criterion::black_box in several places and still couldn't get the Rust version to slow down.
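For reference, the Rust benchmark looks roughly like this (a minimal sketch rather than my exact harness; the crate name is a placeholder and the circle-drawing setup is simplified away):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
// Placeholder crate name; count_whites is the function shown above.
use count_whites_demo::count_whites;

fn bench_count_whites(c: &mut Criterion) {
    // Simplified setup: the real image has a white circle drawn in the
    // middle rather than being all white.
    let flat_image = vec![0xffffffffu32; 100 * 100];
    c.bench_function("count_whites", |b| {
        b.iter(|| count_whites(black_box(&flat_image)))
    });
}

criterion_group!(benches, bench_count_whites);
criterion_main!(benches);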
So, here's my question: is it a known thing that loop auto-vectorization is better in LLVM than in GCC? Or maybe just for a certain class of cases? Or is there some other reason for this surprisingly nice behaviour? I found some material on auto-vectorization in LLVM, but no comparison with GCC. I would also appreciate homework in the form of links.
Update: I could get the same performance from GCC by adding -mavx2, but that can't be the whole answer, since the Rust version doesn't use ymm registers.
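(For a like-for-like comparison, one could also allow AVX2 on the Rust side and check whether LLVM then switches to ymm registers. A sketch of how that could look without changing global codegen flags; x86-only, untested, and it reuses count_whites from above:)

// Sketch: let LLVM auto-vectorize this copy of the loop with AVX2
// enabled for just this function. x86/x86_64 only.
#[target_feature(enable = "avx2")]
unsafe fn count_whites_avx2(flat_image: &[u32]) -> u32 {
    let mut count = 0;
    for pix in flat_image {
        if *pix == 0xffffffff {
            count += 1;
        }
    }
    count
}

pub fn count_whites_runtime(flat_image: &[u32]) -> u32 {
    if is_x86_feature_detected!("avx2") {
        // Safety: AVX2 support was just checked at runtime.
        unsafe { count_whites_avx2(flat_image) }
    } else {
        count_whites(flat_image)
    }
}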