What's up with the Rust compiler for ARM?

I have repeatedly found that code that is translated from C to Rust will perform pretty much the same on my x86-64 PC. Sometimes even better.

However, on the ARM processor the Rust version performs substantially worse than the C.

Let's take a simple example. The recursive Fibonacci algorithm:

In C:

#include <stdio.h>

int fibonacci(int n) {
    if (n < 2) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main () {
    int n = 24;
    printf("fibo(%d) = %d\n", n, fibonacci(n));
}

In Rust:

fn fibonacci(n: i32) -> i32 {
    match n {
        0 => 0,
        1..=2 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main () {
    let n = 24;
    println!("fibo({}) = {}", n, fibonacci(n));
}

For which I get run times like this on the PC:

$ rustc  -C opt-level=3   fibo.rs
$ time ./fibo
fibo(24) = 46368

real    0m0.021s
user    0m0.000s
sys     0m0.000s
$ gcc -Wall -O3 -o fibo_c  fibo.c
$ time ./fibo_c
fibo(24) = 46368

real    0m0.027s
user    0m0.000s
sys     0m0.016s

Meanwhile on the ARM of the Raspberry Pi 3 I get this:

pi@aalto-1:~ $ rustc  -C opt-level=3   fibo.rs
pi@aalto-1:~ $ time ./fibo
fibo(24) = 46368

real    0m0.011s
user    0m0.011s
sys     0m0.000s
pi@aalto-1:~ $ gcc -Wall -O3 -o fibo_c  fibo.c
pi@aalto-1:~ $ time ./fibo_c
fibo(24) = 46368

real    0m0.008s
user    0m0.000s
sys     0m0.009s

Note how C and Rust change places in run time.

I know that it's rude, crude and a bit silly to measure such short execution times with "time", but I have played with this in other ways and with different programs and seen the same phenomenon.
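A somewhat less crude way to time this is from inside the program, so that process start-up is not counted. The following is only a sketch: the helper name best_of, the run count and the choice of n = 32 are my own assumptions for illustration, not something measured above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn fibonacci(n: i32) -> i32 {
    match n {
        0 => 0,
        1..=2 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

/// Calls `fibonacci(n)` `runs` times and returns the fastest wall-clock time.
/// (Hypothetical helper, introduced only for this sketch.)
fn best_of(runs: u32, n: i32) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box stops the optimizer from folding the call away.
        black_box(fibonacci(black_box(n)));
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    // A larger n than 24 so that each run takes long enough to measure.
    let n = 32;
    println!("fibo({}) best of 10 runs: {:?}", n, best_of(10, n));
}
```

Taking the best of several runs filters out some scheduling noise, but it is still a single micro-benchmark and says little about larger programs.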

The latest case is a much more substantial C to Rust translation, described in the thread "Transcribing C code to Rust". In that case Rust is 30% faster than C on x86-64 but 30% slower on the ARM.

This is a huge difference and I wonder what is going on?

Apart from the obvious suggestion to "benchmark all the things", a good place to start with these kinds of questions is at the code generation level. Some tools like the compiler explorer can be helpful.

I don't have the output from GCC to compare, but here's the AArch64 assembly produced by rustc -C opt-level=3 --target=aarch64-unknown-linux-gnu: https://godbolt.org/z/zAZbju

That is sound advice. It also turns out to be a time-consuming process. If I find the time I'll try to be more rigorous about it. I have at least three such C to Rust translations here to do that with.

Note that although the Raspberry Pi has a 64-bit CPU, its standard Raspbian operating system, a rebuild of Debian, runs as 32-bit.

Godbolt is a wonderful thing. Other than counting instructions generated I would not have much of a handle on what might cause performance differences.

I've updated the workspace with GCC output for comparison: https://godbolt.org/z/5dx4r5

Looks like gcc is using tail call optimization. The second recursive call is being transformed into a simple loop.
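To make that concrete, here is a hand-written Rust sketch of what such a transformation could look like. It is my own reading of the idea, not GCC's actual output: the name fibonacci_loop is made up, and the base case follows the C version (n < 2 returns n). The first recursive call remains, while the second becomes a loop with an accumulator.

```rust
/// Straightforward version, matching the C code's base case.
fn fibonacci(n: i32) -> i32 {
    if n < 2 {
        return n;
    }
    fibonacci(n - 1) + fibonacci(n - 2)
}

/// Same result, but the `fibonacci(n - 2)` call is replaced by a loop.
/// (Hypothetical name, used only for this illustration.)
fn fibonacci_loop(mut n: i32) -> i32 {
    let mut acc = 0;
    while n >= 2 {
        // The first recursive call is kept as a real call.
        acc += fibonacci_loop(n - 1);
        // Instead of calling fibonacci(n - 2), continue the loop with n - 2.
        n -= 2;
    }
    // Base case: for n < 2 the result is n itself.
    acc + n
}

fn main() {
    for n in 0..=24 {
        assert_eq!(fibonacci(n), fibonacci_loop(n));
    }
    println!("fibo(24) = {}", fibonacci_loop(24));
}
```

If rustc does not apply a comparable transformation, the Rust binary makes roughly twice as many real calls, which might account for part of the gap.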

Moving up to a much bigger piece of code. I have a solution to the Project Euler problem #256 https://projecteuler.net/problem=256 written in C and an almost direct translation of it to Rust, down to using global variables like the C does!

The Rust version is in the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9de916ee34d1e1767d1402eeb2f7f385

The C code it came from is here: http://fractal.math.unr.edu/~ejolson/pi/tatami/src/limited.c

When compiling this on the ARM of the Raspberry Pi with comparisons to Clang and GCC I get timings like this:

Compiler    Options                         Seconds
clang                                       5.09
clang       -march=native -mtune=native     4.36
gcc                                         4.57
gcc         -march=native -mtune=native     3.59
rustc                                       6.86

As you can see, the Rust version does very badly against even the slowest alternative.

Conversely when compiled on my x86-64 PC I get this:

Compiler    Options                         Seconds
clang                                       0.90
clang       -march=native -mtune=native     0.73
gcc                                         0.87
gcc         -march=native -mtune=native     0.90
rustc                                       0.62

Here we see the opposite: Rust handily outperforms all the competition, which surprised me when I first saw it.

My original translation has no globals; it wraps everything into structs nicely. Another surprise is that doing that makes almost no difference to performance.

Looks like I just have to accept that LLVM and hence Rust do not perform well on ARM?

How can I get those compiler options to LLVM using rustc or Cargo?
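As far as I can tell, the closest rustc counterpart to -march=native -mtune=native is the codegen option -C target-cpu=native, and Cargo picks up the same option from the RUSTFLAGS environment variable. To check what a given build was actually compiled for, a small program like the sketch below can help. The function name compiled_features and the three features listed are my own picks for illustration.

```rust
use std::env::consts::ARCH;

/// Reports whether a few target features were enabled at compile time.
/// (Hypothetical helper; the feature list is an arbitrary sample.)
fn compiled_features() -> Vec<(&'static str, bool)> {
    vec![
        ("neon", cfg!(target_feature = "neon")),
        ("sse2", cfg!(target_feature = "sse2")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}

fn main() {
    // On 32-bit Raspbian this is expected to report "arm" and 32 bits,
    // even though the CPU itself is 64-bit capable.
    println!("arch: {}, pointer width: {} bits", ARCH, usize::BITS);
    for (name, enabled) in compiled_features() {
        println!("{name}: {enabled}");
    }
}
```

Building it once with plain -C opt-level=3 and once with -C target-cpu=native added shows whether the extra option changes the enabled feature set on a given machine.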

All timings were made very crudely using "time", but I have run these often enough to claim they are representative. The differences we are looking at are not small!