What's up with the Rust compiler for ARM?

I have repeatedly found that code that is translated from C to Rust will perform pretty much the same on my x86-64 PC. Sometimes even better.

However, on the ARM processor the Rust version performs substantially worse than the C.

Let's take a simple example. The recursive Fibonacci algorithm:

In C:

#include <stdio.h>

int fibonacci(int n) {
    if (n < 2) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main () {
    int n = 24;
    printf("fibo(%d) = %d\n", n, fibonacci(n));
}

In Rust:

fn fibonacci(n: i32) -> i32 {
    match n {
        0 => 0,
        1..=2 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main () {
    let n = 24;
    println!("fibo({}) = {}", n, fibonacci(n));
}

For which I get run times like this on the PC:

$ rustc  -C opt-level=3   fibo.rs
$ time ./fibo
fibo(24) = 46368

real    0m0.021s
user    0m0.000s
sys     0m0.000s
$ gcc -Wall -O3 -o fibo_c  fibo.c
$ time ./fibo_c
fibo(24) = 46368

real    0m0.027s
user    0m0.000s
sys     0m0.016s

Meanwhile on the ARM of the Raspberry Pi 3 I get this:

pi@aalto-1:~ $ rustc  -C opt-level=3   fibo.rs
pi@aalto-1:~ $ time ./fibo
fibo(24) = 46368

real    0m0.011s
user    0m0.011s
sys     0m0.000s
$ gcc -Wall -O3 -o fibo_c  fibo.c
pi@aalto-1:~ $ time ./fibo_c
fibo(24) = 46368

real    0m0.008s
user    0m0.000s
sys     0m0.009s

Note how C and Rust change places in run time.

I know that it's rude, crude and a bit silly to measure such short execution times with "time", but I have played with this in other ways and with different programs and seen the same phenomenon.
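One way to take process start-up out of such a short measurement is to time the call inside the program and repeat it a few times. The following is only a minimal sketch of that idea, not the harness used for the numbers above: the `best_of` helper and the choice of 10 runs are hypothetical, and `std::hint::black_box` is assumed to be available (it needs a reasonably recent stable toolchain).

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn fibonacci(n: i32) -> i32 {
    match n {
        0 => 0,
        1..=2 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

/// Hypothetical helper: calls `fibonacci(n)` `runs` times and returns
/// the last result together with the fastest single run.
fn best_of(runs: u32, n: i32) -> (i32, Duration) {
    let mut best = Duration::MAX;
    let mut result = 0;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box discourages the optimizer from folding the call
        // into a compile-time constant.
        result = black_box(fibonacci(black_box(n)));
        best = best.min(start.elapsed());
    }
    (result, best)
}

fn main() {
    let (value, best) = best_of(10, 24);
    println!("fibo(24) = {}, best of 10 runs: {:?}", value, best);
}
```

Taking the fastest run rather than the mean is a design choice that filters out scheduler noise; for anything more serious, a dedicated benchmarking harness would be the better tool.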

The latest case is a much more substantial C to Rust translation, described in the thread "Transcribing C code to Rust". In that case Rust is 30% faster than C on x86-64 but 30% slower on the ARM.

This is a huge difference and I wonder what is going on?


Apart from the obvious suggestion to "benchmark all the things", a good place to start with these kinds of questions is at the code generation level. Some tools like the compiler explorer can be helpful.

I don't have the output from GCC to compare, but here's the AArch64 assembly produced by rustc -C opt-level=3 --target=aarch64-unknown-linux-gnu: https://godbolt.org/z/zAZbju

"Benchmark all the things" is sound advice. It also turns out to be a time-consuming process. If I find the time I'll try to be more rigorous about it. I have at least three such C to Rust translations here to do that with.

Note that although the Raspberry Pi has a 64 bit CPU, its standard Raspbian operating system, a rebuild of Debian, runs as 32 bit.
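Since the 32 bit versus 64 bit distinction matters when comparing against AArch64 assembly, it can help to confirm what a given binary was actually built for. This is just a small sketch; the `target_summary` function name is made up for illustration, while `std::env::consts::ARCH` and `usize::BITS` are standard library items.

```rust
/// Hypothetical helper: describes the target this binary was compiled for.
fn target_summary() -> String {
    format!(
        "arch: {}, pointer width: {} bits",
        std::env::consts::ARCH, // architecture name of the compilation target
        usize::BITS             // pointer width of the compilation target
    )
}

fn main() {
    println!("{}", target_summary());
}
```

Built natively on a 32 bit Raspbian install, this would be expected to report a 32 bit pointer width, even though the CPU itself is 64 bit capable.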

Godbolt is a wonderful thing. Other than counting instructions generated I would not have much of a handle on what might cause performance differences.

I've updated the workspace with GCC output for comparison: https://godbolt.org/z/5dx4r5

Looks like gcc is using tail call optimization. The second recursive call is being transformed into a simple loop.
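To make that transformation concrete, here is a hand-written sketch of what "the second recursive call becomes a loop" looks like at the source level. It is an illustration of the shape only, not a decompilation of what GCC emits; `fibonacci_looped` is a name invented for this example, and the recursive version follows the C code from the first post.

```rust
/// Doubly recursive form, mirroring the C version in the first post.
fn fibonacci(n: i32) -> i32 {
    if n < 2 {
        return n;
    }
    fibonacci(n - 1) + fibonacci(n - 2)
}

/// Sketch of the transformed shape: the first call stays recursive,
/// while the second call is replaced by further loop iterations.
fn fibonacci_looped(mut n: i32) -> i32 {
    let mut acc = 0;
    while n >= 2 {
        acc += fibonacci_looped(n - 1); // the remaining recursive call
        n -= 2;                         // stands in for fibonacci(n - 2)
    }
    acc + n // base case: n is now below 2
}

fn main() {
    // Both forms should agree on every input we try.
    for n in 0..=24 {
        assert_eq!(fibonacci(n), fibonacci_looped(n));
    }
    println!("fibo(24) = {}", fibonacci_looped(24));
}
```

The looped form makes roughly half as many calls, which is one plausible reason for a difference on such a call-heavy benchmark.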


Moving up to a much bigger piece of code. I have a solution to the Project Euler problem #256 https://projecteuler.net/problem=256 written in C and an almost direct translation of it to Rust, down to using global variables like the C does!

The Rust version is in the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9de916ee34d1e1767d1402eeb2f7f385

The C code it came from is here: http://fractal.math.unr.edu/~ejolson/pi/tatami/src/limited.c

When compiling this on the ARM of the Raspberry Pi with comparisons to Clang and GCC I get timings like this:

Compiler    Options                         Seconds
clang                                       5.09
clang       -march=native -mtune=native     4.36
gcc                                         4.57
gcc         -march=native -mtune=native     3.59
rustc                                       6.86

As you see the Rust version does very badly against even the slowest alternative.

Conversely when compiled on my x86-64 PC I get this:

Compiler    Options                         Seconds
clang                                       0.90
clang       -march=native -mtune=native     0.73
gcc                                         0.87
gcc         -march=native -mtune=native     0.90
rustc                                       0.62

Here we see the opposite: Rust handily outperforms all the competition, which surprised me when I first saw it.

My original translation has no globals; it wraps everything neatly into structs. Another surprise is that doing so makes almost no difference to performance.

Looks like I just have to accept that LLVM and hence Rust do not perform well on ARM?

How can I get those compiler options to LLVM using rustc or Cargo?

All timings were made very crudely using "time", but I have run these often enough to claim they are representative. The differences we are looking at are not small!
