Investigation: Rust emits ~38x the assembly compared to C

Thought you guys might find this interesting. If not, my apologies :slight_smile:

While playing around with Compiler Explorer, I investigated how the generated assembly differs between GCC and rustc. Going in, I expected the Rust output to be bigger but still comparable. To test this, I wrote a simple prime generator in both C and Rust, essentially identical except for using calloc in C and Vec::with_capacity() in Rust. As expected, the C version compiled down to 97 lines of assembly, but what caught me off guard was Rust's 3674 lines. Let's investigate.

Changing the Vec<u64> to a [0u64; 10_000_000] only got rid of 513 lines, bringing the total down to 3161. Hmm. Let's dig deeper. Could it be the panic unwinding? Changing the panic strategy to abort ("-C panic=abort") gets rid of 323 additional lines, but we're still at 2838 lines compared to C's 97. Turning to Google, I found others saying that LLVM's loop unrolling is much more aggressive than GCC's. I therefore replaced GCC with clang, thinking that the C line count would at least increase slightly. Nope. I instead found myself staring at 87 lines for C.

EDIT:
Link: Compiler Explorer
Speed-wise, both are very performant, with Rust being about 1.1x the speed of C, so it seems that a lot of the assembly isn't used? Could it be out-of-bounds checking and the like?

I would investigate further, but right now I'm out of time. I'm sure I'm missing something trivial. Maybe the #[no_mangle] attribute prevents optimizations that GCC/clang apply? If anyone knows, do let me know! Cheers :slight_smile:

You should link to your code on compiler explorer. Details like compiler flags or how you call functions might matter.


Mandatory question, did you compile with -Copt-level=3? Also, you can share your code from the compiler explorer, which would help us a lot in understanding what is going on. There is a button "Share" in the top right corner.


Fixed, sorry about that. Shouldn't have posted in a rush(t)! :stuck_out_tongue:
@jofas no flags added in compiler explorer.

Well there's a huge part of your problem.

Your godbolt has 2838 lines; add -O and it's down to 673.


That sounds more reasonable. Is this flag handled by cargo when building for release?

Yes. If you want to inspect what --release does in more detail you can run cargo with -v (repeat for more verbose output): cargo -vvv build --release.

(Specifically, it passes -C opt-level=3 to rustc).


I don't see Vec::with_capacity in the Rust version on godbolt.

Other differences are:

  • the Rust code copies all CLI arguments and verifies that they are UTF-8;
  • the Rust code doesn't have UB if the CLI argument is not a representable number;
  • the Rust code includes the code used for parsing and printing (C dynamically links to libc, so the code is not shown on Godbolt);
  • the Rust code is performing the square root going through floats (there's u64::isqrt but it's unstable).

If you enable optimizations with -O, remove the initial step for obtaining primes_max and replace the square root with the unstable isqrt then the Rust version goes down to 128 lines. https://godbolt.org/z/hnxn6f4sn
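
The last bullet is easy to check in isolation. The sketch below compares the float round trip with an integer-only square root. `sqrt_via_f64` and `isqrt_newton` are names made up for this illustration (the latter is a hand-rolled Newton iteration, not the standard library's `isqrt`), and the example only shows that the two approaches are not interchangeable for large `u64` values; it says nothing about which one generates less assembly.

```rust
/// Square root by converting to f64 and back, as the original program does.
fn sqrt_via_f64(n: u64) -> u64 {
    (n as f64).sqrt() as u64
}

/// Integer-only floor square root using Newton's method (hypothetical helper).
fn isqrt_newton(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    // Start from a power of two that is guaranteed to be >= sqrt(n).
    let bits = 64 - n.leading_zeros();
    let mut x = 1u64 << ((bits + 1) / 2);
    loop {
        // Newton step; the sequence decreases until it reaches floor(sqrt(n)).
        let y = (x + n / x) / 2;
        if y >= x {
            return x;
        }
        x = y;
    }
}

fn main() {
    for n in [99u64, 100, 10_000_000, u64::MAX] {
        println!(
            "n = {}: via f64 = {}, integer = {}",
            n,
            sqrt_via_f64(n),
            isqrt_newton(n)
        );
    }
}
```

For `u64::MAX` the float version returns 4294967296, whose square no longer fits in a `u64`, while the integer version returns 4294967295.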


For more detail, cargo build --release activates Cargo’s release profile, one of the four predefined profiles: dev, release, test, and bench. The Cargo book will tell you what exact configuration each profile maps to by default. You can customize the profiles and also define your own.
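
One default that differs between the dev and release profiles is `debug-assertions`, and a program can observe it through `cfg!(debug_assertions)`. A small sketch follows; `build_kind` is a name invented here, and the mapping in the comments assumes the profiles have not been customized.

```rust
/// Reports which kind of build this looks like, judged only by whether
/// debug assertions were compiled in (hypothetical helper).
fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        // Default for `cargo build` (dev profile) and for rustc without -O.
        "debug assertions enabled (unoptimized defaults)"
    } else {
        // Default for `cargo build --release` and for rustc with -O.
        "debug assertions disabled (optimized defaults)"
    }
}

fn main() {
    println!("{}", build_kind());
}
```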


This is not a meaningful comparison. As @SkiFire13 showed, you are also comparing the code that collects and parses the arguments, and there is still more unfairness between the two pieces of code:

1.

    if std::env::args().len() != 2 {
        panic!("Usage: primes <int>");
    }

vs

  if(argc!=2) {
    printf("specify max count");
    exit(EXIT_FAILURE);
  }

more fair:

    if std::env::args().len() != 2 {
        println!("specify max count");
        return std::process::ExitCode::FAILURE;
    }

2.

let primes_max = usize::from_str_radix(&std::env::args().collect::<Vec<String>>()[1], 10).expect("Usage: primes <int>");

vs

    int primes_max = atoi(argv[1]);

more fair (int and usize are different, but this is a problem on the C side):

    let primes_max = str::parse::<usize>(&std::env::args().nth(1).unwrap()).unwrap();

3.

let mut found_primes = [0u64; 10000000];

vs

int* found_primes = calloc(primes_max, sizeof(int));
// free??

In the C code, I didn't even see the free ...

more fair (I just think it matches the C version better):

let mut found_primes = Vec::with_capacity(primes_max as usize);
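
One detail worth keeping in mind with this substitution: `calloc` hands back `primes_max` zeroed elements, whereas `Vec::with_capacity` only reserves space and leaves the vector empty. The sketch below contrasts it with `vec![0; n]`; the function names are made up for illustration.

```rust
/// Reserves room for `n` elements but stores none (length stays 0).
fn reserved(n: usize) -> Vec<u64> {
    Vec::with_capacity(n)
}

/// Allocates `n` zeroed elements, closer to what `calloc(n, size)` provides.
fn zeroed(n: usize) -> Vec<u64> {
    vec![0u64; n]
}

fn main() {
    let a = reserved(8);
    let b = zeroed(8);
    println!("reserved: len = {}, capacity >= 8: {}", a.len(), a.capacity() >= 8);
    println!("zeroed:   len = {}, first = {:?}", b.len(), b.first());
}
```

With the reserved vector, elements have to be added with `push`; indexing `a[0]` would panic because the length is still 0.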

However, you would never write Rust in the same way as you write C, would you?
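
As one possible answer to that question, the argument handling from items 1 and 2 can be written with `Result` instead of panics. This is a sketch only; `parse_max` and the error strings are invented for illustration and are not taken from the original program.

```rust
/// Extracts the single expected argument and parses it as usize.
/// Takes any iterator of Strings so it can be exercised without real
/// command-line arguments (hypothetical helper).
fn parse_max(mut args: impl Iterator<Item = String>) -> Result<usize, String> {
    let _program = args.next(); // skip the program name
    let arg = args.next().ok_or("specify max count")?;
    if args.next().is_some() {
        return Err("specify max count".to_string());
    }
    // Unlike atoi, a non-numeric or out-of-range value is reported as an error.
    arg.parse::<usize>()
        .map_err(|e| format!("invalid count {:?}: {}", arg, e))
}

fn main() {
    // Fixed sample inputs stand in for real CLI arguments here.
    let samples = vec![
        vec!["primes", "100"],
        vec!["primes"],
        vec!["primes", "abc"],
    ];
    for sample in samples {
        let result = parse_max(sample.into_iter().map(String::from));
        println!("{:?}", result);
    }
}
```

In a real program, `parse_max(std::env::args())` could be called from `main`, with the `Err` case printed and mapped to `std::process::ExitCode::FAILURE`.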


Indeed, there are multiple things I completely failed to take into account. As for the missing free() and Vec::with_capacity(), those were present when I began testing but were subsequently removed as I tried to narrow down the root cause. Also, I had no idea about the str::parse function! Thanks, you learn something new every day.

On another note, I think it's clear that I need to deepen my knowledge of Rust - just performing array[1] in C has many alternatives in Rust. [1], .nth(1), etc! Are there any good resources out there? I have already read the official book.

TL;DR: maybe actually do your research before posting (unlike me :laughing:). I still find it fascinating how Rust and C provide such vastly different results for a very naive beginner approach by a hobbyist like me! Thanks for chiming in everybody, the more you know.


(This is not a disagreement, just a tangent.)

It's totally reasonable, for a program that does one thing and exits, to skip all the work of freeing heap allocations just before the process is destroyed anyway by the operating system. If you want to do that in Rust on purpose, you can explicitly forget things so they are not dropped:

fn main() {
    let mut found_primes = Vec::with_capacity(...);
    ...
    std::mem::forget(found_primes);
}

You can also insert an early std::process::exit() to skip all drops that returning from main() would normally do. Neither of these things will actually eliminate the deallocation code, because it is still needed for cleanup during unwinding, but if you also disable unwinding with -C panic=abort, then the deallocation will be gone.
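
The effect of `std::mem::forget` on destructors can be observed directly. In this sketch, `Tracked` and `count_drops` are hypothetical names; the type just counts how often its `Drop` implementation runs.

```rust
use std::cell::Cell;

/// A value whose destructor increments a shared counter (hypothetical type).
struct Tracked<'a> {
    drops: &'a Cell<u32>,
}

impl<'a> Drop for Tracked<'a> {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Creates one `Tracked` value, then either forgets it or drops it normally,
/// and returns how many times the destructor ran.
fn count_drops(forget: bool) -> u32 {
    let drops = Cell::new(0);
    let value = Tracked { drops: &drops };
    if forget {
        std::mem::forget(value); // destructor is skipped
    } else {
        drop(value); // destructor runs here
    }
    drops.get()
}

fn main() {
    println!("dropped normally: {} destructor call(s)", count_drops(false));
    println!("forgotten:        {} destructor call(s)", count_drops(true));
}
```

This only shows that the destructor call is skipped; whether the deallocation code disappears from the assembly still depends on the unwinding setting, as described above.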


I don't think there's any problem with your code; it would be too boring if there was only one answer. I just found on godbolt that these modifications generate less assembly.

.nth(1) is commonly referred to as functional style, which is one of Rust's features.

One of the fastest ways to learn Rust is by contributing to open-source projects. You can try replacing some of the tools you use daily with Rust alternatives, such as git (jj, git-branchless), bazel (buck2), grep (ripgrep)... Discover their shortcomings and make improvements. You will learn a lot of new things in the process.

You can also write small tools in Rust for your own use. For example, I once wrote an updater tool to update nightly versions of Zig and Blender on Windows.
