Why is '*-linux-musl' much slower than '*-linux-gnu'?

I did some performance testing with my Rust application (on x86-64 Linux) and found that the x86_64-unknown-linux-musl target reduces my application's throughput by a significant 25% compared to building the exact same code for x86_64-unknown-linux-gnu:

$ cargo run --release --target=x86_64-unknown-linux-gnu -- -T
Median execution time: 18.1 seconds (107.90 MiB/s)
$ cargo run --release --target=x86_64-unknown-linux-musl -- -T
Median execution time: 23.9 seconds (81.69 MiB/s)

(This was tested with Rust 1.91.1 on Ubuntu 25.10)

It is important to note that my application is definitely CPU-bound. It does not perform much I/O; in fact, in the self-test mode that I use for performance testing, it performs no I/O at all. I also don't do any dynamic memory (de)allocation: all buffers are allocated once, at the start, and de-allocated only at the end.

So how can merely linking a different C library make such a big difference in performance? All the actual computations, which form the throughput bottleneck, are done in pure Rust code!


I found this:

However, using MiMalloc did not make any difference with my application.

(Not very surprising, since, as said before, my application does not do a lot of allocations.)


Any more ideas, what might be the culprit here?

Of course, I could simply go with x86_64-unknown-linux-gnu, but I much prefer using x86_64-unknown-linux-musl for the release binaries :thinking:

Best regards.

1 Like

You need to profile it, of course, but I wouldn't be surprised to find out that it's the speed of memcpy that's affecting it.

The quality of the memcpy implementation is very important for traditional C/C++/Fortran CPU benchmarks; it can easily change your code's speed by 10% or so, and I wouldn't be surprised to find out that Rust is even more affected.

Try building with -C target-cpu=native or equivalent in RUSTFLAGS.

Dynamically linked binaries can use GNU loader magic to choose versions of libc functions optimized for the actual CPU, while in a statically linked binary built for 2003-era CPUs (Rust's default x86-64 baseline) you may get slower code.
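For example (a sketch, with hypothetical values; adjust targets to taste), the flag can be pinned per target in `.cargo/config.toml` instead of via the environment:

```toml
# .cargo/config.toml -- sketch: opt a target into the build machine's CPU features.
# `target-cpu=native` is only safe if the binary runs on the build machine;
# for release binaries shipped elsewhere, a fixed baseline such as
# `x86-64-v3` is the safer choice.
[target.x86_64-unknown-linux-musl]
rustflags = ["-Ctarget-cpu=native"]
```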

6 Likes

The default allocator for musl is significantly worse than the default allocator in glibc. Try using mimalloc, scudo, or another high-performance allocator.

Past experience with this question has indicated it's always the allocator.

1 Like

I wanted to try on my machine[1], but there's something strange going on...

$ cargo run --release --target=x86_64-unknown-linux-gnu -- -T
Median execution time: 32.5 seconds (60.05 MiB/s)

$ RUSTFLAGS="-C target-cpu=native" cargo run --release --target=x86_64-unknown-linux-gnu -- -T
Median execution time: 27.1 seconds (72.11 MiB/s)

$ cargo run --release --target=x86_64-unknown-linux-musl -- -T
Segmentation fault         (core dumped)

$ RUSTFLAGS="-C target-cpu=native" cargo run --release --target=x86_64-unknown-linux-musl -- -T
Segmentation fault         (core dumped)

I just cloned your repo at commit e10f907f9afdcb4c1bff3efd852be7b3cd29642b.


  1. Intel Core Ultra 7 265H, Arch Linux, Rust 1.91.1, musl 1.2.5, glibc 2.42 ↩︎

I was under the impression that modern compilers will "inline" memcpy and the like. Specifically, with Rustc, wouldn't LLVM generate the code for these kinds of operations?

Anyhow, I tried using memx-cdy, but it didn't make any difference at all.

As said in my initial post, I already tried MiMalloc and it did not make any difference.

I would be really surprised if the allocator made any difference, since I'm not doing any dynamic memory allocations. Everything is allocated once, at the start.

2 Likes

That is strange :thinking:

I have unit tests, and my GitHub CI will run them on Linux as well as on Windows and macOS.

All tests succeed with the latest revision.

They inline small memcpys and automatically call the out-of-line function for large objects.

Yes, if that can be done in less than 8 assembler register moves. For anything larger it would call memcpy. And an optimized memcpy on contemporary x86 is huge, to accommodate the use of various extensions; inlining it would be a pessimisation.

2 Likes

AFAIK, the "GNU loader magic" no longer works with static linking.

However, adding -Ctarget-feature=+crt-static did not change anything about "✶-linux-gnu" being significantly faster than "✶-linux-musl" :thinking:

(I verified with ldd that the "✶-linux-gnu" build was indeed statically linked)

Thanks for pointing that out!

So, I did some more systematic testing with various -Ctarget-cpu options:

It turns out that certain -Ctarget-cpu settings dramatically improve the throughput! And, with those settings, the differences between "✶-linux-gnu" and "✶-linux-musl" mostly vanish :open_mouth:

Interestingly, "skylake" seems to give the best results, even though I have an AMD CPU (Zen 3).

3 Likes

I do not see any reason why there should be a "big" memcpy on my hot code path, though :confused:

(All computations are done on 128-bit data blocks)

As already mentioned: profile! Use perf and look at a flame graph of each. I even remember seeing differential flamegraphs.

It is pointless to continue speculating instead of just doing the profiling.

1 Like

This solves the segmentation fault for me:

diff --git a/app/.cargo/config.toml b/app/.cargo/config.toml
index 03d76f0..00afd7c 100644
--- a/app/.cargo/config.toml
+++ b/app/.cargo/config.toml
@@ -8,8 +8,7 @@ rustflags = [ "-Dwarnings" ]
 rustflags = [ "-Dwarnings" ]
 
 [target.x86_64-unknown-linux-musl]
-linker = "musl-gcc"
-rustflags = [ "-Dwarnings", "-Ctarget-feature=+crt-static" ]
+rustflags = [ "-Dwarnings" ]
 
 [target.i686-unknown-linux-musl]
 linker = "musl-gcc"

Well, this has mostly been resolved by using the proper -Ctarget-cpu option.

Anyhow, today I did some profiling with the "default" case, i.e. without -Ctarget-cpu option.

[flamegraph-gnu image]

[flamegraph-musl image]

Now, what do we learn from this? :thinking:

...besides confirming that different memcpy() implementations are indeed in use.

I'm currently on my phone, and I don't seem to be able to zoom in enough to see any of the text in the flamegraphs.

One approach for comparing two runs is Differential Flame Graphs

1 Like

Flame graphs only provide function-level granularity. You can drill down to instruction-level details with perf report or KDAB Hotspot.

2 Likes