Optimizing Rust Binaries: Observation of Musl versus Glibc and Jemalloc versus System Alloc


#1

I’m just chiming in to report an observation from compiling Rust applications with both the glibc (default) and musl targets, as well as with jemalloc versus the system allocator. In all the scenarios I’ve tried, musl-compiled Rust binaries are significantly faster than their glibc counterparts, so it’s worth investigating this in your own projects. I’ve seen speedups ranging from 50% to 1000%.

I managed to drastically boost the performance of my Parallel application by switching to musl and ditching jemalloc for the system allocator, cutting memory consumption and CPU cycles roughly in half. I often find jemalloc to be more of a nuisance to my performance in general, but it takes a nightly compiler to get rid of it, which is a bit silly. Anyway, here are some interesting performance metrics from my Linux box for a real-world application, Parallel. Benchmarks are ordered from slowest to fastest.

seq 1 10000 | perf stat target/release/parallel echo > /dev/null
seq 1 10000 | perf stat target/x86_64-unknown-linux-musl/release/parallel echo > /dev/null

Parallel with Jemalloc + Glibc (3368KB Max RSS)

      10957.993473      task-clock:u (msec)       #    1.673 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         1,487,686      page-faults:u             #    0.136 M/sec                  
     3,953,268,885      cycles:u                  #    0.361 GHz                      (85.95%)
                 0      stalled-cycles-frontend:u                                     (85.68%)
                 0      stalled-cycles-backend:u  #    0.00% backend cycles idle      (85.21%)
     1,451,526,963      instructions:u            #    0.37  insn per cycle                                              (84.79%)
       325,691,822      branches:u                #   29.722 M/sec                    (84.29%)
        27,016,217      branch-misses:u           #    8.30% of all branches          (84.67%)

       6.550195345 seconds time elapsed

Parallel with System Alloc + Glibc (3228KB Max RSS)

       8813.700737      task-clock:u (msec)       #    1.604 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         1,206,385      page-faults:u             #    0.137 M/sec                  
     3,227,954,900      cycles:u                  #    0.366 GHz                      (86.91%)
                 0      stalled-cycles-frontend:u                                     (86.59%)
                 0      stalled-cycles-backend:u  #    0.00% backend cycles idle      (85.11%)
     1,176,187,072      instructions:u            #    0.36  insn per cycle                                              (87.05%)
       257,953,651      branches:u                #   29.267 M/sec                    (88.35%)
        25,232,814      branch-misses:u           #    9.78% of all branches          (86.67%)

       5.494453770 seconds time elapsed

Parallel with Jemalloc + Musl (1768KB Max RSS)

       7724.722519      task-clock:u (msec)       #    1.594 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         1,210,474      page-faults:u             #    0.157 M/sec                  
     3,353,744,654      cycles:u                  #    0.434 GHz                      (88.37%)
                 0      stalled-cycles-frontend:u                                     (88.27%)
                 0      stalled-cycles-backend:u  #    0.00% backend cycles idle      (87.71%)
     1,323,967,181      instructions:u            #    0.39  insn per cycle                                              (85.28%)
       281,211,163      branches:u                #   36.404 M/sec                    (85.80%)
        24,024,922      branch-misses:u           #    8.54% of all branches          (87.68%)

       4.844914953 seconds time elapsed

Parallel with System Alloc + Musl (1768KB Max RSS)

       4757.338202      task-clock:u (msec)       #    1.329 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
           757,191      page-faults:u             #    0.159 M/sec                  
     2,306,342,779      cycles:u                  #    0.485 GHz                      (90.86%)
                 0      stalled-cycles-frontend:u                                     (92.85%)
                 0      stalled-cycles-backend:u  #    0.00% backend cycles idle      (89.90%)
     1,150,291,731      instructions:u            #    0.50  insn per cycle                                              (91.26%)
       233,091,489      branches:u                #   48.996 M/sec                    (89.56%)
        20,072,159      branch-misses:u           #    8.61% of all branches          (89.71%)

       3.580532601 seconds time elapsed
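
To put the four elapsed times above side by side, here is a minimal sketch that computes each run's speedup relative to the slowest configuration. The function and constant names are made up for illustration; only the timings come from the perf stat output above.

```rust
/// Ratio of baseline wall-clock time to candidate wall-clock time.
/// A value of 2.0 means the candidate finished in half the time.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// "seconds time elapsed" copied from the four perf stat runs above.
const RUNS: [(&str, f64); 4] = [
    ("jemalloc + glibc", 6.550195345),
    ("system alloc + glibc", 5.494453770),
    ("jemalloc + musl", 4.844914953),
    ("system alloc + musl", 3.580532601),
];

/// One summary line per run, relative to the first (slowest) entry.
fn summarize() -> Vec<String> {
    let baseline = RUNS[0].1;
    RUNS.iter()
        .map(|(name, secs)| format!("{name}: {:.2}x", speedup(baseline, *secs)))
        .collect()
}
```

Printing each line of summarize() from main gives the comparison at a glance. For this particular workload the fastest configuration comes out at roughly 1.83x the slowest in wall-clock terms.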

I might add more benchmarks from other applications, but I’d be interested in seeing performance comparisons that anyone else might have with their applications.


#2

What are the causes of the performance boost of the musl target compared to the glibc one?


#3

What are the causes of the performance boost of the musl target compared to the glibc one?

From what I gather, the GNU implementation of the C standard library is a complex beast of non-standard GNU extensions and inefficient/complex implementations. The MIT-licensed musl implementation, on the other hand, aims for speed, simplicity, and strict adherence to the C standard. No GNU-specific extensions allowed. So put simply, musl uses much less code than glibc to do the same task, but may not be compatible with some GNU software that relies on quirky GNU behavior and extensions.


#4

There’s been some discussion about switching back to the system allocator by default, but nothing definitive. Numbers would help inform this discussion, for sure.

Yeah, it’s unfortunate, but there’s work to be done to stabilize custom allocator support before it can be fixed.


#5

Someone on reddit asked for a ripgrep data point, so I did a comparison across the full matrix for ripgrep using its benchmark suite, but I couldn’t find any meaningful differences. The benchmark command I ran was roughly:

cd clones/ripgrep/benchsuite
for libc in glibc musl; do
    for alloc in jemalloc system; do
        outdir=./runs/2016-12-24-archlinux-$libc-$alloc
        mkdir $outdir
        ./benchsuite --disabled ag,ucg,pt,sift,git,grep --dir /data/benchsuite/ --raw $outdir/raw.csv > $outdir/summary
    done
done

For glibc, the compile command I used was:

$ RUSTFLAGS="-C target-cpu=native" cargo build --release --features 'simd-accel avx-accel'

For musl:

$ RUSTFLAGS="-C target-cpu=native" cargo build --release --features 'simd-accel avx-accel' --target x86_64-unknown-linux-musl

Rust version:

$ rustc --version
rustc 1.15.0-nightly (71c06a56a 2016-12-18)

For the system allocator, I added #![feature(alloc_system)] and extern crate alloc_system; to my src/main.rs.
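
For readers on a newer toolchain: the nightly-only alloc_system feature was the mechanism available at the time of this post. On stable Rust 1.28 and later, the same opt-in can be sketched with the #[global_allocator] attribute and std::alloc::System. This is not what was used for the runs in this post, and the helper function below is only a made-up allocation-heavy workload to exercise the allocator.

```rust
use std::alloc::System;

// Route every heap allocation in this binary through the platform's
// malloc/free (glibc or musl, depending on the target).
#[global_allocator]
static GLOBAL: System = System;

/// Hypothetical workload: one small String allocation per number.
fn build_numbers(n: usize) -> Vec<String> {
    (1..=n).map(|i| i.to_string()).collect()
}
```

Calling build_numbers(10_000) from main and timing the binary under each target is one way to get a first rough data point for your own machine.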

The results are in the corresponding directories here: https://github.com/BurntSushi/ripgrep/tree/master/benchsuite/runs. Just about every single benchmark is within one standard deviation of each other, and I didn’t see anything close to the perf differences witnessed in the OP.


With that out of the way, this is actually exactly what I expected to see. On average, ripgrep should do almost zero allocations for each file that it searches. Therefore, I wouldn’t really call ripgrep an “allocation heavy” workload, so it probably isn’t a good benchmark to use for this particular test.
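
One way to check whether a workload is allocation heavy before benchmarking allocators is to count allocations directly. The following is a rough sketch under stated assumptions: it uses the stable std::alloc::GlobalAlloc trait (newer than the toolchain in this post), it is not code from ripgrep, and the two helper functions are hypothetical stand-ins for a "fresh buffer per item" loop and a "reuse one buffer" loop.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of allocation requests seen so far, process-wide.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Wrapper that counts allocation requests and forwards to the system allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns how many allocations happened in the meantime.
/// Allocations made by other threads during the call are counted too.
fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

/// Allocates a new String for every input.
fn total_len_fresh(inputs: &[&str]) -> usize {
    inputs
        .iter()
        .map(|s| {
            let owned = String::from(*s);
            black_box(&owned); // keep the optimizer from eliding the allocation
            owned.len()
        })
        .sum()
}

/// Reuses a single buffer, so it only allocates when the buffer must grow.
fn total_len_reused(inputs: &[&str]) -> usize {
    let mut buf = String::new();
    let mut total = 0;
    for s in inputs {
        buf.clear();
        buf.push_str(s);
        black_box(&buf);
        total += buf.len();
    }
    total
}
```

Wrapping each helper in allocations_during shows the difference in allocation counts. A program that behaves like the second helper gives the allocator very little to do, which fits with seeing no difference between jemalloc and the system allocator.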

(Actually, I lied when I said I got what I expected. The most interesting thing I learned from this exercise had nothing to do with allocators. What I learned was that MUSL’s memchr is seemingly competitive with glibc’s memchr. Cool.)


#6

I wrote Ayzim, a drop in Rust rewrite of the Emscripten asm.js optimizer (written in C++).

I have two files I use to benchmark its performance - sqlite.js (14MB) and unity.js (121MB). For this experiment I used musl with and without jemalloc (I never use glibc binaries so they aren’t of interest to me) on each test file, repeating three times. My command was /usr/bin/time ./ayzim-opt slow/$file asm eliminate simplifyExpressions simplifyIfs registerizeHarder minifyLocals asmLastOpts last >/dev/null (if you’re interested, those arguments are the optimization passes that get run with emcc -O3).

sqlite.js with system malloc:

11.93user 0.10system 0:12.03elapsed 99%CPU (0avgtext+0avgdata 420256maxresident)k
0inputs+0outputs (0major+107500minor)pagefaults 0swaps
11.77user 0.10system 0:11.88elapsed 99%CPU (0avgtext+0avgdata 420348maxresident)k
0inputs+0outputs (0major+107646minor)pagefaults 0swaps
11.89user 0.12system 0:12.02elapsed 99%CPU (0avgtext+0avgdata 420364maxresident)k
0inputs+0outputs (0major+107208minor)pagefaults 0swaps

sqlite.js with jemalloc:

9.36user 0.14system 0:09.50elapsed 99%CPU (0avgtext+0avgdata 406640maxresident)k
0inputs+0outputs (0major+76237minor)pagefaults 0swaps
9.24user 0.14system 0:09.38elapsed 99%CPU (0avgtext+0avgdata 406640maxresident)k
0inputs+0outputs (0major+78849minor)pagefaults 0swaps
9.23user 0.12system 0:09.35elapsed 99%CPU (0avgtext+0avgdata 406684maxresident)k
0inputs+0outputs (0major+73624minor)pagefaults 0swaps

unity.js with system malloc:

59.52user 0.42system 0:59.94elapsed 99%CPU (0avgtext+0avgdata 1558608maxresident)k
0inputs+0outputs (0major+371050minor)pagefaults 0swaps
59.94user 0.39system 1:00.34elapsed 99%CPU (0avgtext+0avgdata 1558752maxresident)k
0inputs+0outputs (0major+371184minor)pagefaults 0swaps
59.87user 0.48system 1:00.35elapsed 99%CPU (0avgtext+0avgdata 1559440maxresident)k
0inputs+0outputs (0major+371530minor)pagefaults 0swaps

unity.js with jemalloc:

44.91user 0.30system 0:46.68elapsed 96%CPU (0avgtext+0avgdata 1147828maxresident)k
245936inputs+0outputs (0major+1067minor)pagefaults 0swaps
45.04user 0.19system 0:45.24elapsed 99%CPU (0avgtext+0avgdata 1147724maxresident)k
0inputs+0outputs (0major+1067minor)pagefaults 0swaps
44.93user 0.18system 0:45.11elapsed 99%CPU (0avgtext+0avgdata 1147728maxresident)k
0inputs+0outputs (0major+1066minor)pagefaults 0swaps

In terms of speed, no surprises for me. The original C++ optimizer deliberately leaked memory like a sieve because freeing memory was found to be slow. Ayzim should really use slab/arena allocation (many many tiny objects of the same type) but jemalloc puts the malloc/free calls way down in my callgrind profiles so it wouldn’t be much of a win. Needless to say, I wouldn’t be enthusiastic about making the system allocator the default.
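
For anyone unfamiliar with the slab/arena idea mentioned above, here is a minimal sketch of one common form of it: all nodes of one type live in a single Vec and refer to each other by index, so the general-purpose allocator is only hit when the Vec grows. This is a hypothetical illustration, not Ayzim's code, and the Node type is invented for the example.

```rust
/// Index of a node inside the arena (stands in for a pointer).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(usize);

/// A tiny expression tree node; children are referenced by index.
#[derive(Debug)]
enum Node {
    Num(f64),
    Add(NodeId, NodeId),
}

/// Arena holding every node in one contiguous buffer.
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    /// Reserving capacity up front means later `alloc` calls do not
    /// touch malloc until the capacity is exceeded.
    fn with_capacity(cap: usize) -> Self {
        Arena { nodes: Vec::with_capacity(cap) }
    }

    /// Stores a node and returns its index.
    fn alloc(&mut self, node: Node) -> NodeId {
        self.nodes.push(node);
        NodeId(self.nodes.len() - 1)
    }

    /// Recursively evaluates the expression rooted at `id`.
    fn eval(&self, id: NodeId) -> f64 {
        match &self.nodes[id.0] {
            Node::Num(value) => *value,
            Node::Add(lhs, rhs) => self.eval(*lhs) + self.eval(*rhs),
        }
    }
}
```

Everything is freed in one go when the Arena is dropped, which mirrors the "never free individual objects" strategy without actually leaking.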

Regarding memory, I am mildly surprised that jemalloc is so much better for unity.js. But ayzim is 4-6x better in terms of memory consumption than the C++ optimizer anyway, so measuring this was never really a focus.


#7

Here are the results for my uncbv project:

Jemalloc + Glibc

time ./target/release/uncbv x tests/others/MegaDatabase2016.cbv -o ~/CBVTests/ --no-confirm
75.49user 3.26system 1:24.75elapsed 92%CPU (0avgtext+0avgdata 971064maxresident)k
0inputs+3993304outputs (0major+15854minor)pagefaults 0swaps

System Alloc + Glibc

time ./target/release/uncbv x tests/others/MegaDatabase2016.cbv -o ~/CBVTests/ --no-confirm
80.45user 2.90system 1:29.13elapsed 93%CPU (0avgtext+0avgdata 971580maxresident)k
0inputs+3993304outputs (0major+19533minor)pagefaults 0swaps

Jemalloc + Musl

time ./target/x86_64-unknown-linux-musl/release/uncbv x tests/others/MegaDatabase2016.cbv -o ~/CBVTests/ --no-confirm
77.63user 2.62system 1:25.63elapsed 93%CPU (0avgtext+0avgdata 967984maxresident)k
0inputs+3993304outputs (0major+15783minor)pagefaults 0swaps

System Alloc + Musl

time ./target/x86_64-unknown-linux-musl/release/uncbv x tests/others/MegaDatabase2016.cbv -o ~/CBVTests/ --no-confirm
84.84user 3.16system 1:32.07elapsed 95%CPU (0avgtext+0avgdata 970204maxresident)k
0inputs+3993304outputs (0major+311665minor)pagefaults 0swaps

It seems the default is the best in my case.


#8

Here’s the issue for jettisoning jemalloc for those interested in helping: https://github.com/rust-lang/rust/issues/36963

It’s something the libs team wants, but work needs to be done. We need both a stable mechanism for selecting the global allocator and an API for implementing allocators that we can commit to.


#9

Is the musl binary statically linked? That would mean that fewer relocations are needed.


#10

Yes. The i686-unknown-linux-musl and x86_64-unknown-linux-musl triples currently hard-code for static linking.
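
If you want a binary to report how it was built when collecting numbers like the ones in this thread, a small sketch along these lines may help. It relies on the compiler's target_env and target_feature = "crt-static" cfg values; the function names are made up for illustration.

```rust
/// Which C library the binary was compiled against, according to the target.
fn libc_flavor() -> &'static str {
    if cfg!(target_env = "musl") {
        "musl"
    } else if cfg!(target_env = "gnu") {
        "glibc"
    } else {
        "other"
    }
}

/// True when the C runtime was linked statically into the binary.
fn is_crt_static() -> bool {
    cfg!(target_feature = "crt-static")
}

/// One-line description suitable for printing next to benchmark results.
fn describe_build() -> String {
    let linkage = if is_crt_static() { "static" } else { "dynamic" };
    format!("libc={} linkage={}", libc_flavor(), linkage)
}
```

Printing describe_build() at startup, or behind a version flag, makes it harder to mix up which binary produced which timing.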

(I use the former as an easy way to use Rust to produce replacements for shell scripts with much better compile-time guarantees which will still run on any x86-compatible Linux I throw them at.)