I'm chiming in to report an observation from compiling Rust applications for both the glibc (default) and musl targets, and with jemalloc versus system-alloc. In every scenario I've tried, the musl-compiled Rust binaries were significantly faster than their glibc counterparts, so it's worth investigating this in your own projects. I've seen speedups ranging from 50% to 1000%.
I managed to drastically boost the performance of my Parallel application by switching to musl and ditching jemalloc for the system allocator, roughly halving memory consumption and CPU cycles. I often find jemalloc to be more of a hindrance to performance in general, but it takes a nightly compiler to get rid of it, which is a bit silly. Anyway, here are some interesting performance metrics from my Linux box for a real-world application, Parallel. Benchmarks are ordered from slowest to fastest.
seq 1 10000 | perf stat target/release/parallel echo > /dev/null
seq 1 10000 | perf stat target/x86_64-unknown-linux-musl/release/parallel echo > /dev/null
Parallel with Jemalloc + Glibc (3368KB Max RSS)
10957.993473 task-clock:u (msec) # 1.673 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,487,686 page-faults:u # 0.136 M/sec
3,953,268,885 cycles:u # 0.361 GHz (85.95%)
0 stalled-cycles-frontend:u (85.68%)
0 stalled-cycles-backend:u # 0.00% backend cycles idle (85.21%)
1,451,526,963 instructions:u # 0.37 insn per cycle (84.79%)
325,691,822 branches:u # 29.722 M/sec (84.29%)
27,016,217 branch-misses:u # 8.30% of all branches (84.67%)
6.550195345 seconds time elapsed
Parallel with System Alloc + Glibc (3228KB Max RSS)
8813.700737 task-clock:u (msec) # 1.604 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,206,385 page-faults:u # 0.137 M/sec
3,227,954,900 cycles:u # 0.366 GHz (86.91%)
0 stalled-cycles-frontend:u (86.59%)
0 stalled-cycles-backend:u # 0.00% backend cycles idle (85.11%)
1,176,187,072 instructions:u # 0.36 insn per cycle (87.05%)
257,953,651 branches:u # 29.267 M/sec (88.35%)
25,232,814 branch-misses:u # 9.78% of all branches (86.67%)
5.494453770 seconds time elapsed
Parallel with Jemalloc + Musl (1768KB Max RSS)
7724.722519 task-clock:u (msec) # 1.594 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,210,474 page-faults:u # 0.157 M/sec
3,353,744,654 cycles:u # 0.434 GHz (88.37%)
0 stalled-cycles-frontend:u (88.27%)
0 stalled-cycles-backend:u # 0.00% backend cycles idle (87.71%)
1,323,967,181 instructions:u # 0.39 insn per cycle (85.28%)
281,211,163 branches:u # 36.404 M/sec (85.80%)
24,024,922 branch-misses:u # 8.54% of all branches (87.68%)
4.844914953 seconds time elapsed
Parallel with System Alloc + Musl (1768KB Max RSS)
4757.338202 task-clock:u (msec) # 1.329 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
757,191 page-faults:u # 0.159 M/sec
2,306,342,779 cycles:u # 0.485 GHz (90.86%)
0 stalled-cycles-frontend:u (92.85%)
0 stalled-cycles-backend:u # 0.00% backend cycles idle (89.90%)
1,150,291,731 instructions:u # 0.50 insn per cycle (91.26%)
233,091,489 branches:u # 48.996 M/sec (89.56%)
20,072,159 branch-misses:u # 8.61% of all branches (89.71%)
3.580532601 seconds time elapsed
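To make the four runs easier to compare, here is a small sketch that turns the elapsed times above into speedup factors relative to the slowest configuration (jemalloc + glibc). The numbers are copied straight from the perf stat output; the `RUNS` table and the `speedup` helper are names I made up for illustration.

```rust
/// Elapsed wall-clock seconds copied from the perf stat runs above,
/// in the same slowest-to-fastest order.
const RUNS: [(&str, f64); 4] = [
    ("jemalloc + glibc", 6.550195345),
    ("system alloc + glibc", 5.494453770),
    ("jemalloc + musl", 4.844914953),
    ("system alloc + musl", 3.580532601),
];

/// How many times faster the candidate run is than the baseline run.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

fn main() {
    // Use the slowest configuration as the baseline.
    let (_, baseline) = RUNS[0];
    for (name, secs) in RUNS.iter() {
        println!("{:<22} {:.3}s  {:.2}x", name, secs, speedup(baseline, *secs));
    }
}
```

For this particular benchmark, musl with the system allocator comes out at roughly 1.8x the jemalloc + glibc baseline in wall-clock time, and each of the two changes contributes on its own.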
I might add more benchmarks from other applications, and I'd be interested in seeing performance comparisons that anyone else has from their own applications.
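For anyone who wants to try the allocator swap themselves: as far as I know, newer toolchains (Rust 1.28 and later) expose the `#[global_allocator]` attribute on stable, so a nightly compiler should no longer be needed for this, and from 1.32 onward the system allocator is the default anyway. Below is a minimal sketch under that assumption; `sum_to` is just a hypothetical workload that allocates something, not code from Parallel.

```rust
use std::alloc::System;

// Route every heap allocation in this binary through the platform's
// allocator (the malloc/free provided by glibc or musl on Linux).
// Assumes Rust 1.28 or later, where this attribute is stable.
#[global_allocator]
static GLOBAL: System = System;

/// Hypothetical workload: collect 1..=n into a Vec (a heap allocation)
/// and sum it.
fn sum_to(n: u64) -> u64 {
    let numbers: Vec<u64> = (1..=n).collect();
    numbers.iter().sum()
}

fn main() {
    println!("sum = {}", sum_to(10_000));
}
```

Building the same crate with `cargo build --release --target x86_64-unknown-linux-musl` (after installing that target) should produce the musl binary at the path used in the second command above, which makes it easy to run the same comparison on your own application.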