Binary trees benchmark


Does anyone know why the simple “binary trees” benchmark in Rust is almost two times slower than the C solution, and how to improve the situation?


I wonder how much this for loop at the end costs:

for message in messages {
    println!("{}", message);
}

That flushes on each print, right?
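If that per-line flushing is a cost, the usual fix is to batch the writes yourself. Here's a std-only sketch of the idea; `print_all` and the sample messages are made up for illustration, not taken from the benchmark source:

```rust
use std::io::{self, BufWriter, Write};

// Write all messages through one buffered writer instead of flushing
// line-by-line; the buffer is flushed once, when `out` is dropped.
fn print_all<W: Write>(sink: W, messages: &[&str]) -> io::Result<()> {
    let mut out = BufWriter::new(sink);
    for message in messages {
        writeln!(out, "{}", message)?;
    }
    Ok(()) // BufWriter flushes on drop here
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Locking once up front also avoids re-acquiring the stdout lock
    // on every println! call.
    print_all(stdout.lock(), &["depth 4", "depth 6", "depth 8"])
}
```

Taking the locked handle once and buffering on top of it addresses both the lock traffic and the line-buffered flushing in one change.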


I just tried running the benchmarks on my computer. Windows machine, Intel 6400K CPU, running programs under Bash4Win.


time ./binarytrees.gpp-9 21
real    0m3.679s
user    0m4.109s
sys     0m3.922s

time target/release/binarytrees.rust-2 21
real    0m2.075s
user    0m5.406s
sys     0m8.594s

So the Rust benchmark runs faster on my machine, which means we might just be in the zone where system peculiarities dominate. But note that it’s burning about 5/4 as much user CPU time to get there, and, more importantly, it’s spending over twice as much time in the kernel! So we definitely have room to optimize.

So let’s check where this time is coming from:

% strace -r target/release/binarytrees.rust-2 21 |& sort -rn | head -n1
    1.910483 futex(0x7f3884e1ef00, FUTEX_WAKE_PRIVATE, 1) = 0

Meanwhile in C++:

% strace -r binarytrees.gpp-9 21 |& sort -rn | head -n1
    0.357550 munmap(0x7f49f0c40000, 8192) = 0

So that’s your answer. I believe the Rayon crate is locking, while OpenMP is not. That’s kind of interesting, because this is an embarrassingly parallel problem, where locks should generally not be necessary.
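To make that last point concrete: the benchmark itself parallelizes with Rayon and OpenMP, but the embarrassingly parallel structure can be hand-rolled with plain std threads and no locks at all, because each depth's work touches only thread-local state and results are combined only after join(). This is just a sketch; `work` is a hypothetical stand-in for "build a tree of this depth and checksum it":

```rust
use std::thread;

// Stand-in for the per-depth work in the real benchmark (building and
// checksumming binary trees). Anything that only reads its input and
// returns a value works here.
fn work(depth: usize) -> usize {
    (0..(1usize << depth)).sum()
}

fn main() {
    let depths: Vec<usize> = (4..=10).collect();

    // One thread per depth; each owns its input by move, so no shared
    // mutable state and no locks are needed.
    let handles: Vec<_> = depths
        .into_iter()
        .map(|d| thread::spawn(move || (d, work(d))))
        .collect();

    // Results are only combined after the threads finish.
    let mut results: Vec<(usize, usize)> = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .collect();
    results.sort();

    for (depth, checksum) in results {
        println!("depth {}: checksum {}", depth, checksum);
    }
}
```

Any futex traffic we see in the Rayon version is therefore coming from the scheduler's own coordination, not from anything the problem requires.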

Let’s check our work by turning parallelism off. I replaced into_par_iter with into_iter in the Rust example, and deleted #pragma omp parallel for from the C++ example.


C++:

real    0m5.974s
user    0m2.875s
sys     0m3.078s


Rust:

real    0m3.719s
user    0m3.266s
sys     0m0.453s

Now Rust is much faster on my machine, and it’s spending almost no time hitting the kernel. It seems that locks are the entirety of the problem here.


It’s using a shared arena to allocate nodes from parallel iterations. That’s surely contending a lot!

edit: nevermind, the arenas are not shared, I read it wrong…


Unfortunately I’m still underprepared to dig into this myself; perhaps Niko Matsakis could fix or improve the Rayon situation.


I think it’s a mistake to judge performance under Bash4Win. My i7-7700K with native Fedora 26 gets:

real    0m0.775s
user    0m5.013s
sys     0m0.092s

Not much system time at all!

Also note that strace doesn’t follow threads by default. strace -c only gives me this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.09    0.055584       11117         5           futex
  8.89    0.005610         374        15           munmap
  1.60    0.001011          42        24           brk
  0.36    0.000230           6        39           mmap
  0.21    0.000135          19         7           open

But strace -f -c gives:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.70    0.048566         357       136           futex
  8.76    0.005274          98        54           munmap
  6.77    0.004072           0     19428           sched_yield
  1.73    0.001040          43        24           brk
  0.83    0.000499           6        87           mmap

One more thing I realized is that I took these measurements with Fedora’s rustc, which is set to prefer the system allocator. When I run with the upstream stable rustc that uses jemalloc, I get:

real    0m0.813s
user    0m4.729s
sys     0m0.735s

i.e. much more sys time, and that is shown in strace too:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 76.36    0.885845        5369       165         3 futex
 15.17    0.176029          46      3861           madvise
  8.20    0.095092           2     63341           sched_yield
  0.19    0.002229          16       140           mmap
  0.02    0.000210           3        68           munmap

So IMO blame jemalloc, not Rayon. :slight_smile:
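On current Rust you can make the allocator choice explicit rather than relying on how the toolchain was built. The `#[global_allocator]` attribute (stable since Rust 1.28, so after this thread's Fedora 26 measurements) pins the whole program to the platform's malloc/free, which is what Fedora's rustc was effectively doing by default:

```rust
use std::alloc::System;

// Route every heap allocation through the platform's malloc/free instead
// of a bundled allocator like jemalloc. This reproduces what Fedora's
// rustc did by default in the measurements above.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // All allocations below now go through the system allocator.
    let v: Vec<u64> = (0..1_000).collect();
    println!("allocated {} elements via the system allocator", v.len());
}
```

That makes the comparison easy to reproduce on one toolchain: build once with and once without the attribute, and diff the sys time and the madvise counts in strace.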

We might tweak Rayon’s sleeping algorithm, though, which you can see in that heap of sched_yield calls as idle threads spin down for lack of work. But it’s also likely these numbers are perturbed by the overhead of strace itself, which adds a delay at every PTRACE_SYSCALL stop.