Removing println!() slows down benchmark +10%?


#1

I have a library crate that has some println!() scattered around the place, so I thought I should replace these with warn!() instead.
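For context, the kind of change I’m making looks roughly like this (just a sketch, assuming the log crate facade with a Rust-2015-style macro import; handle_unexpected_payload is a made-up name, not my actual code):

#[macro_use]
extern crate log;

fn handle_unexpected_payload(pid: u16) {
    // Before: unconditional output straight to stdout
    //println!("{:?}: Ignoring elementary stream content without a payload_start_indicator", pid);

    // After: routed through the log facade; this is a no-op unless the
    // downstream binary installs a logger implementation
    warn!("{:?}: Ignoring elementary stream content without a payload_start_indicator", pid);
}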

Benchmark performance got a bit worse with that change (even though the benchmark doesn’t actually hit the println!()), so to work out what was going on I tried progressively minimising the change, and eventually found that just removing one particular println!() call was enough to trigger the regression.

How can this be?
Have I stared for too long at the output of perf report? Is this the compiler’s revenge for all the #[inline] I added? :dragon:

parse                   time:   [3.1432 ms 3.1586 ms 3.1753 ms]                   
                        thrpt:  [11.028 GiB/s 11.087 GiB/s 11.141 GiB/s]
                        change: [+9.4721% +12.055% +14.107%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

I can reliably see the performance degrade with this change in place, and then see it return to previous levels if I revert the change.

My code is at dholroyd/mpeg2ts-reader, and the change which degrades performance is:

diff --git a/src/pes.rs b/src/pes.rs
index 4058f70..7b4e918 100644
--- a/src/pes.rs
+++ b/src/pes.rs
@@ -109,7 +109,6 @@ where
                     }
                 },
                 PesState::Begin => {
-                    println!("{:?}: Ignoring elementary stream content without a payload_start_indicator", packet.pid());
                     self.state = PesState::IgnoreRest;
                 },
                 PesState::IgnoreRest => ()

Any directions I could look in to try and work this out? :exploding_head:

$ cargo --version
cargo 1.30.0 (36d96825d 2018-10-24)

#2

Clearly the compiler is quite attached to that printout :slight_smile:.

This type of thing can (and does) happen for a variety of reasons, particularly because println! expands to a fair chunk of code and also increases stack usage quite a bit (see the rough expansion sketched after this list):

  • binary layout is now different because there’s less code, and you can get unfortunate placement of the code pages in the cache
  • removing all that code changes stack usage, impacts the register allocator, and so on - again, this can be unfortunate
  • not relevant to your particular example, but even which cwd you’re in, which environment variables are present, etc. can impact a binary’s performance, all due to the memory layout of data and code
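For a sense of scale: a single println! desugars into a call that builds the format machinery on the stack before printing. Approximately this, going by what cargo expand showed on compilers of that era (these are compiler-internal APIs, shown purely for illustration; it won’t compile as-is):

::std::io::_print(::std::fmt::Arguments::new_v1(
    &["", ": Ignoring elementary stream content without a payload_start_indicator\n"],
    &match (&packet.pid(),) {
        (arg0,) => [::std::fmt::ArgumentV1::new(arg0, ::std::fmt::Debug::fmt)],
    },
));

That’s where both the extra code size and the extra stack usage come from.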

The only way to get more insight on this is to run your bench under something like Linux’s perf and compare the pre- and post-change output.
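Something like this workflow, say (`<bench-binary>` is a placeholder for whatever hashed name cargo puts in target/release/deps):

$ cargo bench --no-run
$ perf record --call-graph=dwarf ./target/release/deps/<bench-binary> --bench
$ perf report
# make the change, rebuild, record again; perf keeps the previous run
# as perf.data.old, so the two recordings can be compared directly:
$ perf diff perf.data.old perf.data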

Why did you add #[inline(always)] to consume()? At a glance, it has a non-trivial body, and typically you’d only force inlining for trivial functions.

You can play around with the code, such as stripping #[inline(always)] from places that perhaps ought not to have it. If you have cold/uncommon code paths in otherwise-hot functions, you can outline them into a separate function annotated with #[cold] - schematically, as in the sketch below.
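Something of this shape (names and the condition are made up, purely to illustrate the pattern):

struct Parser {
    errors: u64,
}

impl Parser {
    #[inline]
    fn consume(&mut self, byte: u8) {
        if byte != 0x47 {
            // rare case: jump out of the hot function entirely
            self.handle_bad_sync(byte);
            return;
        }
        // ...hot path continues, with a small body and low register pressure...
    }

    #[cold]
    #[inline(never)]
    fn handle_bad_sync(&mut self, byte: u8) {
        self.errors += 1;
        eprintln!("unexpected sync byte {:#x}", byte);
    }
}

Keeping the cold code out of the inlined hot body can help both i-cache locality and the register allocator.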

I’d also try running your bench on a different machine (preferably different CPU model), if you’ve access to one.


#3

Thanks for the generous advice!

I’ve been trying to understand this via perf’s annotated output, but my lack of assembly knowledge makes it slow going (even with the Rust source interleaved). More learning for me to do!

FWIW, I’m not sure why I put #[inline(always)] rather than just #[inline] on consume(), but the lack of either annotation reduces microbenchmark throughput by ~20%!

Since I’ve put #[inline] all over the place, most of the benchmark runtime now shows up in a single stack frame in perf report. Trying to remove cold code from that stack frame is a great idea that I will definitely try – thanks!

I will also start looking at performance differentials on an RPi too; it sounds like a great way to avoid ‘over-fitting’ the code to my particular benchmark conditions.