How to profile a Rust binary?

Hi

In this issue we are discussing whether we should reduce the buffer limit for Rustls streams from the default 64 KB to 16 KB.

I am using Apple Instruments to check whether there actually is a performance improvement when reducing it to 16 KB. I have used Blackfire to profile PHP projects before, but I am not sure how to analyze these profiling runs for Rust projects.

What I did was run cargo build --release for both a 64 KB and a 16 KB buffer. Then I did multiple runs in Apple Instruments.

For example, Allocations:

For both 16 KB and 64 KB I saw a similar number of total bytes.

Also, the total execution time did not show any significant difference between the two options:

The Network Connections data sadly is always empty; in a few runs it shows data for other processes completely unrelated to the binary I'm profiling:

So I am wondering how I would actually profile a Rust binary?

  1. The difference between 64 KB and 16 KB would probably be subtle, but how can I detect subtle performance improvements at all?
  2. Each run produces different numbers, affected by many external factors. How can I still draw conclusions despite this?
  3. Any other tips on how to profile a Rust binary?

Profiling is for finding the cause of poor performance. When you have two implementations and you want to pick one that performs better, you need to do benchmarking, not profiling.


Thanks for your reply. I tried benchmarking using Criterion and cargo bench, but the results are still hard to interpret. Running the same benchmark multiple times (without changing the buffer size) yields runtimes ranging from 150 ms to 350 ms.

I guess there are too many different components influencing the runtime, making it too hard to isolate the performance difference from the smaller buffer size.

You can adjust Criterion's parameters, in particular the measurement_time or sample_size, to reduce noise.
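As a sketch, a bench target that sets those knobs could look like the following (this assumes the criterion crate as a dev-dependency; the group name and the benchmarked closure are placeholders, not the actual Rustls code under discussion):

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

fn buffer_benchmarks(c: &mut Criterion) {
    // Hypothetical group name for illustration.
    let mut group = c.benchmark_group("rustls_buffer");
    // Spend more wall-clock time per benchmark and take more samples,
    // which helps average out background noise.
    group.measurement_time(Duration::from_secs(20));
    group.sample_size(200);
    group.bench_function("16kb", |b| {
        b.iter(|| {
            // Placeholder: put the code exercising the 16 KB buffer here.
        })
    });
    group.finish();
}

criterion_group!(benches, buffer_benchmarks);
criterion_main!(benches);
```

Criterion also accepts these as command-line options when you run `cargo bench`, so you can experiment without recompiling the bench target.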

  1. The difference between 64 KB and 16 KB would probably be subtle, but how can I detect subtle performance improvements at all?

You have a science experiment where the data is noisy. You want to determine whether a change produces a statistically significant result. The general process is to choose an experiment (the benchmark), run it N times with the baseline version and N times with the changed version, and calculate how likely it is that you would see the changed version's results by chance. You can do this, for example, using a t-test, where you are looking for a result with a small p-value.
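A minimal sketch of computing the test statistic (Welch's unequal-variance t, using only the standard library; the timing numbers are made up for illustration):

```rust
/// Sample mean.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Unbiased sample variance (divides by n - 1).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64
}

/// Welch's t statistic for two independent samples.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let se = (variance(a) / a.len() as f64 + variance(b) / b.len() as f64).sqrt();
    (mean(a) - mean(b)) / se
}

fn main() {
    // Made-up runtimes in ms: baseline (64 KB) vs. changed (16 KB) runs.
    let baseline = [10.0, 11.0, 10.5, 10.2, 10.8];
    let changed = [12.0, 12.5, 11.8, 12.2, 12.6];
    let t = welch_t(&baseline, &changed);
    // A |t| well above ~2 suggests the difference is unlikely to be chance.
    // For an actual p-value, feed t into a t-distribution CDF.
    println!("t = {t:.2}");
}
```

Criterion does this kind of statistical comparison for you when you run a benchmark against a saved baseline, but knowing what it computes helps when reading its reports.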

The more variance there is in results across experiment runs (e.g. due to a "noisy" environment), the bigger a change needs to be to "prove" a beneficial impact, in the sense that the result is sufficiently unlikely to have happened by chance.

As for how to deal with noise, you can:

  • Try to devise ways to run the code in a "quieter" environment (e.g. a system or virtual machine with less activity happening in it). A server running in a well-isolated cloud environment might work. The simplest version of this is to just close all other running programs, especially large ones like Chrome/Electron, Docker, etc.
  • Reduce the scope of the system under test. Often a micro-benchmark is devised which tests just a changed function.
  • Make the experiment itself more extreme (larger inputs, files, etc.) to amplify any differences in behavior.
  • Run the experiment more times to increase your sample size.

There are many reasons for this. In addition to what @robertknight mentioned, your CPU might be running at a varying clock frequency depending on how hot it is, or on how long the hardware is allowed to run at max turbo. This is especially true on laptops, but even desktops will do this.

When benchmarking, I recommend disabling power-saving features but also disabling turbo frequency. How to do this will vary between OSes as well as CPU vendor (e.g. AMD vs Intel). I have some scripts for this on x86-64 Linux, but you mentioned macOS in your post, and I have no clue about how to do such things there; you would have to do your own research.

(None of that applies to virtual machines, of course; if you are CPU bound you really need to do this on real hardware to get useful values.)

EDIT: The Linux script for anyone who is interested: Script to make system more predictable during benchmarking · GitHub


How to do this will vary between OSes as well as CPU vendor (e.g. AMD vs Intel). I have some scripts for this on x86-64 Linux, but you mentioned macOS in your post, and I have no clue about how to do such things there; you would have to do your own research.

On my final Intel Mac I found that variability between benchmark runs was sometimes significant, and I don't know of an "easy" way to disable frequency scaling there, certainly not compared to Linux systems. The main mitigations I used were keeping the system generally quiet, allowing a longer warmup time in benchmarks, and collecting more samples.

On my M3 Mac, however (5 P-cores, 6 E-cores), I've found this is much less of a problem, with benchmark runs often being very consistent.