In this issue we are discussing whether we should reduce the buffer limit for Rustls streams from the default 64 KB to 16 KB.
I am using Apple Instruments to check whether there actually is a performance improvement when reducing it to 16 KB. I have used Blackfire to profile PHP projects before, but I am not sure how to analyze these profiling runs for Rust projects.
What I did was build with cargo build --release for both a 64 KB and a 16 KB buffer, then do multiple runs in Apple Instruments for each.
Profiling is for finding the cause of poor performance. When you have two implementations and you want to pick one that performs better, you need to do benchmarking, not profiling.
Thanks for your reply. I tried benchmarking using criterion and cargo bench, but the results are still hard to interpret: running the same benchmark multiple times (without changing the buffer size) yields runtimes anywhere from 150 ms to 350 ms.
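For what it's worth, one way to see that spread for yourself is to collect raw samples with std::time::Instant and look at the min/max range. This is just a sketch: the workload function below is a made-up CPU-bound stand-in, not your actual Rustls stream code, and the buffer sizes are only illustrative.

```rust
use std::time::Instant;

// Stand-in for the real work (e.g. pushing data through a Rustls
// stream). Any deterministic, CPU-bound function will do here.
fn workload(buf_size: usize) -> u64 {
    let buf = vec![1u8; buf_size];
    let mut sum = 0u64;
    for _ in 0..1000 {
        for &b in &buf {
            sum = sum.wrapping_add(b as u64);
        }
    }
    sum
}

// Collect `n` raw timing samples (in nanoseconds) for one configuration.
// black_box keeps the optimizer from deleting the workload entirely.
fn sample(n: usize, buf_size: usize) -> Vec<u128> {
    (0..n)
        .map(|_| {
            let start = Instant::now();
            std::hint::black_box(workload(buf_size));
            start.elapsed().as_nanos()
        })
        .collect()
}

fn main() {
    let samples = sample(20, 16 * 1024);
    let min = *samples.iter().min().unwrap();
    let max = *samples.iter().max().unwrap();
    println!(
        "min: {} ns, max: {} ns, spread: {:.1}%",
        min,
        max,
        100.0 * (max - min) as f64 / min as f64
    );
}
```

If the spread is already large with a toy workload like this, you know how much noise your environment contributes before the buffer size change even enters the picture.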
I guess there are too many different components influencing the runtime, making it hard to isolate the performance difference caused by a smaller buffer size.
The difference between 64 KB and 16 KB would probably be subtle, but how do you even detect subtle performance improvements?
You have a science experiment where the data is noisy, and you want to determine whether a change produces a statistically significant result. The general process is to choose an experiment (the benchmark), run it N times with the baseline version and N times with the changed version, and calculate how likely it is that you would see the changed version's results by chance. You can do this, for example, using a t-test, where you are looking for a result with a small p-value.
The more variance there is between experiment runs (e.g. due to a "noisy" environment), the bigger a change will need to be to "prove" a beneficial impact, in the sense that the result is sufficiently unlikely to have happened by chance.
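The t statistic itself is only a few lines to compute. Here is a sketch of Welch's variant (which tolerates unequal variances between the two samples) in plain Rust; the runtimes in main are made-up numbers for illustration, and to turn t into a p-value you would still look it up using the Welch–Satterthwaite degrees of freedom, e.g. via a stats crate or table.

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Sample variance with Bessel's correction (divide by n - 1).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64
}

// Welch's t statistic for two independent samples. A large |t| means
// the difference in means is unlikely to be explained by noise alone.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let se = (variance(a) / a.len() as f64 + variance(b) / b.len() as f64).sqrt();
    (mean(a) - mean(b)) / se
}

fn main() {
    // Hypothetical runtimes in ms: baseline (64 KB) vs. changed (16 KB).
    let baseline = [230.0, 250.0, 244.0, 261.0, 238.0];
    let changed = [221.0, 233.0, 229.0, 240.0, 225.0];
    println!("t = {:.2}", welch_t(&baseline, &changed));
}
```

This makes the earlier point concrete: the standard error in the denominator grows with the variance of your runs, so a noisy environment directly shrinks t and makes a real improvement harder to "prove".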
As for how to deal with noise, you can:
Try to devise ways to run the code in a "quieter" environment (e.g. a system or virtual machine with less activity happening in it). A server running in a well-isolated cloud environment might work. The simplest version of this is to just close all other running programs, especially large ones like Chrome/Electron, Docker etc.
Reduce the scope of the system under test. Often a micro-benchmark is devised which tests just a changed function.
Make the experiment itself more extreme (larger inputs, files, etc.) to amplify any differences in behavior.
Run the experiment more times to increase your sample size.
There are many reasons for this. In addition to what @robertknight mentioned, your CPU might be running at a varying clock frequency depending on how hot it is, or on how long the hardware is allowed to run at max turbo. This is especially true on laptops, but even desktops will do this.
When benchmarking, I recommend disabling power saving features but also disabling turbo frequency. How to do this will vary between OSes as well as CPU vendor (e.g. AMD vs Intel). I have some scripts for this on x86-64 Linux, but you mentioned MacOS in your post, and I have no clue about how to do such things there, you would have to do your own research.
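For reference, on x86-64 Linux the relevant knobs are sysfs files (this needs root, the exact paths depend on the frequency driver in use, and none of it applies on macOS):

```shell
#!/bin/sh
# Pin the frequency governor to "performance" on all cores so the
# kernel does not downclock idle cores mid-benchmark.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Disable turbo. Which file exists depends on the driver:
# intel_pstate exposes no_turbo (1 = turbo off) ...
if [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# ... while acpi-cpufreq (common on AMD) exposes boost (0 = boost off).
elif [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
fi
```

Remember to undo these afterwards (or just reboot), since they trade everyday performance for benchmark stability.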
(None of that applies to virtual machines, of course. If you are CPU bound, you really need to do this on real hardware to get useful values.)
On my final Intel Mac I found that variability between benchmark runs was sometimes significant, and I don't know of an "easy" way to disable frequency scaling there, certainly not compared to Linux systems. The main mitigations I used were keeping the system generally quiet and allowing for a longer warmup time in benchmarks, or collecting more samples.
On my M3 Mac however (5 P cores, 6 E cores) I've found this is much less of a problem, with benchmark runs often being very consistent.