As always in performance analysis, the answer depends on what exactly you want to know (for example, an exact "number of calls to a function" is a very expensive quantity to measure, so most profilers prefer to give you an approximate "percent of time spent in the function" instead).
But in my opinion, the best "default choice" these days is a hardware-assisted sampling profiler. These tools are either operating system-specific or hardware vendor-specific; here are some references that you can look up:
- Linux: perf (aka perf_events)
- Windows: WPA (or its xperf(view) ancestor)
- macOS: XCode Instruments
- Intel: VTune Amplifier
- AMD: CodeXL
- NVidia: NSight
Personally, I spend most of my time using Linux these days, so perf is what I am most proficient with. I wrote a small tutorial about perf + Rust on this forum a while ago @ Profilers and how to interpret results on recursive functions - #2 by HadrienG .
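For context, the basic workflow from that tutorial looks roughly like this (a minimal sketch; the binary name `myapp` is made up, and you will want debug info in your release builds, e.g. `debug = true` under `[profile.release]` in Cargo.toml, so that perf can resolve symbols):

```
# Build an optimized binary with debug info enabled in Cargo.toml
cargo build --release

# Record a profile of a full run, with call graphs enabled (-g)
perf record -g ./target/release/myapp

# Browse the recorded samples interactively
perf report
```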
One thing which I have learned since I wrote this tutorial is that `--call-graph=dwarf` is a better default than `--call-graph=lbr`, because LBR only works if your hardware is recent enough and your application's call chains are not too deep (roughly 16 frames or less). I would now recommend starting with DWARF, then trying LBR, checking that the LBR mode works and that the results are similar, and only in that case switching to LBR.
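A possible way to put that recommendation into practice (same hypothetical binary as above; both call-graph modes are standard `perf record` options):

```
# Start with DWARF-based call graphs: works on most hardware and with
# deep call chains, at the cost of larger perf.data files
perf record --call-graph=dwarf ./target/release/myapp
perf report

# Then try LBR and check that the call chains look similar before
# adopting it as your default
perf record --call-graph=lbr ./target/release/myapp
perf report
```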