Making benchmarking iterations visible for profiling

I am currently trying to minimize the overhead of wrapping computations under a unified interface in Collenchyma.
For that I have set up some benchmark tests and started profiling them with perf (this nice blog post led me to that).

One thing I found really cumbersome was separating the benchmark setup from the benchmarking iterations (which I want to optimize) in the perf output. Currently I am using a hacky solution like this:

fn bench_1000_dot_100_collenchyma(b: &mut Bencher) {
    let mut rng = thread_rng();
    let slice_a = rng.gen_iter::<f32>().take(100).collect::<Vec<f32>>();
    let slice_b = rng.gen_iter::<f32>().take(100).collect::<Vec<f32>>();

    let backend = backend();
    let shared_a = &mut SharedMemory::<f32>::new(backend.device(), 100);
    let shared_b = &mut SharedMemory::<f32>::new(backend.device(), 100);
    let shared_res = &mut SharedMemory::<f32>::new(backend.device(), 100);
    let _ =, shared_b, shared_res);
    bench_1000_dot_100_collenchyma_profile(b, &backend, shared_a, shared_b, shared_res);
}

fn bench_1000_dot_100_collenchyma_profile(b: &mut Bencher, backend: &Backend<Native>, shared_a: &mut SharedMemory<f32>, shared_b: &mut SharedMemory<f32>, shared_res: &mut SharedMemory<f32>) {
    b.iter(|| {
        for _ in 0..1000 {
            let _ =, shared_b, shared_res);
        }
    });
}

Now I can focus on bench_1000_dot_100_collenchyma_profile which should only contain the code I want to optimize.
Is there any better solution for that?
I'd also like to know whether it is possible to profile only the stacks that sit on top of a frame with a certain symbol.

Any pointers to some better solutions for profiling/optimizing (Rust) applications are appreciated!

Generally, I wouldn't rely on #[bench] too much. I'm sorry that I cannot give better suggestions, but the current test framework isn't really evolving or on the way to stability.

Things like look promising, though.

I use #[bench] quite frequently with some success. Splitting out the iterations into a separate function is not a bad idea. One issue that I've run into is that the #[bench] function itself is actually invoked many times, which means the setup code is invoked many times. If the setup code is expensive relative to the code you're benchmarking (which is frequently the case for me), then benchmarks can take a long time to complete and any profile you get out of running the benchmarks will be heavily skewed towards the setup code. It's not impossible to read such profiles, but it is not ideal. (I don't actually know why the benchmarking harness runs the setup code more than once.)
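One caveat with the split-out function: in an optimized build the compiler may inline it into its caller, in which case it no longer shows up as its own frame in the perf output. Marking it `#[inline(never)]` should keep the symbol around. Below is a minimal sketch of that idea; `dot` and `dot_1000_profile` are hypothetical stand-ins (a plain slice dot product instead of the Collenchyma call), and `std::hint::black_box` is used here only to discourage the optimizer from removing the loop.

```rust
use std::hint::black_box;

// Hypothetical stand-in for the operation being benchmarked:
// a plain dot product over two slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// #[inline(never)] asks the compiler to keep this function as a separate
// symbol, so the iterations appear as a distinct frame when profiling.
#[inline(never)]
fn dot_1000_profile(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0;
    for _ in 0..1000 {
        // black_box hides the inputs and result from the optimizer so the
        // repeated calls are not folded into a single one.
        acc += black_box(dot(black_box(a), black_box(b)));
    }
    acc
}
```

In a real benchmark the body of the loop would be the call you want to optimize, and the `#[bench]` function would only do the setup and then call the `_profile` function.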

You can work around this pretty easily by ensuring your setup code only runs once. The lazy_static crate makes it pretty easy. Here's an example:
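A minimal sketch of the run-the-setup-once idea follows. Note that it uses `std::sync::OnceLock` from the standard library rather than the lazy_static crate, purely so the snippet has no external dependencies; the pattern is the same. The `inputs` function and its contents are hypothetical placeholders for whatever expensive setup your benchmark needs.

```rust
use std::sync::OnceLock;

// Hypothetical expensive setup: build two input vectors.
// The closure passed to get_or_init runs at most once, no matter how
// often the benchmark function (and therefore inputs()) is invoked.
fn inputs() -> &'static (Vec<f32>, Vec<f32>) {
    static INPUTS: OnceLock<(Vec<f32>, Vec<f32>)> = OnceLock::new();
    INPUTS.get_or_init(|| {
        let a: Vec<f32> = (0..100).map(|i| i as f32).collect();
        let b: Vec<f32> = (0..100).map(|i| (i * 2) as f32).collect();
        (a, b)
    })
}
```

Each `#[bench]` function would then start with something like `let (a, b) = inputs();`. This only hands out shared references, so setup that has to be mutated during the benchmark would need extra care (for example a `Mutex`).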

I didn't know about lazy_static until earlier today, and that sounds like a good use for it.

I have now found another way to make the output more focused. Before viewing the perf data or constructing a flamegraph from it, the output has to be "folded" into one line per stack. After that step, the output can easily be filtered with grep for the function name, leaving only the relevant stacks behind.
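For anyone who wants to go one step further than grep, here is a rough sketch of the same filter in Rust that also cuts away the frames below the function of interest, so the flamegraph is rooted at it. It assumes the usual folded format of semicolon-separated frames followed by a space and a sample count; `focus_on` is a hypothetical helper name, not part of any tool.

```rust
// Keep only the folded stacks that contain `symbol`, and drop every frame
// below the first frame matching it.
// Assumed input format, one stack per line: "frame;frame;...;frame count".
fn focus_on(folded: &str, symbol: &str) -> Vec<String> {
    folded
        .lines()
        .filter_map(|line| {
            // Split at the last space: frame names may contain spaces,
            // the trailing sample count does not.
            let (stack, count) = line.rsplit_once(' ')?;
            let frames: Vec<&str> = stack.split(';').collect();
            // Lines without the symbol are skipped entirely.
            let start = frames.iter().position(|f| f.contains(symbol))?;
            Some(format!("{} {}", frames[start..].join(";"), count))
        })
        .collect()
}
```

Feeding it the folded output with `"bench_1000_dot_100_collenchyma_profile"` as the symbol would leave only the stacks under the benchmark iterations, which can then be passed on to the flamegraph step as before.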