I am currently trying to minimize the overhead of wrapping computations under a unified interface in Collenchyma.
For that I have set up some benchmark tests and started profiling them with perf (this nice blog post led me to that).
One thing I found really cumbersome was separating the benchmark setup from the benchmarked iterations (the part I actually want to optimize) in the perf output. Currently I am using a hacky solution like this:
#[bench]
fn bench_1000_dot_100_collenchyma(b: &mut Bencher) {
    // Setup: fill two slices with random data and copy them into
    // Collenchyma's shared memory on the native backend.
    let mut rng = thread_rng();
    let slice_a = rng.gen_iter::<f32>().take(100).collect::<Vec<f32>>();
    let slice_b = rng.gen_iter::<f32>().take(100).collect::<Vec<f32>>();

    let backend = backend();
    let shared_a = &mut SharedMemory::<f32>::new(backend.device(), 100);
    let shared_b = &mut SharedMemory::<f32>::new(backend.device(), 100);
    let shared_res = &mut SharedMemory::<f32>::new(backend.device(), 100);
    shared_a.get_mut(backend.device()).unwrap().as_mut_native().unwrap().as_mut_slice().clone_from_slice(&slice_a);
    shared_b.get_mut(backend.device()).unwrap().as_mut_native().unwrap().as_mut_slice().clone_from_slice(&slice_b);
    // Warm-up call, so one-time initialization stays out of the measurement.
    let _ = backend.dot(shared_a, shared_b, shared_res);

    bench_1000_dot_100_collenchyma_profile(b, &backend, shared_a, shared_b, shared_res);
}

// Never inlined, so the measured iterations show up under their own
// symbol in the perf output.
#[inline(never)]
fn bench_1000_dot_100_collenchyma_profile(
    b: &mut Bencher,
    backend: &Backend<Native>,
    shared_a: &mut SharedMemory<f32>,
    shared_b: &mut SharedMemory<f32>,
    shared_res: &mut SharedMemory<f32>,
) {
    b.iter(|| {
        for _ in 0..1000 {
            let _ = backend.dot(shared_a, shared_b, shared_res);
        }
    });
}
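To illustrate the pattern outside of Collenchyma: the same split (setup in the caller, an #[inline(never)] function containing only the measured loop) can be sketched with plain std, where a hand-rolled dot product stands in for backend.dot. Everything here (names, sizes, timing via Instant instead of Bencher) is illustrative, not my actual benchmark:

```rust
use std::hint::black_box;
use std::time::Instant;

// Never inlined, so this function gets its own symbol in perf output,
// separate from the setup in main().
#[inline(never)]
fn measure(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for _ in 0..1000 {
        // black_box keeps the optimizer from hoisting the dot product
        // out of the loop or const-folding the inputs.
        acc += black_box(a)
            .iter()
            .zip(black_box(b))
            .map(|(x, y)| x * y)
            .sum::<f32>();
    }
    acc
}

fn main() {
    // Setup: excluded from the measured symbol.
    let a = vec![1.0f32; 100];
    let b = vec![2.0f32; 100];

    let start = Instant::now();
    let result = measure(&a, &b);
    println!("result {} in {:?}", result, start.elapsed());
}
```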
Now I can focus on bench_1000_dot_100_collenchyma_profile, which should contain only the code I want to optimize. Is there a better solution for this?
I'd also like to know whether it is possible to restrict profiling to call stacks that sit on top of a frame with a certain symbol (i.e. only count samples whose stack passes through a given function).
Any pointers to better approaches for profiling/optimizing (Rust) applications are appreciated!