TLDR
Making a closure only once in main
rather than one million times in a function inside a loop, slows my code down by 16%.
Can you help me make sense of it?
Details
I (and this is likely to be because of my shortcomings, rather than shortcomings of these tools) can't find any meaningful signal by looking at the output of perf or kcachegrind.
Looking at before/after flamegraphs I can only spot two differences
-
The appearance of a stack that doesn't even include
main
! But this contribution is small compared to the second one. -
Some other closure:
move |(index, weight)| weight * blob[index]
seemingly unrelated to anything that has changed
-
seems to burn up twice as much time as before
-
before, it spent 75% of its time in the indexing op; now it only spends 50% of its time there.
-
I didn't expect this change to speed my code up (only to simplify it), but I certainly didn't expect it to slow the code down so noticeably!
Questions
Does this sort of thing ring any bells with you?
How can I look at the assembly generated by this closure (and compare the before/after versions) which is embedded deep within my code? I don't know how to find it in perf or kcachegrind.
Even more details
Description of the code
I have tried to summarize the pertinent structure of my code in words. Corresponding pseudocode is below.
-
main
passes a huge datasetdata
and two trivial valuesa
andb
intoaaa
-
These three things get passed down a chain of functions (
bbb
,ccc
) -
ccc
loops over the data callinggenerate_stuff
-
generate_stuff
(depending on a CLI switch) either usesa
andb
to make a closure, or uses the identity closure instead -
the stuff generated by
genarate_stuff
is used to calculate a single value -
calculate_value
iterates overstuff
mapping a simple closure: the one that changes in the flamegraph -
(afterwards, there is a further iteration over
stuff
which is why the stuff was collected into a vector)
The change consists of creating the closure in main
and passing the closure up the stack, rather than passing a
and b
up the stack and creating the closure repeatedly inside the 'loop'.
NB:
-
It's not the closure whose creation point is changed, which slows down: it's the one inside
calculate_value
. -
The timings are performed in the mode where the closure made from
a
andb
is not being used at all!
Outline of the code
fn main{} {
let a = ...;
let b = ...;
let data = ...; // typical size: 1 million
let result = aaa(&data, a, b);
}
fn aaa(..., data, a, b) { bbb(..., data, a, b) }
fn bbb(..., data, a, b) { ccc(..., data, a, b) }
fn ccc(..., data, a, b) {
let blob = ...;
data.iter().for_each(|datum|) {
let stuff = generate_stuff(..., a, b).collect(); // typical size: 60
let value = calculate_value(stuff.iter().copied());
// mutate blob using stuff and value
...
}
blob
}
fn generate_stuff<'a>(..., a, b) -> impl Iterator<Item = (Index, Weight)> + 'a {
// The timings are performed with the CLI option set *NOT* to use the closure
let closure = match cli_option {
Some(sigma) => make_closure(a,b),
None => Box::new(|x| x),
}
make_stuff(...).map(closure)
}
fn calculate_value(&blob: Blob, stuff: impl Iterator<(Index, X>)) -> Weight {
stuff
// *THIS* is the closure which seems to slow down
.map(move |(index, weight)| weight * blob[index])
.sum()
}