Performance conundrum

TLDR

Making a closure only once in main, rather than one million times inside a loop in a function, slows my code down by 16%.

Can you help me make sense of it?

Details

I (and this is likely to be because of my shortcomings, rather than shortcomings of these tools) can't find any meaningful signal by looking at the output of perf or kcachegrind.

Looking at before/after flamegraphs I can only spot two differences

  1. The appearance of a stack that doesn't even include main! But this contribution is small compared to the second one.

  2. Some other closure:

    move |(index, weight)| weight * blob[index]

    seemingly unrelated to anything that has changed

    • seems to burn up twice as much time as before

    • before, it spent 75% of its time in the indexing op; now it only spends 50% of its time there.

I didn't expect this change to speed my code up (only to simplify it), but I certainly didn't expect it to slow the code down so noticeably!

Questions

Does this sort of thing ring any bells with you?

How can I look at the assembly generated for this closure (and compare the before/after versions), which is embedded deep within my code? I don't know how to find it in perf or kcachegrind.

Even more details

Description of the code

I have tried to summarize the pertinent structure of my code in words. Corresponding pseudocode is below.

  • main passes a huge dataset data and two trivial values a and b into aaa

  • These three things get passed down a chain of functions (bbb, ccc)

  • ccc loops over the data calling generate_stuff

  • generate_stuff (depending on a CLI switch) either uses a and b to make a closure, or uses the identity closure instead

  • the stuff generated by generate_stuff is used to calculate a single value

  • calculate_value iterates over stuff mapping a simple closure: the one that changes in the flamegraph

  • (afterwards, there is a further iteration over stuff which is why the stuff was collected into a vector)

The change consists of creating the closure in main and passing the closure up the stack, rather than passing a and b up the stack and creating the closure repeatedly inside the 'loop'.

NB:

  1. It's not the closure whose creation point has changed that slows down: it's the one inside calculate_value.

  2. The timings are performed in the mode where the closure made from a and b is not being used at all!

Outline of the code

fn main() {
   let a = ...;
   let b = ...;
   let data = ...; // typical size: 1 million
   let result = aaa(&data, a, b);
}

fn aaa(..., data, a, b) { bbb(..., data, a, b) }
fn bbb(..., data, a, b) { ccc(..., data, a, b) }
fn ccc(..., data, a, b) {
   let blob = ...;
   data.iter().for_each(|datum| {
       let stuff = generate_stuff(..., a, b).collect(); // typical size: 60
       let value = calculate_value(&blob, stuff.iter().copied());
       // mutate blob using stuff and value
       ...
   });
   blob
}

fn generate_stuff<'a>(..., a, b) -> impl Iterator<Item = (Index, Weight)> + 'a {
   // The timings are performed with the CLI option set *NOT* to use the closure
   let closure = match cli_option {
       Some(sigma) => make_closure(a, b),
       None        => Box::new(|x| x),
   };
   make_stuff(...).map(closure)
}

fn calculate_value(blob: &Blob, stuff: impl Iterator<Item = (Index, Weight)>) -> Weight {
   stuff
       // *THIS* is the closure which seems to slow down
       .map(move |(index, weight)| weight * blob[index])
       .sum()
}
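
Schematically (glossing over details in the same way as above, so this may not match the real code exactly), the changed version builds the closure once in main and passes it along the chain instead of a and b:

fn main() {
   let a = ...;
   let b = ...;
   let data = ...; // typical size: 1 million
   // The CLI match now happens once, here, instead of once per datum
   let closure = match cli_option {
       Some(sigma) => make_closure(a, b),
       None        => Box::new(|x| x),
   };
   let result = aaa(&data, closure);
}

fn aaa(..., data, closure) { bbb(..., data, closure) }
fn bbb(..., data, closure) { ccc(..., data, closure) }
fn ccc(..., data, closure) {
   let blob = ...;
   data.iter().for_each(|datum| {
       let stuff = generate_stuff(..., &closure).collect(); // typical size: 60
       let value = calculate_value(&blob, stuff.iter().copied());
       // mutate blob using stuff and value
       ...
   });
   blob
}

fn generate_stuff<'a>(..., closure) -> impl Iterator<Item = (Index, Weight)> + 'a {
   make_stuff(...).map(closure)
}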

I'm guessing that the compiler is optimizing the Some(sigma) case in this version so that the closure gets inlined. But when you pass in a closure from outside the function, that's not an option, so you get a dynamic dispatch penalty.
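
A minimal, self-contained sketch of that distinction (hypothetical functions, nothing like the original code): a closure passed as a generic parameter can be monomorphized and inlined, while one behind dyn Fn goes through a vtable.

// Static dispatch: the closure's concrete type is a generic parameter,
// so the call can be monomorphized and typically inlined.
fn apply_static(f: impl Fn(u64) -> u64, x: u64) -> u64 {
    f(x)
}

// Dynamic dispatch: the concrete type is erased behind a vtable,
// so each call is an indirect jump and is much harder to inline.
fn apply_dyn(f: &dyn Fn(u64) -> u64, x: u64) -> u64 {
    f(x)
}

fn main() {
    let double = |x| x * 2;
    println!("{}", apply_static(double, 21)); // statically dispatched
    println!("{}", apply_dyn(&double, 21));   // goes through the vtable
}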


I think you've over-condensed your code. It'd be easier to guess why performance decreased if your code had the full function signatures with types included, and you showed us the code both with and without the change that decreased performance.

As for perf and cachegrind, you'll want debug symbols enabled. I assume you're profiling in release mode as there's no point to profiling in dev mode. You can either enable optimizations in the dev profile or enable debug symbols in the release profile:

[profile.dev]
#debug = true # This is the default for dev profiles
opt-level = 3

or

[profile.release]
debug = true
# opt-level = 3 # This is the default for release profiles

Source code output should be present if you compiled with debug info.
To get assembly output in cachegrind, you need to pass certain flags when recording the profile. You'll want something like

$ valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes target/debug/foo 

To get assembly output in perf, all you have to do is hover over the symbol and then press 'a' (for 'annotate'). While viewing the annotated symbol, you can press 's' to "Toggle source code view" and 'k' to toggle line numbers for source code.
Press '?' to view help when in the main menu and 'h' to view help when viewing annotated symbols.

Right, except that all these timings are performed with the CLI option set to None, so the Some(sigma) case shouldn't be relevant at all. This is the closure whose creation has been moved, but this is NOT the closure which appears to slow down. This closure here is never called!

Remember that making a closure is essentially free. There's basically no value in keeping one around. You might be interested in https://rustyyato.github.io/rust/syntactic/sugar/2019/01/17/Closures-Magic-Functions.html

My psychic debugging guess here: whatever the closure is using, making a new one every time means that any borrows involved are very short and easily removed, but making one at a high level means there's a long-lived borrow that the compiler can't get rid of, and thus there's extra indirection induced.
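
Purely to illustrate the shape I mean (hypothetical names, not the original code):

// Variant 1: the closure is built inside the hot function. It captures Copy
// values, so any borrow of `a`/`b` is confined to this call and is easy for
// the optimizer to eliminate.
fn hot_path_local(a: f64, b: f64, xs: &[f64]) -> f64 {
    let f = move |x: f64| a * x + b;
    xs.iter().map(|&x| f(x)).sum()
}

// Variant 2: the closure is built once, far up the stack, and passed down by
// reference. The borrow now spans the whole call chain, which can leave an
// extra level of indirection that the optimizer may not see through.
fn hot_path_passed(f: &impl Fn(f64) -> f64, xs: &[f64]) -> f64 {
    xs.iter().map(|&x| f(x)).sum()
}

fn main() {
    let (a, b) = (2.0, 1.0);
    let xs = [1.0, 2.0, 3.0];
    let f = move |x: f64| a * x + b;
    assert_eq!(hot_path_local(a, b, &xs), hot_path_passed(&f, &xs));
}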

I'm remembering that other example where println!("{}", {n}); made the algorithm calculating n way faster than it was with println!("{}", n); because of the copy-vs-borrow difference.


Yes, I feared that that might be the case. But there's just too much of it, and it's too much of a mess. I'll see whether I can come up with something more useful.

In the meantime ... as to your other points:

  • Yes, I'm profiling and timing the --release version.

  • Yes, I have switched on debugging symbols in release mode.

  • Yes, I know all about perf's a, k and ?, and yes, I am getting to see the assembly, source code and line numbers: what I don't know how to do is find the closure I'm looking for among everything that perf shows me, because I don't know how to search for it, because, being a closure, it doesn't have a name.

    (Though I don't know what you mean by 'hover over the symbol' in perf: I'm in a TUI, where hovering doesn't do anything.)

Yes, but my motivation for creating it in main was not to get an increase in speed, but to pass just a single thing (rather than multiple values) up the stack, because I will be wanting to create a greater variety of closures from a greater variety of inputs. So I'm not disappointed that it hasn't got any faster, but I am disappointed that it has got slower.

That's an interesting observation, in general. But I find it hard to believe that it can be in play here because:

  • The CLI option being set to None means that this closure (regardless of where it is made) is never called.
  • It's some other, seemingly unrelated closure, that slows down.

The difference between n and {n} being that one is inside a block and the other is not? And this makes a difference to whether n is borrowed or not?

Do you have a reference to this example?

Sorry, I don't remember where I saw the example.

The difference with {n}: https://bluss.github.io/rust/fun/2015/10/11/stuff-the-identity-function-does/#rust-has-dedicated-syntax-for-this

I meant highlighting a line. Like the yellow line here:

Yeah, the closures end up getting all sorts of fun names. What I've found effective is instead to look at and follow the callgraph to find the closure.
There is a way to do this in perf (perf record -a --call-graph dwarf target/..., whose perf report output you see in the above picture) but it ends up not being especially helpful because when you highlight the symbol, perf won't annotate it and will instead say

No samples for the symbol. Probably appeared just in a callchain

Cachegrind ends up being a lot better in this regard:


Clicking on line #192 on the right pane will highlight line #24070 on the left pane.

Trivial question: How did you get a dark theme in cachegrind?

As I've seen it: println! takes everything as a reference. The new scope introduced by {n} actually returns the value of n. That means that Rust can then use Copy if the type of n is faster to copy than to reference.
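
A tiny illustration of the mechanism (not the original example):

fn main() {
    let n: u64 = 42;
    // `n` itself is borrowed by the formatting machinery.
    println!("{}", n);
    // `{ n }` is a block expression that evaluates to a fresh copy of `n`;
    // the formatting machinery borrows that temporary instead, so `n`'s own
    // storage is not kept borrowed by the print.
    println!("{}", { n });
}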

In general I 100% agree. However I've also seen some special cases where the code actually ends up nicer to read (and less indented) by extracting the closure to a local.
Of course that was also a situation where a 10-20% drop in performance isn't likely to be noticed at all, so that gave me some leeway in that dimension.

What types might be faster to copy than to reference?

Any type whose size is less than or equal to the size of a reference. Usually a reference takes 8 bytes (on a 64-bit system) or 4 bytes (on a 32-bit system). You also need to consider that reading from a reference is very cheap but not exactly free, so depending on your usage you could see differences even for types a bit larger than a reference.

All primitives (ints, floats, bool, etc.) are faster to copy than reference; basically anything that's usize or smaller. I've also seen it claimed that simple structs (up to ~10x the size of usize) are probably within that window as well (architecture dependent, of course).
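
For instance (hypothetical type), a 2×usize-sized struct like this is generally at least as cheap to pass by value as behind a reference: the two fields can travel in registers, and the callee avoids a pointer dereference (unless the call gets inlined away anyway).

#[derive(Clone, Copy)]
struct Point { x: f64, y: f64 } // 16 bytes on a 64-bit target

// By value: the two f64s are handed over directly (often in registers).
fn len_by_value(p: Point) -> f64 {
    (p.x * p.x + p.y * p.y).sqrt()
}

// By reference: the callee gets a pointer and loads the fields through it.
fn len_by_ref(p: &Point) -> f64 {
    (p.x * p.x + p.y * p.y).sqrt()
}

fn main() {
    let p = Point { x: 3.0, y: 4.0 };
    assert_eq!(len_by_value(p), 5.0);
    assert_eq!(len_by_ref(&p), 5.0);
}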

At least 2×usize is still completely fine -- there's even special handling for that in ABIs. (See ScalarPair in rustc, if I remember right.)

Probably once over 64 bytes (8×usize, on x64) I'd default to referencing, as that's a common cache line size.

In between? 🤷


I haven't tested it, but I suspect types with #[repr(align(4096))] or larger will be quite efficient to copy as well. That's the memory page size on many systems, which means the copies can be implemented purely as page table updates instead of moving the actual bits around.

I didn't have to configure the theme for cachegrind. I'm just globally using the "Breeze Dark" theme on my KDE Plasma desktop. Both KCachegrind and its pure Qt version QCachegrind show up with that same global theme.

Why is copying a reference any slower than copying a usize?

What is a reference beyond a pointer (same size as usize) with a bunch of compile-time rules?

Why?

I assumed that a reference occupies one word (whatever that happens to be on the given architecture), and I always believed that copying one word is no more expensive than copying a fraction of a word: one machine instruction.

I can imagine that with SIMD coming into play, when comparing the cost of copying multiple primitives vs the cost of copying multiple references, the smaller primitives might end up being cheaper. But can copying a single small primitive be cheaper than copying a single reference?