Odd benchmark results; compiler optimisation wall?

Hi,

Some background

I maintain a Rust library called Frunk. It's currently a grab bag of functional programming constructs like HList, Monoid, Semigroup, etc. Among these is LabelledGeneric, which, when mixed with HList allows one to do boilerplate-free transforms between structs.

How conversion works

To give a quick overview of what happens (code-wise, anyway) when you transform one struct (Struct A, let's say) into another (Struct B):

  1. Struct A is turned into a "type-labelled" HList (field names are encoded at the type level); let's call this LabelledRep A. This uses an implementation of the LabelledGeneric trait that is trivially generated via a custom derive.
  2. LabelledRep A is "rearranged", or "sculpted", to fit the LabelledGeneric representation of Struct B, using another trait implemented for HLists. This happens recursively on the stack and is largely driven by the compiler at compile time. Let's call the end result LabelledRep B.
  3. LabelledRep B (remember, it's an HList) is deconstructed, and its values are used to construct Struct B.
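
Steps 1 and 3 can be sketched with a toy two-field version. Note this is a simplified stand-in, not Frunk's real API: the actual LabelledGeneric also carries field names at the type level, and the derive writes these impls for you.

```rust
// A minimal heterogeneous list: a head value plus a tail list, ending in HNil.
pub struct HNil;

pub struct HCons<H, T> {
    pub head: H,
    pub tail: T,
}

// Simplified stand-in for LabelledGeneric (no type-level field names here).
pub trait Generic {
    type Repr;
    fn into_repr(self) -> Self::Repr;
    fn from_repr(repr: Self::Repr) -> Self;
}

pub struct Point {
    pub x: i32,
    pub y: i32,
}

impl Generic for Point {
    type Repr = HCons<i32, HCons<i32, HNil>>;

    // Step 1: struct -> HList representation.
    fn into_repr(self) -> Self::Repr {
        HCons { head: self.x, tail: HCons { head: self.y, tail: HNil } }
    }

    // Step 3: HList representation -> struct.
    fn from_repr(r: Self::Repr) -> Self {
        Point { x: r.head, y: r.tail.head }
    }
}
```

Step 2 (sculpting) is the interesting part: another trait, implemented recursively over HLists, moves each element into the position the target representation expects.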

Issue

In an attempt to maximise the chances that all of this is a zero-cost abstraction, I stuck to using the stack when writing it, and it largely works to that end. In normal cases, there is zero difference between using LabelledGeneric::transform_from vs From::from (as you'll see below). This, by the way, is extremely impressive and speaks volumes about Rust's compiler and its integration with LLVM; I can't say more because I really have no idea how it manages to optimise away all those steps, but I am really impressed and happy with it.

However, when I try to push things to their limits, I get some weird results when benchmarking.

To benchmark, I wanted the scenario where a LabelledGeneric-based transformation would be at its slowest. Due to the way "sculpting" works, this is when one struct's fields are in the reverse order of the other's, e.g. BigStruct { a: i8, b: String, ... } to BigStructReverse { z: i8, y: String, ... }.
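
A scaled-down, hypothetical version of that worst case (three fields instead of 24+, names illustrative) looks like this, together with the kind of hand-written From impl that transform_from is measured against:

```rust
// Worst case for sculpting: the target declares its fields in the
// reverse order of the source, so every element has to be moved.
pub struct Small {
    pub a: i8,
    pub b: String,
    pub c: u32,
}

pub struct SmallReverse {
    pub c: u32,
    pub b: String,
    pub a: i8,
}

// The hand-written conversion used as the baseline in the benchmarks.
impl From<Small> for SmallReverse {
    fn from(s: Small) -> SmallReverse {
        SmallReverse { c: s.c, b: s.b, a: s.a }
    }
}
```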

Comparing the LabelledGeneric-backed transform_from vs a hand-written From when the struct has 24 fields, the results look quite favourable for transform_from:

test big_from_24fields           ... bench:         116 ns/iter (+/- 19)
test big_transform_from_24fields ... bench:         106 ns/iter (+/- 17)

But once we do it with a struct containing 25 fields, performance falls off a cliff for transform_from!

test big_from_25fields           ... bench:         133 ns/iter (+/- 68)
test big_transform_from_25fields ... bench:       4,919 ns/iter (+/- 1,400)

Now, for the real kicker, if I have both of the above benchmarks enabled, big_transform_from_24fields also takes a nosedive!!

test big_from_24fields           ... bench:         110 ns/iter (+/- 50)
test big_from_25fields           ... bench:         131 ns/iter (+/- 5)
test big_transform_from_24fields ... bench:       4,594 ns/iter (+/- 180) <-- !!! This used to be 106ns!
test big_transform_from_25fields ... bench:       4,682 ns/iter (+/- 662)

At the moment I really don't know why this is the case. Maybe someone with experience reading compiled Rust output, or with knowledge of the compiler internals, can take a look. I'm curious but sadly don't have the skills :)

Benchmark code here

Note that I obtained these results on a nightly Rust from ~3 days ago.

Ask

Like I said, I'm really happy that the Rust compiler + LLVM are seemingly able to make LabelledGeneric::transform_from a zero-cost abstraction. I also understand that for the most part nobody is going to be using 24+ fields in their structs.

I'm just really curious about these results and am open to pointers on how/where to start investigating. My current thoughts are:

  • Maybe my benchmarks are bogus (written wrong)?
  • Maybe this is related to the new optimisation-fuel stuff recently checked into nightly (the compiler ran out of optimisation fuel at 24+ fields)?

Thanks for reading, and any help is appreciated.

I haven't looked at your code in detail, but it looks like you have a lot of nested function calls which can ultimately be simplified into relatively trivial code, but only if they're inlined. Thus, I suspect the performance cliff might have to do with inlining. As a test, you could try marking all the functions being called #[inline(always)] and see if it changes the results…

(That explanation would also help account for benchmarks affecting other benchmarks: it's much easier to decide that a function is worth inlining if it only has a single caller, i.e. inlining it won't cause duplication.)

You might also be running into - or will run into with even larger structs - thresholds for SROA, scalar replacement of aggregates, i.e. breaking up a struct-typed variable into separate variables for each of the fields. If I'm reading this LLVM code right, the default limit is 128 bytes or 32 fields, whichever is less:

http://llvm.org/docs/doxygen/html/ScalarReplAggregates_8cpp_source.html

edit: I think that code is out of date though, compared to nightly LLVM

@comex Thanks for the suggestions.

I'll give the inlining a try, well right after I figure out why I'm now getting the following error after updating my nightly:

Assertion failed: (ReqTy && "extractvalue indices invalid!"), function getExtractValue, file /Users/travis/build/rust-lang/rust/src/llvm/lib/IR/Constants.cpp, line 2096.

Such an LLVM assertion is always a compiler bug; it should be filed as an ICE in the rust repo.

Thanks, done.

@comex: I was able to get the benchmarks working by doing

rustup run nightly-2017-04-20 cargo bench (edit: nightly-2017-04-21 also works; it breaks from 2017-04-22).

and tried out using #[inline(always)]. Good news: it works, and I only needed to put it on the implementations of one function of one trait. LabelledGeneric::transform_from is now a zero-cost abstraction:

test big_from_24fields           ... bench:         109 ns/iter (+/- 49)
test big_from_25fields           ... bench:         129 ns/iter (+/- 9)
test big_transform_from_24fields ... bench:         104 ns/iter (+/- 24)
test big_transform_from_25fields ... bench:         131 ns/iter (+/- 13)
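
For illustration, the shape of the fix looks roughly like this. The trait and method names below are made up for the sketch, not Frunk's actual sculpting machinery; the point is where the annotation goes on a recursively-implemented HList trait:

```rust
pub struct HNil;

pub struct HCons<H, T> {
    pub head: H,
    pub tail: T,
}

// Illustrative recursive trait: each impl handles one step of a
// walk over the HList, much like Frunk's recursive trait impls.
pub trait Walk<Target> {
    fn walk(self) -> Target;
}

// Base case: nothing left to move.
impl Walk<HNil> for HNil {
    #[inline(always)] // force-inline so LLVM can collapse the whole chain
    fn walk(self) -> HNil {
        HNil
    }
}

// Recursive case: handle the head, recurse on the tail.
impl<H, T, T2> Walk<HCons<H, T2>> for HCons<H, T>
where
    T: Walk<T2>,
{
    #[inline(always)] // without this, deep recursion may stop being inlined
    fn walk(self) -> HCons<H, T2> {
        HCons { head: self.head, tail: self.tail.walk() }
    }
}
```

Because each recursion step is a separate monomorphised function, one #[inline(always)] on the trait method applies to every level of the recursion at once.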

I'm a bit hesitant about using inline(always) because I've read recommendations not to use it, and I feel the compiler would be a better judge of what to inline than I am; but the performance numbers don't lie.

I'm actually quite amazed that the Rust compiler can turn what I'm doing into something with zero runtime cost. Is there something I can start looking at to understand how it does its magic?

Inlining uses a function-size threshold to decide whether or not to include the code verbatim at the call site.

As I understand it, this process always proceeds bottom-up in the call graph. It's plausible that you have a couple of functions whose code size (before optimisations) grows linearly with the number of struct fields.

In that case you would always find a change in behaviour at some field count, but it won't necessarily be the same field count for every struct or use case.


Have you also tried using just #[inline]?


Yeah, I did. I moved to always only after I saw that plain #[inline] had no effect. In case you were curious:

test big_from_24fields           ... bench:         107 ns/iter (+/- 2)
test big_from_25fields           ... bench:         132 ns/iter (+/- 4)
test big_transform_from_24fields ... bench:       4,598 ns/iter (+/- 205)
test big_transform_from_25fields ... bench:       4,686 ns/iter (+/- 347)