#[inline] produces different output than both #[inline(never)] and #[inline(always)]

Backstory

I wanted to build a sed-like regex replacement bot for my Telegram chat. Using a regex to detect the s/.../.../ pattern wasn't very nice and felt like overkill, so I created a mini finite state machine of sorts, which I then tried to make faster and faster. My friend was wondering how much faster my code was compared to regex, so I created a ton of benchmarks with many different test strings and ran them against all the different variations to see which version would be fastest.
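
For the curious, by "mini finite state machine" I mean a hand-rolled scanner along these lines. This is a simplified sketch, not my actual bot code:

// Rough idea: walk the characters once, splitting on unescaped '/'.
// parse_sed("s/tpyo/typo/") gives Some(("tpyo", "typo", "")).
fn parse_sed(cmd: &str) -> Option<(String, String, String)> {
    let rest = cmd.strip_prefix("s/")?;
    let mut parts: Vec<String> = vec![String::new()];
    let mut chars = rest.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                // keep escaped characters (including "\/") literally
                if let Some(next) = chars.next() {
                    parts.last_mut().unwrap().push(next);
                }
            }
            '/' => parts.push(String::new()),
            _ => parts.last_mut().unwrap().push(c),
        }
    }
    if parts.len() < 2 || parts.len() > 3 {
        return None;
    }
    let flags = if parts.len() == 3 { parts.pop().unwrap() } else { String::new() };
    let replace = parts.pop().unwrap();
    let find = parts.pop().unwrap();
    Some((find, replace, flags))
}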

Once I thought I had created my fastest version, I tested it and, to my surprise, one particular benchmark was much slower than in the previous iterations, even though all the other tests were faster.

(best_5 was the abnormally slow one here)

[screenshot: benchmark results]

I decided to try and find out what the issue was, so I started removing benchmarks to clean up the clutter, and it suddenly caused the function to become faster.

[screenshot: benchmark results after removing some benchmarks]

best_5 had miraculously become faster. I tried to create a minimal working (or rather, breaking) example that showed the same anomaly, but almost any small change I made caused the code to become fast. The original code, however, would always be slow.

Even altering unrelated code caused it to become fast.

Inlining

I had #[inline] on my function, and decided to see what would happen if I altered it.

  • no inline = fast
  • #[inline] = slow
  • #[inline(always)] = fast
  • #[inline(never)] = fast

This made no sense to me. But then again, almost any other change to any function made it go fast (except removing the regex_0 benchmark).

I have no idea what's going on and I would love to find out.

The code in question in its (slow) form: https://github.com/JuanPotato/what-the-heck-is-going-on

Video for the skeptical: What the heck is going on - YouTube

So far only one friend has seen the code behave the same way for them. So if you do test it, please say what you experienced and what your computer specs are.

EDIT: I'm not sure what the best way is to get asm output for the different builds and compare them, but I think that would be helpful.
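
Something like this should work, I think (the exact binary names under target/release/deps are guesses):

# disassemble the bench binary that cargo bench built
objdump -d -M intel target/release/deps/testing-<hash> > with_inline.asm
# or ask rustc for assembly directly; the .s files should show up under target/release/deps/
cargo rustc --release -- --emit asm

Do that once for each build and diff the results.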


Don't use #[inline] everywhere. In fact, it's usually best not to use #[inline] at all in executable programs: it rarely improves performance much, but it increases pressure on the CPU instruction cache, which slows things down in a full program (RAM access is relatively slow, so the less the CPU has to go to RAM the better). Your functions are very poor inlining targets because they're huge as far as the optimizer is concerned, so inlining won't help and can even hurt (in fact, I ran your benchmarks with all uses of #[inline] removed, and it changed pretty much nothing). And even if your functions were smaller, the compiler is smart enough to decide when to inline within the same compilation unit.

(Note: I specifically mention executable programs here. For libraries you may want to use #[inline] on smaller public functions, because without it, or LTO, the compiler won't inline non-generic functions across crate boundaries.)
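
For example, something in this shape is where the hint actually pays off (a made-up library function, just to illustrate):

// Small, hot, non-generic function in a library crate. Without #[inline]
// (or LTO), downstream crates only see its signature and can never inline
// the call; the hint makes the body available across the crate boundary.
#[inline]
pub fn is_separator(b: u8) -> bool {
    b == b'/'
}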


Observing things like this can indicate changes in memory layout of the program’s working set (data + instructions) causing performance differences.

Stabilizer | Emery Berger is a testing framework built by UMass CS faculty to attempt to mitigate perf differences due strictly to layout.

If you really wanted to dig deeper, you can run the slow and fast version under perf (on Linux) and study the cpu perf counter differences as well as asm-level cycle and event attribution.
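
For example, assuming Linux and the bench binary that cargo bench drops under target/release/deps (the name here is a guess):

perf stat -e cycles,instructions,cache-misses,branch-misses ./target/release/deps/testing-<hash> --bench
perf record ./target/release/deps/testing-<hash> --bench
perf report      # per-function attribution
perf annotate    # per-instruction cycles for the hot functions

Comparing the counters for the slow and fast builds side by side is often enough to see whether it's an instruction-cache or branch-prediction effect.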


Thanks for the tips on inline. I wonder if I can still manage to cause the slowdown without having #[inline] around.

First I gotta look up how to use perf

I bet it's due to the inlining affecting further inlining decisions within the function. Try this:

RUSTFLAGS="-C remark=inline -C llvm-args=-pass-remarks-analysis=inline -C debuginfo=2"

This will give you a list of every inlining decision made and whether or not it decided to inline (and why). It's rather… verbose… but especially combined with perf, it might give you a better idea of what's going on.
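
To actually apply that, I'd do something like the following (a cargo clean first may be needed so everything gets rebuilt with the new flags; the remarks come out as compiler diagnostics on stderr):

RUSTFLAGS="-C remark=inline -C llvm-args=-pass-remarks-analysis=inline -C debuginfo=2" cargo bench 2> inline-remarks.txt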


That's what I was thinking while reading, but the list of results above (plain #[inline] slow, while both always and never are fast) threw me off. AFAIK, #[inline] is only a hint to the compiler, and it may or may not inline, but then at least one of the always/never variants should show the same perf (modulo noise), unless there are other effects here. And "even altering unrelated code caused it to become fast" really smells of layout differences.

Hmm… I tried comparing #[inline] versus #[inline(never)] on my system. For me, #[inline] appears to be faster (the opposite of OP's result) but only very slightly, within the margin of error. I tried comparing the generated assembly, and the testing::get_bounds function was almost identical in each, but for some reason with slightly different register allocation (and the functions were in a different order).

I got the LLVM IR output for both and diffed them; this was the result (the unmodified GitHub code versus the version with #[inline] removed):

74,76c74,76
< @panic_bounds_check_loc.15 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 219, i32 13 }, align 8
< @panic_bounds_check_loc.16 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 230, i32 8 }, align 8
< @panic_bounds_check_loc.17 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 231, i32 9 }, align 8
---
> @panic_bounds_check_loc.15 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 220, i32 13 }, align 8
> @panic_bounds_check_loc.16 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 231, i32 8 }, align 8
> @panic_bounds_check_loc.17 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 232, i32 9 }, align 8
10847,10848c10847,10848
< ; Function Attrs: uwtable
< define internal fastcc void @_ZN7testing10get_bounds17h62cdf28873d0c3a1E(%"core::option::Option<(&str, &str)>"* noalias nocapture dereferenceable(32), [0 x i8]* noalias nonnull readonly %string.0, i64 %string.1) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
---
> ; Function Attrs: inlinehint uwtable
> define internal fastcc void @_ZN7testing10get_bounds17h62cdf28873d0c3a1E(%"core::option::Option<(&str, &str)>"* noalias nocapture dereferenceable(32), [0 x i8]* noalias nonnull readonly %string.0, i64 %string.1) unnamed_addr #6 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
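
In case anyone wants to reproduce the IR dump, something like this should do it (I don't remember the exact invocation I used, and the file names under target/release/deps will differ):

cargo rustc --release -- --emit llvm-ir
# produces something like target/release/deps/testing-<hash>.ll;
# build once with #[inline] and once without, keep both .ll files, then diff them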

I ran into a similar surprise, except I wasn't even playing with #[inline]. I simply un-inlined a short expression into its own function, and the benchmark reports a serious performance boost. I don't think there's anything simpler than my ssomers/rust_bench_inlining GitHub repository to demonstrate it.
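
The experiment is roughly this shape (paraphrased, not the exact repo code):

use std::collections::BTreeSet;

// "fast": the peek expression factored out into its own little function
fn peek(set: &BTreeSet<u64>) -> Option<&u64> {
    set.iter().next()
}

fn peek_via_function(set: &BTreeSet<u64>) -> bool {
    peek(set).is_some()
}

// "slow": the exact same expression written directly at the use site
fn peek_written_inline(set: &BTreeSet<u64>) -> bool {
    set.iter().next().is_some()
}

fn main() {
    let set: BTreeSet<u64> = (0..100).collect();
    assert_eq!(peek_via_function(&set), peek_written_inline(&set));
}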

On my system, typical output is:

test btreeset_peek_fast ... bench:           5 ns/iter (+/- 0)
test btreeset_peek_slow ... bench:           9 ns/iter (+/- 0)

where the slow bench is the one with the expression written inline. Edited with N=100, it's more than twice as slow.


I got super similar results with

test btreeset_peek_fast ... bench:           4 ns/iter (+/- 0)
test btreeset_peek_slow ... bench:          10 ns/iter (+/- 0)

That's really odd. Adding #[inline] to btreeset_peek makes the fast one slow. But why.

I've had Visual C++ benchmark code speed up by almost a factor of two for no good reason, but then I couldn't get it to behave consistently. Here the effect is even bigger and entirely consistent, it seems. Swap the order in which the benchmarks are defined: same numbers. Change the names, i.e. the order in which they're executed: same numbers. Build and run each bench separately: same numbers.

I ran it a few times: btreeset_peek does get inlined in the slow bench, but part of the benchmarking harness, test::ns_iter_inner, doesn't get inlined in that case. The compiler also seems to pick some (apparently slower) memory/register and SIMD instructions. (The assembly is beyond my limited knowledge.)

Running with RUSTFLAGS="-C lto=y" makes them both show the same, but slow, time for me. I couldn't find any other flag that had much of an impact. (But I haven't tried everything.)