#[inline] produces different output than both #[inline(never)] and #[inline(always)]


#1

Backstory

I wanted to build a sed-like regex replacement bot for my Telegram chat. Using regex to detect the s/…/…/ pattern wasn’t very nice and felt like overkill, so I wrote a mini finite state machine of sorts, which I kept trying to make faster. My friend was wondering how much faster my code was compared to regex, so I created a ton of benchmarks with many different test strings and ran every variation of my code against them to see which version would be fastest.
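
For readers who have not opened the linked repository, here is a rough idea of what a state machine for the s/…/…/ pattern might look like. This is a hypothetical sketch, not the code from the repo: the name split_sed, the "backslash skips the next byte" escape rule, and allowing the trailing slash to be omitted are all assumptions made for illustration.

```rust
/// Hypothetical sketch: split "s/pattern/replacement/" into its two parts.
/// Not the original author's code; escape handling is an assumption.
fn split_sed(input: &str) -> Option<(&str, &str)> {
    let bytes = input.as_bytes();
    // The command must start with "s/".
    if bytes.len() < 2 || bytes[0] != b's' || bytes[1] != b'/' {
        return None;
    }

    // The two states of the machine: which part is currently being scanned.
    #[derive(Clone, Copy)]
    enum State {
        Pattern,
        Replacement,
    }

    let mut state = State::Pattern;
    let mut start = 2; // index of the first byte of the current part
    let mut pattern = None;
    let mut i = 2;
    while i < bytes.len() {
        match bytes[i] {
            // A backslash escapes the next byte, so step over it.
            b'\\' => i += 1,
            // An unescaped slash ends the current part.
            b'/' => match state {
                State::Pattern => {
                    pattern = Some(&input[start..i]);
                    start = i + 1;
                    state = State::Replacement;
                }
                State::Replacement => {
                    return Some((pattern?, &input[start..i]));
                }
            },
            _ => {}
        }
        i += 1;
    }

    // Assumption: the closing slash may be omitted, as in "s/a/b".
    match state {
        State::Replacement => Some((pattern?, &input[start..])),
        State::Pattern => None,
    }
}
```

Under these assumptions, split_sed("s/foo/bar/") yields the pair ("foo", "bar"), and a string that does not start with "s/" yields None. Slicing at the slash positions is safe because '/' is a single ASCII byte, so the indices always fall on character boundaries.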

Once I thought I had created my fastest version, I tested it and, to my surprise, one particular benchmark was much slower than in the previous iterations, even though all the other tests were faster.

(best_5 was the abnormally slow one in this run)

image

I decided to try to find out what the issue was, so I started removing benchmarks to clean up the clutter, and the function suddenly became faster.

image

best_5 had miraculously become faster. I tried to create a minimal working (or breaking, for that matter) example that showed the same anomaly, but almost any small change I made caused the code to become fast. The original code, however, was always slow.

Even altering unrelated code caused it to become fast.

Inlining

I had #[inline] on my function, and decided to see what would happen if I altered it.

  • no inline = fast
  • #[inline] = slow
  • #[inline(always)] = fast
  • #[inline(never)] = fast

This made no sense to me. But then again, almost any other change to any function made it go fast (except removing the regex_0 benchmark).

I have no idea what’s going on and I would love to find out.
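
To make the four cases above concrete, here is a minimal sketch of what they look like in source. The function count_slashes_* is a hypothetical stand-in, not the function from the repository; the only thing that differs between the four copies is the attribute. As far as I understand the Rust reference, all three attribute forms are hints to the optimizer, with always and never being much stronger than plain #[inline].

```rust
// Hypothetical stand-in for the benchmarked function.
// No attribute: the optimizer decides on its own.
fn count_slashes_plain(s: &str) -> usize {
    s.bytes().filter(|&b| b == b'/').count()
}

// #[inline]: a hint that inlining is probably worthwhile.
#[inline]
fn count_slashes_hint(s: &str) -> usize {
    s.bytes().filter(|&b| b == b'/').count()
}

// #[inline(always)]: a strong request to inline at every call site.
#[inline(always)]
fn count_slashes_always(s: &str) -> usize {
    s.bytes().filter(|&b| b == b'/').count()
}

// #[inline(never)]: a strong request to keep this as a real call.
#[inline(never)]
fn count_slashes_never(s: &str) -> usize {
    s.bytes().filter(|&b| b == b'/').count()
}
```

All four return the same value for the same input; only the generated machine code (and therefore possibly the timing) can differ.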

The code in question in its (slow) form: https://github.com/JuanPotato/what-the-heck-is-going-on

Video for the skeptical: https://youtu.be/T4CAL93vO_4

So far only one friend has seen the same behavior on their machine. So if you do test it, please say what you experienced and what your computer specs are.

EDIT: I’m not sure what the best way would be to get asm output for the different builds and compare them, but I think that would be beneficial.


#2

Don’t use #[inline] everywhere. In fact, it would be best to never use #[inline] in executable programs, as it usually doesn’t improve performance much but does increase pressure on the CPU instruction cache, which means things will slow down in a full program (RAM access is relatively slow; the less the CPU has to access RAM, the better). Your functions are very poor inlining targets, being huge as far as the optimizer is concerned, so #[inline] won’t help and may even harm (in fact, I ran your benchmarks after removing all uses of #[inline], and it changed pretty much nothing). But even if your functions were smaller, the compiler is smart enough to know when to inline within the same compilation unit.

(Note: I specifically mention executable programs here. For libraries you may want to use #[inline] on smaller public functions, because without it the compiler won’t inline non-generic functions across crates unless LTO is enabled.)
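
As an illustration of the library case in the note above, here is a small sketch. The Celsius type and its method are hypothetical and exist only to show the shape of a function where #[inline] is commonly applied: small, public, and non-generic.

```rust
// Hypothetical library type, for illustration only.
pub struct Celsius(pub f64);

impl Celsius {
    // Small, public, and non-generic: marking it #[inline] lets code in
    // other crates inline the body without relying on LTO.
    #[inline]
    pub fn to_fahrenheit(&self) -> f64 {
        self.0 * 9.0 / 5.0 + 32.0
    }
}
```

A generic function would not need the attribute for this purpose, since it is instantiated in the crate that calls it anyway.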


#3

Observing things like this can indicate changes in memory layout of the program’s working set (data + instructions) causing performance differences.

https://emeryberger.com/research/stabilizer/ is a testing framework built by UMass CS faculty to attempt to mitigate perf differences due strictly to layout.

If you really want to dig deeper, you can run the slow and fast versions under perf (on Linux) and study the differences in the CPU performance counters, as well as the asm-level cycle and event attribution.


#4

Thanks for the tips on inline, I wonder if I can still manage to cause the slowdown without having inline around.

First I gotta look up how to use perf


#5

I bet it’s due to the inlining affecting further inlining decisions within the function. Try this:

RUSTFLAGS="-C remark=inline -C llvm-args=-pass-remarks-analysis=inline -C debuginfo=2"

This will give you a list of every inlining decision made and whether or not it decided to inline (and why). It’s rather… verbose… but especially combined with perf, it might give you a better idea of what’s going on.


#6

That’s what I was thinking while reading but then

threw me off. AFAIK, #[inline] is a hint to the compiler, and it may or may not inline. But then one of the always/never variants should show the same perf (modulo noise), unless there are other effects here. And then

really smells of layout differences.


#7

Hmm… I tried comparing #[inline] versus #[inline(never)] on my system. For me, #[inline] appears to be faster (the opposite of OP’s result) but only very slightly, within the margin of error. I tried comparing the generated assembly, and the testing::get_bounds function was almost identical in each, but for some reason with slightly different register allocation (and the functions were in a different order).


#8

I got the LLVM IR output for both builds (the unmodified GitHub code, and the same code with #[inline] removed) and diffed them. This was the result:

74,76c74,76
< @panic_bounds_check_loc.15 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 219, i32 13 }, align 8
< @panic_bounds_check_loc.16 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 230, i32 8 }, align 8
< @panic_bounds_check_loc.17 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 231, i32 9 }, align 8
---
> @panic_bounds_check_loc.15 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 220, i32 13 }, align 8
> @panic_bounds_check_loc.16 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 231, i32 8 }, align 8
> @panic_bounds_check_loc.17 = private unnamed_addr constant { { [0 x i8]*, i64 }, i32, i32 } { { [0 x i8]*, i64 } { [0 x i8]* bitcast ([11 x i8]* @str.12 to [0 x i8]*), i64 11 }, i32 232, i32 9 }, align 8
10847,10848c10847,10848
< ; Function Attrs: uwtable
< define internal fastcc void @_ZN7testing10get_bounds17h62cdf28873d0c3a1E(%"core::option::Option<(&str, &str)>"* noalias nocapture dereferenceable(32), [0 x i8]* noalias nonnull readonly %string.0, i64 %string.1) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
---
> ; Function Attrs: inlinehint uwtable
> define internal fastcc void @_ZN7testing10get_bounds17h62cdf28873d0c3a1E(%"core::option::Option<(&str, &str)>"* noalias nocapture dereferenceable(32), [0 x i8]* noalias nonnull readonly %string.0, i64 %string.1) unnamed_addr #6 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {