Why is a lookup table faster than a match expression?


#1

I’ve run into a strange performance issue where using a lookup table is sometimes much faster than a match expression, which is quite surprising to me.
I’d expect the match expression to always compile optimally since all of the information is known in advance to the compiler.

This benchmark graph shows that inlined lookup is faster than inlined match in the second two scenarios:

The full source code is available here: https://github.com/lynaghk/question-rust-inlining

I assume this is something to do with memory or cache levels, but I’m not familiar with that sort of stuff.
I’d love:

  1. Any explanations for this performance
  2. Resources or tips for how to debug and reason about this kind of low-level performance
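For reference, here is a minimal sketch of the two shapes being compared (the actual benchmark code is in the repo linked above; the names, signatures, and the `'A'`/`'B'` mapping here are my assumptions):

```rust
// Hypothetical reconstruction of the alphabet-size-2 case;
// the real code lives in the linked repo.
fn match2(c: u8) -> u64 {
    match c {
        b'A' => 1,
        b'B' => 2,
        _ => 0,
    }
}

// A 256-entry table, so any u8 value indexes it without a visible bounds check.
static LUT2: [u64; 256] = {
    let mut t = [0u64; 256];
    t[b'A' as usize] = 1;
    t[b'B' as usize] = 2;
    t
};

fn lookup2(c: u8) -> u64 {
    LUT2[c as usize]
}
```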

Thanks fellow Rustaceans!


#2
example::match2:
  xor ecx, ecx
  cmp dil, 66
  sete cl
  add rcx, rcx
  cmp dil, 65
  mov eax, 1
  cmovne rax, rcx
  ret

example::lookup2:
  movzx eax, dil
  lea rcx, [rip + .L__unnamed_1]
  mov rax, qword ptr [rcx + 8*rax]
  ret

As you can see, lookup2 really is a lookup (in a table), which makes it incredibly fast; match2, on the other hand, won’t be translated into a lookup table.

I guess Rust tries to find a compromise between space and execution time. The size of one lookup table is 256 * 8 bytes, which the compiler apparently decides isn’t worth it, so it does the “slow” thing instead.

As you can see in the assembly, there is no real difference between your match2, match4 and match8. They all get compiled down to the exact same instructions (same with the lookup versions).


#3

As far as I know from my “High Performance Computing” class, that’s exactly the point:
Your benchmarks use the first 1000 elements of your lookup table (lut), which is about 8 kB (1000 * 8 bytes) and fits easily into your 4 MB cache (https://ark.intel.com/de/products/75114/Intel-Core-i7-4650U-Processor-4M-Cache-up-to-3-30-GHz-). The lut will therefore be loaded into your cache once, which makes accesses to it very fast, whereas the match version is recalculated every time.
That is a common problem when benchmarking luts. Indeed, it is possible that the lookup version slows down performance in a bigger program because the lut has to be loaded into the cache; the match version has no need for such loading.
For a better comparison, I recommend adding some additional computation to your benchmarks that ensures your cache gets overwritten by something else, e.g. reading and writing a (min. 4 MB) txt file.
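The suggestion above could be sketched like this (the buffer size and the exact thrashing work are assumptions; anything that touches more data than the last-level cache between measurements will do):

```rust
// Touch a buffer larger than the last-level cache between measured calls,
// so the lookup table has to be re-fetched from memory each iteration.
fn thrash_cache(buf: &mut [u8]) -> u64 {
    let mut sum = 0u64;
    for b in buf.iter_mut() {
        *b = b.wrapping_add(1); // the write forces each cache line in
        sum += *b as u64;
    }
    sum // returned so the loop can't be optimized away
}
```

In a benchmark loop you would call this between iterations, e.g. with `vec![0u8; 8 << 20]` (8 MB, larger than a 4 MB LLC).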


#4

The match is going to be quick if branch prediction goes your way, which means the lookups need to have a pattern that the CPU’s branch predictor picks up on. If you feed it a random pattern then it may be slower than a fully cached lookup table, as you’re seeing.

Compilers will themselves sometimes lower a match statement (or equivalent in whatever language) to a lookup/jump table. They have cost models/heuristics to determine what’s (likely) better.

As a rule of thumb (and only that!), I’d say given a choice between branches vs a memory load, pick branches. Microbenchmarks are almost always going to suggest that lookup tables are fast, but that doesn’t necessarily translate to larger/real workloads, because cache, TLB space, and so on are more scarce in those cases.
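One way to observe the predictor effect directly is to feed the same function a patterned stream vs a pseudo-random one and compare throughput (a sketch; the benchmark harness is omitted, and xorshift64 is used here only to avoid external crates):

```rust
// Patterned stream: the branch outcome repeats A,B,A,B..., which a
// branch predictor learns almost immediately.
fn patterned(n: usize) -> Vec<u8> {
    (0..n).map(|i| if i % 2 == 0 { b'A' } else { b'B' }).collect()
}

// Pseudo-random stream: the per-element branch outcome is effectively
// unpredictable, so mispredictions dominate.
fn random_ish(n: usize) -> Vec<u8> {
    let mut x = 0x9e3779b97f4a7c15u64; // arbitrary nonzero seed
    (0..n)
        .map(|_| {
            // xorshift64 step
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            if x & 1 == 0 { b'A' } else { b'B' }
        })
        .collect()
}
```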


#5

Thanks for the replies y’all!
I’ll go through and try to understand them.

looking at assembly

@hellow: I don’t know much about assembly, so I looked up x86 architecture and instructions.
Here are my annotations for fellow n00bs, let me know if I got anything wrong.

In both cases, I assume dil contains the argument and rax will be interpreted as holding the return value.

example::match2:
  xor ecx, ecx    //clear register ecx (see https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and/33668295 for why it's xor)
  cmp dil, 66     //compare dil with 66, which is ASCII 'B'.
  sete cl         //if comparison was equal, set lowest byte of the rcx register to 1.
  add rcx, rcx    //add rcx to itself, doubling it: rcx is now 2 if dil was 'B', else 0 (2 being the mapping we wanted from the match for 'B')
  cmp dil, 65     //compare dil with 65, which is ASCII 'A'.
  mov eax, 1      //move 1 into eax, which is lower 32 bits of rax.
  cmovne rax, rcx //if comparison was not equal (i.e., dil wasn't 'A'), move rcx into rax.
  ret

The lookup is a bit more complicated:

example::lookup2:
  movzx eax, dil              // moves 1-byte dil into lowest byte of 4-byte eax.
  lea rcx,                    // load effective address into rcx
      [rip + .L__unnamed_1]   // the address is the instruction pointer plus the offset to the lookup table (rip-relative addressing)
  mov rax,                    // move into rax
      qword ptr [rcx + 8*rax] // a pointer's worth of bytes (8) from the address given by `rcx + 8*rax`
  ret

.L__unnamed_1:
  .asciz "\000\000..."        //a string with 8188 characters; each \000 is a byte, so there are 2047 total bytes here.

Grouping the .L__unnamed_1 lookup table data into chunks of 8 bytes, the 65th and 66th rows (counting from 0) are:

\001\000\000\000\000\000\000\000
\002\000\000\000\000\000\000\000

which are the values 1 and 2 in 64-bit little-endian.
Since ‘A’ is 65 and should map to 1, this must be how Rust has compiled the lookup table into, uh, assembly.
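That byte layout can be checked from Rust directly: the table entries are u64 values stored little-endian, so 1 and 2 serialize to exactly the 8-byte rows quoted above.

```rust
// One 8-byte row of the lookup table, as it appears in the .asciz data.
fn row(v: u64) -> [u8; 8] {
    v.to_le_bytes()
}
```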

A few things I don’t quite follow:

  • What’s the point of the lea instruction? Why can’t the lookup just be mov rax, qword ptr [.L__unnamed_1 + 8*rax]?

  • Putting that aside, why does rip need to be involved in the lea? Isn’t .L__unnamed_1 a fixed address?

  • Why is the lookup table data only 2047 bytes total? In rust, it’s [u64; 256], so shouldn’t that be 8 * 256 = 2048 bytes?

    • Answer: the .asciz pads the string with a final zero byte. It’d be 2048 bytes if the .ascii directive had been used. At some point rust or llvm lost the metadata about what the lookup table was, because if it knew it was bytes it probably would have used the .quad directive. (Via conversation with a friend.)
  • This is just an assembly notation question: the non-zero bytes of the third lookup table (which goes up to 8) look like

\001\000\000\000\000\000\000\000
\002\000\000\000\000\000\000\000
\003\000\000\000\000\000\000\000
\004\000\000\000\000\000\000\000
\005\000\000\000\000\000\000\000
\006\000\000\000\000\000\000\000
\007\000\000\000\000\000\000\000
\b\000\000\000\000\000\000\000

What’s the deal with that last line?
I’d have expected it to be \010 (if this were octal) or \008 (if this were hex).

on caching

@farnbams: For the case of the alphabet size = 2, it’s not clear how lookup (which loads from a table) could ever be faster than match (which just does arithmetic without any jumps or memory reads).
I wouldn’t expect reading from cache to be faster than xor, cmp, add, and mov.

But lookup is faster for every “inlined mix” and the 4 and 8 sized “mix with rotate” tasks.
Presumably inlining lookup allows for further optimizations that aren’t possible with the match, but I’d need to inspect the generated assembly to test that assumption.
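One way to test that assumption without reading all the assembly would be to pin inlining with attributes and re-run the benchmark (a sketch; `lookup2` stands in for whichever function is under test):

```rust
static LUT2: [u64; 256] = {
    let mut t = [0u64; 256];
    t[b'A' as usize] = 1;
    t[b'B' as usize] = 2;
    t
};

// Kept outlined: the caller sees only an opaque call and can't
// specialize the surrounding loop around the body.
#[inline(never)]
fn lookup2_outlined(c: u8) -> u64 {
    LUT2[c as usize]
}

// Inlined: the body is visible at every call site, so the surrounding
// loop can be unrolled, vectorized, or constant-folded around it.
#[inline(always)]
fn lookup2_inlined(c: u8) -> u64 {
    LUT2[c as usize]
}
```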

on branch prediction

@vitalyd: From what I can tell, there isn’t any branching going on in the match.
The match2 has no jump or anything.
The match4 has one jump:

example::match4:
  add dil, -65
  cmp dil, 3
  ja .LBB2_2
  movzx eax, dil
  add rax, 1
  ret
.LBB2_2:
  xor eax, eax
  ret

but it only happens when the input data isn’t one of 'A', 'B', 'C', 'D', and that never happens in the synthetic generated data.
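For reference, the `add dil, -65` / `cmp dil, 3` / `ja` sequence corresponds to this source-level rewrite (a sketch of what LLVM derived, not code from the repo):

```rust
fn match4(c: u8) -> u64 {
    match c {
        b'A' => 1,
        b'B' => 2,
        b'C' => 3,
        b'D' => 4,
        _ => 0,
    }
}

// Source-level equivalent of the assembly above: one subtraction shifts
// 'A'..='D' down to 0..=3, then a single range check replaces four compares.
fn match4_lowered(c: u8) -> u64 {
    let d = c.wrapping_sub(b'A'); // add dil, -65
    if d <= 3 {                   // cmp dil, 3 / ja .LBB2_2
        d as u64 + 1              // movzx eax, dil / add rax, 1
    } else {
        0                         // xor eax, eax
    }
}
```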

My understanding is that the real “penalty” of mispredicting a branch is loading the wrong stuff from memory, and it’s slow to load the right stuff.
But in this case, nothing is being loaded from memory.


#6

Yeah, so when the alphabet size is 2, match is basically the same speed as the lookup table.

I agree that match4 only has a single branch, and it should be perfectly predicted. And the performance seems to improve on your CPU when the alphabet size grows and the function is kept outlined. One possible explanation could be that inlining causes the overall instruction mix to have some microarchitectural hazards: execution port contention, data dependency stalls, store forwarding stalls, or a bunch of other possible issues :slight_smile: . If you run your benchmarks under something like perf (Linux), you can get it to print CPU counters (e.g. instructions-per-cycle ratio, branch hit rates, cache hit rates); in addition, you can request instruction attribution for these things, which can give you a more detailed breakdown.

The penalty is mostly due to killing the speculative execution pipeline, throwing away any work that it’s done, and starting over from the point where the true target has been resolved. How much work that is exactly depends on circumstances and the processor’s pipeline depth.


#7

Your answer is correct. It is about position-independent executables (PIE).

Your answer is also correct. This is just an optimization to save one byte (it seems like not much, but consider doing it for every string in your executable: you might save a few hundred bytes, which is a lot!).

It can be both; it’s string notation, and you can even specify escape sequences like \n (which should be very familiar). \b in this case stands for backspace (0x08).
You can specify it as octal (\010), as hex (\x08), or as the escape sequence (\b). For a further explanation see the GNU assembler documentation.


#8

I have no idea what any of this means, but I will generate some more assembly, look at perf and check things out! Thanks!


#9

I haven’t looked at the generated code in your example, but in my experience compilers will often keep an outlined implementation branchless if it has to return some numerical value, but will choose to insert branches into an inlined implementation to facilitate further optimizations on the returned value - in your case, probably just returning rotated-at-compile-time values.

For the larger matches, the compiler may indeed do this, and since you give a random sequence of characters to the benchmark the branches aren’t predictable and this slows you down.
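To illustrate the "rotated-at-compile-time values" idea: if the benchmark kernel looks something like the sketch below (this is a hypothetical kernel, not code from the repo), then inlining the match lets the compiler fold the rotate into each constant arm instead of rotating at run time.

```rust
// Hypothetical "mix with rotate" kernel; the real one is in the linked repo.
fn mix(data: &[u8]) -> u64 {
    let mut acc = 0u64;
    for &c in data {
        let v = match c {
            b'A' => 1u64,
            b'B' => 2,
            _ => 0,
        };
        // With the match inlined, each arm's constant can be rotated at
        // compile time (1 -> 128, 2 -> 256), removing the runtime rotate.
        acc ^= v.rotate_left(7);
    }
    acc
}
```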


#10

Just FYI: I dug a bit more into this, looking into both LLVM IR and more generated assembly: https://kevinlynagh.com/notes/match-vs-lookup

The conclusion I came to is that match generates a few more instructions than a lookup, which leads to reduced loop unrolling (as well as the overhead of, well, more instructions).

Left unresolved are:

  • Why Rust / LLVM isn’t turning the match into a lookup table (I’d be open to pairing on this via a Skype call if anyone is interested)

  • Why lazy_static causes some spooky action at a distance and makes unrelated code go faster. Would love if anyone wants to try and reproduce on their own hardware.


#11

Of course they are, because you don’t need to go through the sync::Once thing. Also, you don’t have to write all 256 values (if you use a nightly compiler), just use the version I used for godbolt (Playground).


#12

Perhaps Criterion doesn’t sufficiently isolate the functions under test?

Criterion doesn’t do that for you automatically; you need test::black_box for that, or the criterion::black_box hack if you’re on stable Rust.
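For the record, a sketch of what that looks like (at the time of this thread `test::black_box` was nightly-only; `std::hint::black_box` has since been stabilized, in Rust 1.66):

```rust
use std::hint::black_box;

fn match2(c: u8) -> u64 {
    match c {
        b'A' => 1,
        b'B' => 2,
        _ => 0,
    }
}

// black_box on the input stops the compiler from constant-folding the call;
// black_box on the output stops it from deleting the call entirely.
fn one_iteration() -> u64 {
    let input = black_box(b'A');
    black_box(match2(input))
}
```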