Bytecode interpreter, match, cranelift/codegen

Suppose we are writing a bytecode interpreter in Rust. In the innermost loop, we need to dispatch and execute instructions. Are the only two options:

  1. a giant match statement, executed per instruction

  2. codegen, either cranelift, wat/wasm32, or something else

It seems that (1) is very inefficient -- we screw up branch prediction on every instruction executed, despite all the instructions already being laid out in a Vec -- while (2) is quite heavyweight / not arch-portable, and interacting with Rust types (say, Rc, Vec, ...) from generated code might be complicated.

Are these the only two options? Is there a third option?

What do you mean by "screw up branch prediction"? If you're writing an interpreter, your bytecode is not being executed by the processor, so it isn't being branch-predicted; it should just live in the data cache. Unless you implement your own branch predictor. (Or do you mean something else?)

1 Like

Imagine something like this:

reg1 = reg2 + reg3
reg4 = reg1 * reg2
reg5 = reg1 - reg4
...

If this were x86_64 machine code, there would be no branch prediction involved. We would just execute a bunch of instructions one after another.

If this is bytecode in some VM we have, there is a giant match statement that is executed before every instruction, and the CPU, afaik, can't predict which arm the match is going to take -- even though we do know what the match is going to do, because all the instructions are already laid out in the Vec.
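
For concreteness, here is a minimal sketch of option (1), using a made-up instruction set that mirrors the example above (the enum, names, and register file are my own, not from any particular VM):

#[derive(Clone, Copy)]
enum Instr {
    Add { dst: usize, a: usize, b: usize }, // reg[dst] = reg[a] + reg[b]
    Mul { dst: usize, a: usize, b: usize },
    Sub { dst: usize, a: usize, b: usize },
    Halt,
}

fn run(code: &[Instr], regs: &mut [i64; 16]) {
    let mut pc = 0;
    loop {
        // Every executed instruction funnels through this single match.
        match code[pc] {
            Instr::Add { dst, a, b } => regs[dst] = regs[a] + regs[b],
            Instr::Mul { dst, a, b } => regs[dst] = regs[a] * regs[b],
            Instr::Sub { dst, a, b } => regs[dst] = regs[a] - regs[b],
            Instr::Halt => return,
        }
        pc += 1;
    }
}

fn main() {
    let mut regs = [0i64; 16];
    regs[2] = 3;
    regs[3] = 4;
    let code = [
        Instr::Add { dst: 1, a: 2, b: 3 }, // reg1 = reg2 + reg3
        Instr::Mul { dst: 4, a: 1, b: 2 }, // reg4 = reg1 * reg2
        Instr::Sub { dst: 5, a: 1, b: 4 }, // reg5 = reg1 - reg4
        Instr::Halt,
    ];
    run(&code, &mut regs);
    println!("reg5 = {}", regs[5]); // 7 - 21 = -14
}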

If you write any sort of interpreter you will pretty much inevitably end up with a hot indirect branch, no matter how you structure it. There are various microoptimizations that might help, but none of them completely remove this branch.

Note that modern processors are very good at predicting indirect branches, so nothing is really getting "screwed up" here.

3 Likes

Really? How true is this for bytecode interpreters? I was under the (possibly outdated) impression that modern branch prediction works roughly like this:

  • keep track of which way the last N predictions went
  • stuff this history into a neural network to predict which way the next one goes

this works great if we have something like:

  for(...)
  while(...)

but it seems like it would work very poorly if the last N executed instructions are supposed to predict what the next executed instruction will be (in the case of a match per instruction executed by a bytecode interpreter)

Unless you're compiling your bytecode into something the CPU executes natively, there's little you can do about that though. That's one of the many reasons that interpreters aren't as performant as native code.

Anyway, it is possible that the CPU sees that you are branching on data read from your bytecode Vec and begins to predict which branches you take based on that data. It's not so far-fetched. (The reliability of this would likely depend on the uniformity of your bytecode, which depends on how you design it and what the program is doing.)

4 Likes

I remember reading an insightful paper about this, I think it was Branch Prediction and the Performance of Interpreters - Don't Trust Folklore.

4 Likes

Take a look at the conversation around the become keyword, which is intended for exactly this.

Link? Googling 'rust become' returns all types of noise.

RFC 1888 is the open one for guaranteed TCO (become).

Edit: Err, it's closed. It's where recent discussion has been, anyway.

I'm sorry, I am missing something really fundamental.

What is the relationship between TailCallOptimization and eliminating the match EnumInstr { ... } dispatch in the middle of a bytecode interpreter? My interpreter is not blowing up the stack; oftentimes, there's not even a function call (i.e. I can #[inline(always)] all the functions called in the match arms).

I don't see the relation between TCO and eliminating the match statement.

No idea. Maybe it's not what @riking was actually referring to, but TCO is what become was reserved for.

Here's a blog post talking about how musttail / become makes interpreters efficient:

Instead of loop { match { ... } }, you write

fn dispatch(&mut self, ..) {
  match {
    ...
      become dispatch(self, ..);
    ...
  }
}
5 Likes

I tried reading the article. I do not understand how it helps, for the following reasons:

  1. x86_64 has a deep pipeline: if the CPU does not know the next N instrs to execute, performance suffers; if the CPU guesses wrong, performance suffers

  2. the functions in the protobuf parsing look more complicated than a typical low-level VM bytecode instr

  3. a low-level VM bytecode instr is not going to have register-spilling issues

  4. low-level VM bytecode (unlike the protobuf instrs) would often be just 1 real instr per VM instr, which makes it difficult to keep the pipeline full

What am I misunderstanding?

@riking: I understand that the article you linked claims that TCO helps with the interpreter's inner-loop match statement.

I don't understand how the argument works.

Can you explain it in your own words here?

@riking: Can you explain why this helps with branch prediction? The linked article makes little sense to me.

The blog post doesn't say that it helps with branch prediction. It helps with register allocation. It gives the following reasons:

The larger a function is, and the more complex and connected its control flow, the harder it is for the compiler’s register allocator to keep the most important data in registers.
When fast paths and slow paths are intermixed in the same function, the presence of the slow paths compromises the code quality of the fast paths.

tl;dr: it tricks the register allocator into generating more efficient code for the fast path.
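
As a rough illustration of that point (my own sketch, not code from the blog post): if the rare overflow case of an add handler is outlined into a #[cold], never-inlined function, the hot path stays tiny, so the register allocator has less live state to juggle in the common case.

#[cold]
#[inline(never)]
fn add_overflow_slow_path(a: i64, b: i64) -> i64 {
    // Hypothetical slow path: saturate, raise a VM error, promote to bignum, ...
    a.saturating_add(b)
}

#[inline(always)]
fn op_add(a: i64, b: i64) -> i64 {
    match a.checked_add(b) {
        Some(v) => v,                         // fast path, stays in registers
        None => add_overflow_slow_path(a, b), // rare path, compiled out of line
    }
}

fn main() {
    println!("{}", op_add(2, 3));
    println!("{}", op_add(i64::MAX, 1));
}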

It does help with branch prediction if the dispatch function is inlined (but the branch targets outlined, so each match arm is of the form State::Foo => become state_foo(self, ...)). The branch predictor can then predict the next target for each current state independently, because the jump to the next state is duplicated for each current state. loop { match { ... } } results in a single dispatch jump shared by all possible current states, which prevents the branch predictor from seeing that, when you are in one state, you are more likely to jump to a certain next state than when you are in another state.
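
For illustration, here is a rough sketch of that shape in today's Rust (my own construction, not from the linked post): the handlers are outlined, the dispatch is tiny and #[inline(always)], so each handler ends up carrying its own copy of the dispatch branch. Without become these are ordinary calls, so the stack grows by one frame per executed instruction; become is exactly what would turn them into guaranteed tail calls.

struct Vm {
    code: Vec<u8>,
    pc: usize,
    acc: i64,
}

// Tiny dispatcher, inlined into every handler, so each handler gets its own
// copy of this branch rather than sharing one central dispatch point.
#[inline(always)]
fn dispatch(vm: &mut Vm) {
    let op = vm.code[vm.pc];
    vm.pc += 1;
    match op {
        1 => op_inc(vm),
        2 => op_double(vm),
        _ => op_halt(vm),
    }
}

#[inline(never)]
fn op_halt(_vm: &mut Vm) {}

#[inline(never)]
fn op_inc(vm: &mut Vm) {
    vm.acc += 1;
    dispatch(vm); // `become dispatch(vm)` would make this a guaranteed tail call
}

#[inline(never)]
fn op_double(vm: &mut Vm) {
    vm.acc *= 2;
    dispatch(vm);
}

fn main() {
    let mut vm = Vm { code: vec![1, 1, 2, 0], pc: 0, acc: 0 };
    dispatch(&mut vm);
    println!("acc = {}", vm.acc); // ((0 + 1) + 1) * 2 = 4
}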

3 Likes

I think this is essential to understanding why it does not help here. If there is no fast path and slow path, branch prediction doesn't help. Interpreting some slice of user-generated bytes into a series of instructions is unpredictable. The more branches there are, the less likely it is for the branch predictor to guess correctly.

JIT compilers like Java HotSpot™ compile the JVM bytecode into machine code depending on heuristics. This skips the translation layer completely, resulting in possible code-execution speedups by a factor of 10 or more. The optimization potential there is huge, which is why it's worth investing time into that first, rather than into replacing other algorithms.

Thanks for clarifying this. I (incorrectly) expected the blog post to explain how become solved the problem of branch prediction.

Yes, I think this is the heart of the issue.

1 Like

The computed goto technique seems relevant here, though it's described there as a GCC extension, so I'm not sure how applicable it is to Rust. It helps branch prediction be more effective while processing the instructions, and is used by the Python, Ruby (YARV) and Dalvik VMs.

I first heard about this while reading about Wren's performance:

Using computed gotos gives you a separate branch point at the end of each instruction. Each gets its own branch prediction, which often succeeds since some instruction pairs are more common than others. In my rough testing, this makes a 5-10% performance difference.

Edit: I see this technique is already covered by the previously linked paper, under the name "Jump threading".

2 Likes