Bytecode interpreter, match, cranelift/codegen

Suppose we are writing a bytecode interpreter in Rust. In the innermost loop, we need to dispatch and execute instructions. Are the only two options:

  1. a giant match statement, executed per instruction

  2. codegen, either cranelift, wat/wasm32, or something else

It seems that (1) is very inefficient -- we screw up branch prediction on every instruction executed, despite all the instructions already being laid out in a Vec -- while (2) is quite heavyweight / not arch-portable, and interacting with Rust types (say, Rc, Vec, ...) from generated code might be complicated.

Are these the only two options? Is there a third option?

What do you mean by "screw up branch prediction"? If you're writing an interpreter, your bytecode is not being executed by the processor, so it isn't being branch-predicted; it should just live in the data cache. Unless you implement your own branch predictor. (Or do you mean something else?)

1 Like

Imagine something like this:

reg1 = reg2 + reg3
reg4 = reg1 * reg2
reg5 = reg1 - reg4
...

If this were x86_64 machine code, there would be no branch prediction involved. We would just execute a bunch of instructions one after another.

If this is bytecode in some VM we have, there is a giant match statement that is executed before every instruction, and the CPU, afaik, can't predict which arm the match is going to take -- even though we do know what the match is going to do, because all the instructions are already laid out in the Vec.
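
For concreteness, here is a minimal sketch of option (1), using a made-up instruction set that mirrors the example above (the enum, names, and register file are my own, not from any particular VM):

#[derive(Clone, Copy)]
enum Instr {
    Add { dst: usize, a: usize, b: usize }, // reg[dst] = reg[a] + reg[b]
    Mul { dst: usize, a: usize, b: usize },
    Sub { dst: usize, a: usize, b: usize },
    Halt,
}

fn run(code: &[Instr], regs: &mut [i64; 16]) {
    let mut pc = 0;
    loop {
        // Every executed instruction funnels through this single match.
        match code[pc] {
            Instr::Add { dst, a, b } => regs[dst] = regs[a] + regs[b],
            Instr::Mul { dst, a, b } => regs[dst] = regs[a] * regs[b],
            Instr::Sub { dst, a, b } => regs[dst] = regs[a] - regs[b],
            Instr::Halt => return,
        }
        pc += 1;
    }
}

fn main() {
    let mut regs = [0i64; 16];
    regs[2] = 3;
    regs[3] = 4;
    let code = [
        Instr::Add { dst: 1, a: 2, b: 3 }, // reg1 = reg2 + reg3
        Instr::Mul { dst: 4, a: 1, b: 2 }, // reg4 = reg1 * reg2
        Instr::Sub { dst: 5, a: 1, b: 4 }, // reg5 = reg1 - reg4
        Instr::Halt,
    ];
    run(&code, &mut regs);
    println!("reg5 = {}", regs[5]); // 7 - 21 = -14
}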

If you write any sort of interpreter you will pretty much inevitably end up with a hot indirect branch, no matter how you structure it. There are various microoptimizations that might help, but none of them completely remove this branch.

Note that modern processors are very good at predicting indirect branches, so nothing is really getting "screwed up" here.

3 Likes

Really? How true is this for bytecode interpreters? I was under the (possibly outdated) impression that modern branch prediction works roughly like this:

  • keep track of which way the last N predictions went
  • stuff this history into a neural network to predict which way the next one goes

this works great if we have something like:

  for(...)
  while(...)

but it seems like it would work very poorly if the last N executed instructions are supposed to predict what the next executed instruction will be (in the case of a match per instruction executed by a bytecode interpreter)

Unless you're compiling your bytecode into something the CPU executes natively, there's little you can do about that though. That's one of the many reasons that interpreters aren't as performant as native code.

Anyway, it is possible that the CPU sees that you are branching on data read from your bytecode Vec and begins to predict which branches you take based on that data. It's not so far-fetched. (The reliability of this would likely depend on the uniformity of your bytecode, which depends on how you design it and what the program is doing.)

4 Likes

I remember reading an insightful paper about this, I think it was Branch Prediction and the Performance of Interpreters - Don't Trust Folklore.

4 Likes

Take a look at the conversation around the become keyword, which is intended for exactly this.

Link? Googling 'rust become' returns all types of noise.

RFC 1888 is the open one for guaranteed TCO (become).

Edit: Err, it's closed. It's where recent discussion has been, anyway.

I'm sorry, I am missing something really fundamental.

What is the relationship between TailCallOptimization and eliminating the match EnumInstr { ... } dispatch in the middle of a bytecode interpreter? My interpreter is not blowing up the stack; oftentimes, there's not even a function call (i.e. I can #[inline(always)] all the functions called in the match arms).

I don't see the relation between TCO and eliminating the match statement.

No idea. Maybe it's not what @riking was actually referring to, but TCO is what become was reserved for.

Here's a blog post talking about how musttail / become makes interpreters efficient:

Instead of loop { match { ... } }, you write

fn dispatch(&mut self, ..) {
  match {
    ...
      become dispatch(self, ..);
    ...
  }
}
5 Likes

I tried reading the article. I do not understand how it helps, for the following reasons:

  1. x86_64 has a deep pipeline: if the CPU does not know the next N instrs to execute, performance suffers; if the CPU guesses wrong, performance suffers

  2. the functions in the protobuf parsing look more complicated than a typical low-level VM bytecode instr

  3. a low-level VM bytecode instr is not going to have register-spilling issues

  4. low-level VM bytecode (unlike the protobuf instrs) would often be just 1 real instr per VM instr, which makes it difficult to keep the pipeline full

What am I misunderstanding?

@riking: I understand that the article you linked claims that TCO helps with the interpreter's inner-loop match statement.

I don't understand how the argument works.

Can you explain it in your own words here?

@riking: Can you explain why this helps with branch prediction? The linked article makes little sense to me.

The blog post doesn't say that it helps with branch prediction. It helps with register allocation. It gives the following reasons:

The larger a function is, and the more complex and connected its control flow, the harder it is for the compiler’s register allocator to keep the most important data in registers.
When fast paths and slow paths are intermixed in the same function, the presence of the slow paths compromises the code quality of the fast paths.

tl;dr: it tricks the register allocator into generating more efficient code for the fast path.
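
As a rough illustration of that point (my own sketch, not code from the blog post): if the rare overflow case of an add handler is outlined into a #[cold], never-inlined function, the hot path stays tiny, so the register allocator has less live state to juggle in the common case.

#[cold]
#[inline(never)]
fn add_overflow_slow_path(a: i64, b: i64) -> i64 {
    // Hypothetical slow path: saturate, raise a VM error, promote to bignum, ...
    a.saturating_add(b)
}

#[inline(always)]
fn op_add(a: i64, b: i64) -> i64 {
    match a.checked_add(b) {
        Some(v) => v,                         // fast path, stays in registers
        None => add_overflow_slow_path(a, b), // rare path, compiled out of line
    }
}

fn main() {
    println!("{}", op_add(2, 3));
    println!("{}", op_add(i64::MAX, 1));
}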

It does help with branch prediction if the dispatch function is inlined (but the branch targets outlined, so each match arm is of the form State::Foo => become state_foo(self, ...)). The branch predictor can then predict the next target for each current state independently, because the jump to the next state is duplicated for each current state. loop { match { ... } } results in a single dispatch jump shared by all possible current states, which prevents the branch predictor from seeing that, when you are in one state, you are more likely to jump to a certain next state than when you are in another state.
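
For illustration, here is a rough sketch of that shape in today's Rust (my own construction, not from the linked post): the handlers are outlined, the dispatch is tiny and #[inline(always)], so each handler ends up carrying its own copy of the dispatch branch. Without become these are ordinary calls, so the stack grows by one frame per executed instruction; become is exactly what would turn them into guaranteed tail calls.

struct Vm {
    code: Vec<u8>,
    pc: usize,
    acc: i64,
}

// Tiny dispatcher, inlined into every handler, so each handler gets its own
// copy of this branch rather than sharing one central dispatch point.
#[inline(always)]
fn dispatch(vm: &mut Vm) {
    let op = vm.code[vm.pc];
    vm.pc += 1;
    match op {
        1 => op_inc(vm),
        2 => op_double(vm),
        _ => op_halt(vm),
    }
}

#[inline(never)]
fn op_halt(_vm: &mut Vm) {}

#[inline(never)]
fn op_inc(vm: &mut Vm) {
    vm.acc += 1;
    dispatch(vm); // `become dispatch(vm)` would make this a guaranteed tail call
}

#[inline(never)]
fn op_double(vm: &mut Vm) {
    vm.acc *= 2;
    dispatch(vm);
}

fn main() {
    let mut vm = Vm { code: vec![1, 1, 2, 0], pc: 0, acc: 0 };
    dispatch(&mut vm);
    println!("acc = {}", vm.acc); // ((0 + 1) + 1) * 2 = 4
}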

3 Likes

I think this is essential to understanding why it does not help here. If there is no fast path and slow path, branch prediction doesn't help. Interpreting some slice of user-generated bytes into a series of instructions is unpredictable. The more branches there are, the less likely it is for the branch predictor to guess correctly.

JIT compilers like Java HotSpot™ compile the JVM bytecode into machine code depending on heuristics. This skips the translation layer completely, resulting in possible code-execution speedups by a factor of 10 or more. The optimization potential there is huge, which is why it's worth investing time into that first, rather than into replacing other algorithms.

Thanks for clarifying this. I (incorrectly) expected the blog post to explain how become solved the problem of branch prediction.

Yes, I think this is the heart of the issue.

1 Like

The computed goto technique seems relevant here, though it's described there as a GCC extension, so I'm not sure how applicable it is to Rust. It helps branch prediction be more effective while processing the instructions, and is used by the Python, Ruby (YARV) and Dalvik VMs.

I first heard about this while reading about Wren's performance:

Using computed gotos gives you a separate branch point at the end of each instruction. Each gets its own branch prediction, which often succeeds since some instruction pairs are more common than others. In my rough testing, this makes a 5-10% performance difference.

Edit: I see this technique is already covered by the previously linked paper, under the name "Jump threading".

2 Likes