How can I approach the performance of C interpreter that uses computed gotos?

I am working on a Scheme implementation written in Rust. This is mostly for educational purposes, but I want the interpreter to be able to be embedded in Rust programs to provide scripting.

Interpreters written in C often use computed gotos for performance; this can be as much as a factor of 2 I believe. Can Rust achieve similar performance?

NO EXECUTE! September 3, 2008 has some interesting tricks for writing fast interpreter loops. That said, I'm not sure it matters for a scheme interpreter; allocating memory is a lot more expensive than a branch misprediction.

1 Like

While not necessarily answering your question, see http://llvm.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf for the recent switch lowering improvements of LLVM which I found interesting.

Rust doesn't have computed gotos. I'm still learning, but here are the options I believe you have:

The LuaJIT source code is very interesting. The VM is built using a Lua based preprocessor to generate the hand tuned assembler code for the bytecode interpreter. See buildvm.

There is one solution which works in Rust right now: indirect tail calls are jumps, i.e. goto. Example (playpen):

pub fn increment(data: &mut u32, next: fn(&mut u32) -> usize) -> usize {
    *data += 1;
    next(data)
}

In Release mode, this compiles to the following LLVM IR (note the tail call) and assembly:

define i64 @_ZN8rust_out9increment17h4074d1270a8256f5E(i32* dereferenceable(4), i64 (i32*)* nocapture) unnamed_addr #0 {
entry-block:
  %2 = load i32, i32* %0, align 4
  %3 = add i32 %2, 1
  store i32 %3, i32* %0, align 4
  %4 = tail call i64 %1(i32* nonnull dereferenceable(4) %0)
  ret i64 %4
}
_ZN8rust_out9increment17h4074d1270a8256f5E:
	inc	dword ptr [rdi]
	jmp	rsi

I've lost the source since it was just me playing around on playpen, but it is possible to build an interpreter like this, with a function for each bytecode instruction or whatever have you.

There are some caveats to keep in mind:

  • to get tail calls, you can't have any variables or temporaries destructors on the stack, that have to run after the call at the end of your function
  • you can't pass around types larger than a pointer (or twice a pointer, if you switch to C ABIs) because they might get a copy on the stack that LLVM cannot always eliminate, but you can pass around pointers to a common state structure or something similar
  • you need to keep all of your arguments in registers, on x64/linux, you get a maximum of 6 pointer/integer arguments, and the jmp becomes a call with 7 arguments; this might be impossible on x86, or you might need to use a different calling convention

Generally, if you keep an eye on the resulting assembly, you can get pretty good results, but it's a bit of an art.

EDIT: Tiffany found the experiment we did a while back: Rust Playground.

5 Likes

Today at -O3 a match statement where the branches call methods that are inlined will be compiled into a computed jump. Inlining can be forced with #[inline(always)].

2 Likes