How can I approach the performance of C interpreter that uses computed gotos?

dobenour · June 18, 2016, 2:31am

I am working on a Scheme implementation written in Rust. This is mostly for educational purposes, but I want the interpreter to be able to be embedded in Rust programs to provide scripting.

Interpreters written in C often use computed gotos for performance; this can be as much as a factor of 2 I believe. Can Rust achieve similar performance?

eefriedman · June 18, 2016, 3:46am

NO EXECUTE! September 3, 2008 has some interesting tricks for writing fast interpreter loops. That said, I'm not sure it matters for a scheme interpreter; allocating memory is a lot more expensive than a branch misprediction.

astocko · June 18, 2016, 1:45pm

While not necessarily answering your question, see http://llvm.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf for the recent switch lowering improvements of LLVM which I found interesting.

Rust doesn't have computed gotos. I'm still learning, but here are the options I believe you have:

~~Compiler plugin along with a LLVM pass~~ to generate your computed gotos. See indirectbr and Address of Label and Indirect Branches. Ex: llvm-pass. (Edit: doesn't seem possible.)
Use the inline assembler

The LuaJIT source code is very interesting. The VM is built using a Lua based preprocessor to generate the hand tuned assembler code for the bytecode interpreter. See buildvm.

eddyb · June 18, 2016, 5:14pm

There is one solution which works in Rust right now: indirect tail calls are jumps, i.e. goto. Example (playpen):

pub fn increment(data: &mut u32, next: fn(&mut u32) -> usize) -> usize {
    *data += 1;
    next(data)
}

In Release mode, this compiles to the following LLVM IR (note the tail call) and assembly:

define i64 @_ZN8rust_out9increment17h4074d1270a8256f5E(i32* dereferenceable(4), i64 (i32*)* nocapture) unnamed_addr #0 {
entry-block:
  %2 = load i32, i32* %0, align 4
  %3 = add i32 %2, 1
  store i32 %3, i32* %0, align 4
  %4 = tail call i64 %1(i32* nonnull dereferenceable(4) %0)
  ret i64 %4
}

_ZN8rust_out9increment17h4074d1270a8256f5E:
	inc	dword ptr [rdi]
	jmp	rsi

I've lost the source since it was just me playing around on playpen, but it is possible to build an interpreter like this, with a function for each bytecode instruction or whatever have you.

There are some caveats to keep in mind:

to get tail calls, you can't have any variables or temporaries destructors on the stack, that have to run after the call at the end of your function
you can't pass around types larger than a pointer (or twice a pointer, if you switch to C ABIs) because they might get a copy on the stack that LLVM cannot always eliminate, but you can pass around pointers to a common state structure or something similar
you need to keep all of your arguments in registers, on x64/linux, you get a maximum of 6 pointer/integer arguments, and the jmp becomes a call with 7 arguments; this might be impossible on x86, or you might need to use a different calling convention

Generally, if you keep an eye on the resulting assembly, you can get pretty good results, but it's a bit of an art.

EDIT: Tiffany found the experiment we did a while back: Rust Playground.

tkaitchuck · June 4, 2020, 3:00am

Today at -O3 a match statement where the branches call methods that are inlined will be compiled into a computed jump. Inlining can be forced with #[inline(always)].

Topic		Replies	Views
Goto in Rust. For tail call optimization	4	1804	December 14, 2019
Bytecode interpreter, match, cranelift/codegen	20	1573	September 5, 2021
How hard would it be to build a Rust transpiler that translates Rust to C? help	32	6254	April 24, 2023
I want to execute Rust code without compiling (like JavaScript, Python), is there an easy way to do that? help	5	643	August 20, 2020
Working on a Scheme interpreter written in Rust announcements	4	2517	January 12, 2023

How can I approach the performance of C interpreter that uses computed gotos?

Related topics