Self-implemented Lua VM is much slower than lua5.1

Hello,

I am somewhat of a beginner and hoped to learn Rust by implementing a Lua runtime in it. It works, but it is much slower than the official implementation (lua5.1). Even allowing for the unoptimised byte code (about 2 times longer than luac's output), it is still 10 times slower.

Some implementation details:

  1. I added a context type to emulate the behaviour of blocks and closures in Lua:
    pub struct Context {
        pub(crate) locals: Vec<Value>,
        pub(crate) ancestors: Vec<Rc<RefCell<Context>>>
    }
    
    As a result, variables are referenced by (offset, index), where offset is the number of times we need to walk up from the current context.
  2. I do not pre-load variables onto the stack; I load them and discard them immediately. So even though the VM is register-based, it still behaves like a stack machine, which also leads to the longer byte code.

The loop body looks like the following, where:
1. set_stack and get_stack simply set and read a value on the stack
2. set_local and get_local simply look up a variable in the current context or one of its ancestors

while let Some(bc) = compiled.byte_codes.get(pc) {
    pc += 1;
    match bc.clone() {
        NOP => continue,
        Halt => break,
        PushContext => self.push_context(),
        PopContext => self.pop_context(),
        LoadInt | LoadNil | LoadBool => ..,

        // I used a macro here to generate the pattern and the match expr,
        // because macros cannot expand to match arms
        bc @ (Not | BitNot | Neg | Len) => match bc {
            ...
        }

        // same here
        bc @ (And | Or | Xor | ... ) => match bc {

        }

        GetLocal(dst, i, j) => {
            let value = self.get_local(i, j)?;
            self.set_stack(dst, value);
        },
        SetLocal(i, j, src) => {
            let value = self.get_stack(src).clone();
            self.set_local(i, j, value)?;
        },
        LoadConst(dst, src) => {
            let value = compiled.get_literal(src)?.clone();
            self.set_stack(dst, value.into());
        },

        // Call, Ret, function and table opcodes omitted

        JmpNot(cond, forward) => {
            let cond: bool = self.get_stack(cond).into();
            if !cond {
                pc = (pc as isize + forward as isize) as usize;
            }
        }
        JmpWhen(cond, forward) => {
            let cond: bool = self.get_stack(cond).into();
            if cond {
                pc = (pc as isize + forward as isize) as usize;
            }
        }
        Jmp(forward) => {
            pc = (pc as isize + forward as isize) as usize;
        }
    }
}

The measured running time per match arm is as follows:

| Variant | Count | Total (s) | Average (s) |
|---|---|---|---|
| GetLocal | 14116415 | 0.58360328 | 0.0000000413421734 |
| SetLocal | 6385074 | 0.24584601 | 0.0000000385032358 |
| LoadConst | 1000000 | 0.03518509 | 0.0000000351850910 |
| LoadInt | 6385074 | 0.22127918 | 0.0000000346556947 |
| Call | 2692538 | 0.11427579 | 0.0000000424416629 |
| Add | 2346267 | 0.09794354 | 0.0000000417444153 |
| Sub | 2692536 | 0.11267531 | 0.0000000418472808 |
| Less | 3692537 | 0.13302776 | 0.0000000360261135 |
| JmpNot | 3692537 | 0.10446466 | 0.0000000282907557 |
| Jmp | 999999 | 0.02705205 | 0.0000000270520801 |
| PushContext | 2692538 | 0.17783500 | 0.0000000660473501 |
| PopContext | 2692538 | 0.14251488 | 0.0000000529295720 |
| LoadFunc | 1 | 0.00000187 | 0.0000018700000000 |
| Ret | 2692537 | 0.07430846 | 0.0000000275979353 |

And the Lua code I used to test:

function fib(n)
   if n < 2 then
      return n
   end

   local n1 = fib(n-1)
   local n2 = fib(n-2)
   return n1 + n2
end

fib(30)

local x = 1
while x < 1000000 do
   x = x + 1
end

What could be wrong with my implementation? Could it be the way I used macros in the match expr? I would be very grateful for any feedback.

Generated byte code, in case it is useful.
My own (LoadFunc(dst, len|nargs) loads the function onto the stack, with a reference to the current context; len is 13 bits, the length of the function in byte codes, and nargs is 3 bits, the number of arguments to the function):

Literals:

1000000
-------------------------------
Byte Codes: 

PushContext
LoadFunc(1, 233)
PushContext
SetLocal(0, 0, 1)
GetLocal(1, 0, 0)
LoadInt(2, 2)
Less(1, 1, 2)
JmpNot(1, 3)
GetLocal(0, 0, 0)
PopContext
Ret
GetLocal(2, 2, 3)
GetLocal(3, 0, 0)
LoadInt(4, 1)
Sub(3, 3, 4)
Call(2, 1)
SetLocal(0, 1, 2)
GetLocal(2, 2, 3)
GetLocal(3, 0, 0)
LoadInt(4, 2)
Sub(3, 3, 4)
Call(2, 1)
SetLocal(0, 2, 2)
GetLocal(0, 0, 1)
GetLocal(1, 0, 2)
Add(0, 0, 1)
PopContext
Ret
LoadNil(0)
PopContext
Ret
SetLocal(1, 3, 1)
GetLocal(0, 1, 3)
LoadInt(1, 30)
Call(0, 1)
LoadInt(1, 1)
SetLocal(0, 0, 1)
GetLocal(0, 0, 0)
LoadConst(1, 0)
Less(0, 0, 1)
JmpNot(0, 5)
GetLocal(1, 0, 0)
LoadInt(2, 1)
Add(1, 1, 2)
SetLocal(0, 0, 1)
Jmp(-9)
PopContext

luac:

main </home/olivercwy/workspace/lua/bin/test_lua/test.lua:0,0> (14 instructions at 0x55dfe59accf0)
0+ params, 2 slots, 1 upvalue, 1 local, 2 constants, 1 function
        1       [1]     VARARGPREP      0
        2       [9]     CLOSURE         0 0     ; 0x55dfe59acfb0
        3       [1]     SETTABUP        0 0 0   ; _ENV "fib"
        4       [11]    GETTABUP        0 0 0   ; _ENV "fib"
        5       [11]    LOADI           1 30
        6       [11]    CALL            0 2 1   ; 1 in 0 out
        7       [13]    LOADI           0 1
        8       [14]    LOADK           1 1     ; 1000000
        9       [14]    LT              0 1 0
        10      [14]    JMP             3       ; to 14
        11      [15]    ADDI            0 0 1
        12      [15]    MMBINI          0 1 6 0 ; __add
        13      [15]    JMP             -6      ; to 8
        14      [16]    RETURN          1 1 1   ; 0 out

function </home/olivercwy/workspace/lua/bin/test_lua/test.lua:1,9> (15 instructions at 0x55dfe59acfb0)
1 param, 4 slots, 1 upvalue, 3 locals, 1 constant, 0 functions
        1       [2]     LTI             0 2 0
        2       [2]     JMP             1       ; to 4
        3       [3]     RETURN1         0
        4       [6]     GETTABUP        1 0 0   ; _ENV "fib"
        5       [6]     ADDI            2 0 -1
        6       [6]     MMBINI          0 1 7 0 ; __sub
        7       [6]     CALL            1 2 2   ; 1 in 1 out
        8       [7]     GETTABUP        2 0 0   ; _ENV "fib"
        9       [7]     ADDI            3 0 -2
        10      [7]     MMBINI          0 2 7 0 ; __sub
        11      [7]     CALL            2 2 2   ; 1 in 1 out
        12      [8]     ADD             3 1 2
        13      [8]     MMBIN           1 2 6   ; __add
        14      [8]     RETURN1         3
        15      [9]     RETURN0

The requisite initial question: how are you running your code? Have you compiled with --release?

I believe the reference Lua interpreter doesn't do any JIT compilation of the bytecode, so you should be on even footing there.

An order of magnitude actually isn't bad compared to an established, optimized implementation, to be honest, but it does seem like more than should apply here. My immediate guess is the overhead of accessing the context through RefCell. Since interpretation is single-threaded, the interpreter should only ever "open" a single context cell at a time, so the borrow-flag bookkeeping is pure overhead. That no panic occurs confirms that your access pattern is valid.

If you want to measure how much this overhead actually costs, you could temporarily use as_ptr to unsafely access the cell contents and bypass the checks. Do not keep this; but doing it for an execution you've already confirmed is valid will give you an actual measure of what you're paying for the checks, and whether it's worth figuring out a different ownership design / access pattern that doesn't need the runtime checks.

This does seem a perfect use case for a split ownership cell, since only the interpreter should need to "open" the cells, and only one at a time; possible providers of such are qcell or cell-family.


Thanks for the reply. It is reassuring to know that I wasn't doing anything terribly wrong. I will look into qcell, but I don't think it will change that much: some profiling showed that the major overhead is in the run function (basically the while-match block). This is also observed in this post

It's well known that loop { match { ... } } performs poorly in interpreters.

A quick search found "Interpretizer: A Compiler-Independent Conversion of Switch-Based Dispatch into Threaded Code" (SpringerLink). I didn't read it, but the abstract mentions the problem; perhaps some of its citations might help.

WASM3 has a really elegant solution for this. Instead of looping and matching on the instruction, you invoke a continuation function (created as part of the initial opcode-decoding pass) to handle the next instruction; it then periodically unwinds the stack and resumes to avoid stack overflows (most functions are TCO-friendly, but a couple of opcodes like branches and loops require storing things on the stack, so overflows are still a possibility).

The benefit of this approach is that you never go through the giant switch statement, which makes your branch predictor very happy. It also makes the code a lot cleaner, because you can break opcode handling up into small fn(&mut VM) functions.

I strongly recommend reading their architecture docs for more:

I assume this is equivalent to a lookup table?

Nah, it's entirely different. You convert the list of opcodes into a list of function references; when one function finishes executing (i.e. handling its opcode), it invokes the function reference for the next opcode.

The defining feature is that you never need to look up a "random" address and jump to it, which is what happens when you use a lookup table or take the loop-and-match approach.

Another name for this technique is Continuation Passing Style.
