Idiomatic way to represent bytecode

Hello, I am working through Crafting Interpreters and I’m re-implementing the interpreter in Rust. If you’re curious, here is the repo.

After implementing a tree-walk interpreter, it’s now time to compile Lox to bytecode and run it in a VM. In the book, a chunk of instructions is represented by

typedef struct {
  uint8_t* code;
  ValueArray constants; // This is a pool of constant values 
} Chunk;

which has a void writeChunk(Chunk* chunk, uint8_t byte) function used to append a byte to it. For now it has only two instructions:

  • OP_RETURN and
  • OP_CONSTANT, which is always followed by another byte representing the index of the constant in the constant pool.

In Rust, I’d represent the opcodes I have with an enum:

enum OpCode {
    Return,
    Constant(u8),
}

And I’d represent a chunk with:

struct Chunk {
    code: Vec<OpCode>,
    constants: Vec<f64>,
}

This would work, but code would take more space than necessary, because the size of OpCode is 2 bytes (one for the discriminant plus one for Constant’s payload), even though Return needs only one.

Since I’m working on a toy project this is not an issue at all, but I was wondering what would be the most idiomatic approach to have a nice API that uses OpCode without ‘wasting’ space.

I can think of a solution where I change Chunk to be closer to the C struct:

struct Chunk {
    code: Vec<u8>,
    constants: Vec<f64>,
}

impl Chunk {
    fn write(&mut self, op: OpCode) {
        // TODO: Convert op to bytecode and append it to the vector
    }
    // TODO: add a function that iterates over the chunk and returns
    // something implementing Iterator with Item = OpCode
}

Would that be idiomatic Rust or can I do something better?


I think both are idiomatic, just tailored/optimized differently. Note that the cost of the discriminant storage will amortize itself out somewhat once you start adding more variants, but of course it won’t beat raw byte storage. And you can potentially start hyper-optimizing by packing opcodes into bits of the byte stream to get more information density 🙂.

The “right” one would depend on how the rest of the VM looks and what it needs to optimize for.


@mariosangiorgio

As I understand it, you want to work with code: Vec<u8>, interpreting the code byte by byte, taking two bytes if the current value is Constant and one byte if it’s Return?

Then I suppose you can look at String, which internally uses UTF-8; the concept is similar to what you want.
It provides a char iterator to work with characters,
so you may hold the data in a Vec<u8> and give the user an iterator that returns OpCode values.

Is there any way to represent this bytecode compactly while still dispatching on it efficiently? As far as I know, the Rust compiler optimizes enum matches better when all of the variants are handled. At least it is capable of that, because the language reference considers invalid values for enums to be Undefined Behaviour. I can use a [u8] for the bytecode and handle every opcode by hand, but then I have to cover the unused integers with a _ => (), which is unnecessary for me. Maybe using unreachable_unchecked will make the compiler ignore the unused branches.

Also, Rust doesn't have goto and I was not able to find any way of using computed gotos for dispatching. Maybe Rust is not a good language for making fast interpreters yet.

Any interpreter that is worth the "fast" qualifier nowadays wouldn't be a bytecode interpreter, it would be a JIT, one that generates native code and optimizes it on the fly. That is possible (and approximately equally painful) in any sufficiently low-level language that provides access to raw memory and something like mprotect() – Rust is no exception.

Trying to use "computed GOTOs" and "threaded interpreters" and stuff like that is a relic of the past. It won't result in significant speedups. (I guess an interpreted language could get much more speedup by being typed and thus not having to reference count/GC primitive types all the time.)

If you want to write a toy bytecode interpreter for learning purposes, that's fine, and I'd recommend you go with the cleanest, highest-level approach which is probably enums. Don't bother with micro-optimizations, they are not going to pay off. If you need a fast interpreted language, use a modern JIT-ing JavaScript engine.
