Idiomatic way to represent bytecode


#1

Hello, I am working through Crafting Interpreters and I’m re-implementing the interpreter in Rust. If you’re curious, here is the repo.

After implementing a tree-walk interpreter, it’s now time to compile Lox to bytecode and run it in a VM. In the book, a chunk of instructions is represented by

typedef struct {
  uint8_t* code;
  ValueArray constants; // This is a pool of constant values 
} Chunk;

which has a void writeChunk(Chunk* chunk, uint8_t byte) function used to append a byte to it. For now it has only two instructions:

  • OP_RETURN and
  • OP_CONSTANT, which is always followed by another byte representing the index of the constant in the constant pool.

In Rust, I’d represent the opcodes I have with an enum:

enum OpCode {
    Return,
    Constant(u8),
}

And I’d represent a chunk with:

struct Chunk {
    code: Vec<OpCode>,
    constants: Vec<f64>,
}

This would work, but code would take more space than necessary, because every OpCode occupies 2 bytes (one for the discriminant, one for the operand), even Return, which has no operand.

Since I’m working on a toy project this is not an issue at all, but I was wondering what would be the most idiomatic approach to have a nice API that uses OpCode without ‘wasting’ space.

I can think of a solution where I change Chunk to be closer to the C struct:

struct Chunk {
    code: Vec<u8>,
    constants: Vec<f64>,
}

impl Chunk {
    fn write(&mut self, op: OpCode) {
        // TODO: Convert op to the bytecode and add it to the vector
    }
    // TODO: add a function that iterates over chunk and returns
    // something implementing Iterator with Item=OpCode
}

Would that be idiomatic Rust or can I do something better?


#2

I think both are idiomatic, just tailored/optimized differently. Note that the discriminant storage will amortize itself out somewhat once you start adding more variants, though of course it won’t beat raw byte storage. You could also start hyper-optimizing by packing opcodes into the bits of the byte stream to get more information density.
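To put a rough number on the discriminant overhead, here is a small sketch that measures the enum sizes with std::mem::size_of. OpCode is the enum from the first post; BiggerOpCode and its extra variants are made up purely for illustration. Rust does not specify the layout of default-repr enums, so treat the result as an observation about current compilers rather than a guarantee.

```rust
use std::mem::size_of;

// The two-variant enum from the first post.
#[allow(dead_code)]
enum OpCode {
    Return,
    Constant(u8),
}

// Hypothetical larger instruction set: the extra variants are
// invented for this sketch and do not come from the book.
#[allow(dead_code)]
enum BiggerOpCode {
    Return,
    Negate,
    Add,
    Subtract,
    Constant(u8),
    GetLocal(u8),
}

/// Returns the size in bytes of (OpCode, BiggerOpCode).
fn opcode_sizes() -> (usize, usize) {
    (size_of::<OpCode>(), size_of::<BiggerOpCode>())
}
```

With current rustc, both enums should come out at 2 bytes: one for the discriminant and one for the largest payload. Adding variants does not grow the element, but operand-less instructions such as Return still pay for the unused payload byte in a Vec<OpCode>.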

The “right” one would depend on how the rest of the VM looks and what it needs to optimize for.


#3

@mariosangiorgio

As I understand it, you want to work with code: Vec<u8> and interpret it byte by byte, taking two bytes if the current value is Constant and one byte if it is Return?

Then I suppose you can look at String, which internally stores UTF-8, a concept similar to what you want.
It provides a char iterator for working with characters,
so you could hold the data in a Vec<u8> and give the user an iterator that returns enum OpCode values.
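A minimal sketch of that idea, assuming the two opcodes from the first post. The byte values OP_RETURN and OP_CONSTANT, the ops method, and the Ops iterator type are names I picked for illustration, not anything from the book or an existing crate. The iterator decodes lazily, much like str::chars does for UTF-8.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum OpCode {
    Return,
    Constant(u8),
}

// Byte values are arbitrary choices for this sketch.
const OP_RETURN: u8 = 0;
const OP_CONSTANT: u8 = 1;

#[derive(Default)]
#[allow(dead_code)]
struct Chunk {
    code: Vec<u8>,
    constants: Vec<f64>,
}

impl Chunk {
    /// Encodes `op` and appends it: one byte for Return, two for Constant.
    fn write(&mut self, op: OpCode) {
        match op {
            OpCode::Return => self.code.push(OP_RETURN),
            OpCode::Constant(index) => self.code.extend_from_slice(&[OP_CONSTANT, index]),
        }
    }

    /// Returns an iterator that decodes the raw bytes back into OpCode values.
    fn ops(&self) -> Ops<'_> {
        Ops { bytes: self.code.iter() }
    }
}

/// Hypothetical decoding iterator over a chunk's byte stream.
struct Ops<'a> {
    bytes: std::slice::Iter<'a, u8>,
}

impl<'a> Iterator for Ops<'a> {
    type Item = OpCode;

    fn next(&mut self) -> Option<OpCode> {
        match *self.bytes.next()? {
            OP_RETURN => Some(OpCode::Return),
            // Constant consumes one more byte: the constant-pool index.
            OP_CONSTANT => self.bytes.next().map(|&index| OpCode::Constant(index)),
            // Unknown byte: this sketch simply ends the iteration.
            _ => None,
        }
    }
}
```

Writing Constant(7) followed by Return stores three bytes, and collecting chunk.ops() yields the same two OpCode values again. A real VM would probably want Item = Result<OpCode, SomeError> so that a truncated or unknown instruction is reported instead of silently ending the iteration.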