Implementing struct / enums in an interpreter

Suppose we are writing an interpreter, in Rust, for a language with user definable structs. Say, a tiny mini sorta-C interpreter.

Suppose we want to support user defined structs in the interpreter.

So in particular, these "user defined structs" have structure that is NOT know at the compile time of the interpreter, but only known at the RUNTIME of the interpreter we write.

A very inefficient way to do this would be:

pub enum UserValue {
  U8(u8), I8(i8), ..., U64(u64), I64(i64),
  Compound {
    fields: HashMap<String, UserValue>

In particular, this has horrible memory locality and looking up fields requires a string hash/comparison.

A better way would be to pack these user defined structs as a Vec with some sideinfo on where the fields are located. This however, looks like a pile of unsafe on unsafe.

Question: Are there known idiomatic solutions to this problem? If so, what crates should I look at ?

I don’t see why this has to be unsafe. You could, for instance, save type-erased getter and setter functions for each field that use nom to read from/write to the appropriate buffer position. You might also want to compile to some kind of bytecode so that the field-name string comparisons happen at compile time instead of runtime.

That is a good point.

  1. I do not have a mathematical proof that it must use lots of unsafe.

  2. It is not obvious to me how to do this in a safe and efficient manner. My intuition is that there will be quite a bit of byte-munging and casting involved.

  3. I would love to be proved wrong by construction. In particular, in asking

I hope there is a fast & clean solution.

To further this intuition:

  1. In my current understanding, when Rust code interfaces with C structs via ffi, we need unsafe.

  2. In the Vec<u8> approach, we need to be able to allocate, initialize, read/write fields, and de-allocate things sorta like C-structs.

  3. Because of #2, my intuition is "lots of unsafe is required for this."

  4. I see how one could make the counter argument "well, this is different since Rust is doing reading, writing, allocation, and has more info" -- but it is not obvious to me how to do this in code (nor have I seen any crate / tutorial for doing this.)

The key difference is whether you will be storing real pointers (in any form: references, boxes, etc.) in your custom structs. If you do, you will need unsafe. If you don't, you won't.

(Pointers in the interpreted language that are actually indexes into your interpreter's storage don't count.)

Here is one way:

enum UserValue {

When you compile the code, translate field references to indexes into the Compound Vec.

1 Like

FYI - Python objects are backed by a special hashmap where obj.x is translated to obj.__dict__["x"] (plus/minus some hooks like __getattribute__), and its performance characteristics are often good enough.

Also, if all strings are interned by your runtime, your string hashing and comparisons turns into a trivial std::ptr::eq() call.

If the hashmap lookup is still too expensive, you could add a small-size optimisation which uses a Vec<(String, UserValue)> and linear search when an object only has a few of fields/methods (e.g. 20).

KISS, and all that.


The issue here is that structs are basically all heap alocated, and by Vec<u8>, I was hoping for way to store most things inline.

Vec<u8> is also heap-allocated, just like Vec<UserValue> is. I am not sure what you're saying.

This is my fault for never formally stating this earlier, but by 'efficient', I was hoping that something like a.b.c.d gets compiled, by the interpreter, into bytecode that just jumps to the right memory location in one step.

Imagine we have a:

pub struct Foo {
  x: i32,
  y: i32,
  z: f32

in a Vec<u8> rep this would just be 12 bytes. So I would prefer something where we just increase the stack pointer by 12 bytes, and use it; whereas in the approach you suggested, we would allocate a Vec.

Sorry if Vec<u8> is the wrong term -- would be a better description of this?

I think one thing we would both agree is that the way "C" lays out structs and the way your proposal lays out structs is not the same, and in the "C" approach, everything is one contiguous block of memory, whereas your approach allocates stuff on the heap.

pub struct A {
  a: i32

pub struct B {
  a: A,
  b: f32

in a "C' like approach, B is 8 contiguous bytes, in your approach, B::a would be a Vec of 1 elem on the heap

OK in that case you can't use anything like the original enum because enums have padding. Just allocate one big Vec<u8> for your stack, have a stack pointer, etc, and load values by using things like u64::from_ne_bytes (which doesn't require unsafe).

But if you're doing things like that then you're really micro-optimizing your interpreter, so it's no longer a "tiny interpreter".

This is generally not possible when a type's layout isn't known until runtime unless you have a JIT.

The reason we can turn a.b.c.d into a bit of pointer math is because we know the type of a (and every property, recursively) and that the user won't add new properties to the object later on. This is prohibited by the type system for statically typed languages like Rust, and JITs (e.g. V8 and SpiderMonkey for JavaScript) insert special type checks and de-optimisation code to make sure the pointer math is always sound.

Well... his interpreter presumably knows the type of a, assuming the language is statically typed, so it can do it. I guess every interpreter is sort of like a JIT with a virtual machine.

  1. What @tczajka said.

  2. By "interpreter" -- and again, my terminology may be loose here, I was thinking: "something that takes as input &str and produces as output Vec<BytecodeInstr>" . You might be calling this a "JIT" (though I think it's too simple to be a "JIT").

  3. "a.b.c.d" , given that we know the types A, B, C, D already can be compiled down to

addr(a) + (offset of b in A) + (offset of c in B) + (offset of c in D)

the addr(a) part we will only know when interpreting the BytecodeInstr

The (offset of b in A) + (offset of c in B) + (offset of c in D) part is a constant we can compute at the "&str -> Vec<BytecodeInstr>" stage

I'd call that a bytecode compiler, maybe more commonly abbreviated to "byte compiler". E.g., GNU Emacs has "byte compilation" of its scripting language (albeit now superseded by native compilation):

I guess your thinking goes like this: “real JIT have to create a machine code and I only just want tiny bytecode which I would then interpret”. But the truth is: 90% of work writing JIT which compiles to machine code is creation of memory model and appropriate bytecode! After you have it generating machine code (not a veryefifs more-or-less trivial.

Now, if you want to go further and add optimizations to that compiled code including profiling and other stuff… then it becomes much more complicated, sure.

But going from Vec<BytecodeInstr> to machine code is trivial and obvious step.

That's is why people are not trying to provide safe crates which make this step easy: since the next step (actual generation of machine code) is, by definition, unsafe… why try to avoid it in the first step?

My intuition, which might be wrong, is that the hard part of JIT is the optimizations part, as well as tracing and hot spotting; and that the gap between a "dumb bytecoder compiler" and a "JIT" is huge.

Ah okay. When your original post said the definition of a struct is only known at runtime, I interpreted that as being able to define new classes or add new properties while the code is being executed (similar to JavaScript or Python).

I guess the point about JITs was less about the JIT compilation to machine code and optimisation, and more about how the runtime needs to understand all the types that are available when a program is running so we can generate those fast accessors during execution and when an object's properties are changing.

The major thing about bytecode interpreters is that bytecodes are (generally) not statically typed. Instead, you'll have instructions like "read property X from the object at the top of the stack", and leave it up to the interpreter to figure out access and types. That almost always necessitates a hashmap or similar data structure that looks things up dynamically and stores data on the heap.

That said, if your programming language is statically typed, you know all the properties and types so the bytecode compiler can turn a.b.c into something like "load the variable at the top of the stack, push the element in the object's second slot onto the stack, push that object's third slot onto the stack".