Writing a VM in Rust, mutually recursive functions, memory leak

  1. I am writing a VM in Rust. Suppose we allow mutually recursive functions, so at some point we have something like:
let
  f x y = ... refers to g ...
  g x y = ... refers to f ... 

Then, in the bytecode of f, we must either point to g or point to some env that contains g.

Similarly,in the bytecode of g, we must either point to f or point to some env that contains f.

Now, suppose those language has a REPL.

Does this basically guarantee circular refs if we use Rc?

How do VMs written in Rust get around this issue?

  1. Note: so far, we haven't even decided whether the VM language has a GC or not. All we have decided is that the VM language should (a) support mutually recursive functions and (b) have a REPL.

You can make each function definition an Rc, but each use an rc::Weak. Or you can store all definitions in an array and refer to them by index at use. There's many other ways as well.

4 Likes

Is there a list or a blog post somewhere? I'm happy to invest a day or two and understand all the known techniques with using Rc instead of GC.

Well, that turned out to be rather easy, vm.rs#L3474:

pub fn clear_at_exit(&self, gtab: Rc<RefCell<Map>>){
    self.delay.borrow_mut().push(gtab);
}

This prevents the additional branch in Weak pointer dereferencing. That alone might be a worthless micro-optimization, but weak pointers and indices (which could be considered as a kind of weak pointer) have different semantics than Rc. In my system a function holds shared ownership of its module and the gtab (table of global variables) therein.

1 Like

@Finn : Whoa, you've written an actual VM in rust. I hope you don't mind if I try to pick you brain a bit:

Imagine again we want to support the following language feature

let
  f x y = ... refers to g ...
  g x y = ... refers to f ...

We likely go with the following setup

pub enum Instr { ... }

pub struct FuncRepr {
  name: String,
  instrs: Vec<Instr>
}

pub struct Env {
  funcs: Map<String, FuncRepr>
}

So then we have to deal with the question of "how does f refer to g'.
Our options seem to be:

  1. We can do Instr::OtherFunc<Rc<FuncRepr>> -- but that would cause a cycle. Thus, we need to indirectly refer to g or refer to it weakly. So the options seem to be:

  2. WeakRc to G.

  3. StrongRC to Env, then use String to refer to g.

  4. WeakRC to Env, then use String to refer to g.

=====

The problem with (2) is that we can not just pass f around. We need to always keep track of the Env that defined f, g .. and make sure that Env stays alive. Otherwise, the WeakRc to g may end up pointing at nothing.

Similarly with (4), we still have to manually track the enclosing Env, as the WeakRc by itself is not going to keep it alive.

So then this leaves us with (3), where

  • each function has a StrongRc pointing to the Env
  • each Env has a Map of (String, StrongRc<FuncRepr>) pairs
  • functions refer to other functions via (StrongRc<Env>, String)

When we need to break all cycles / allow for Rc, we just clear all the maps (thus allowing all functions to be freed, and thus allowing all non-pointed-at Envs to be freed).

Does this get the high level bits right?

Yup, that gets to the point.

Of course there is nothing wrong with replacing the map with an array. For example, it does not make much sense to use a map for local variables or closure bindings.

And we know that at some point a RefCell must be placed in, unless some other design is exploited.

More detailed information about the design space of cyclic data structures: