I am implementing the programming language from the book Crafting Interpreters in Rust, and it got me thinking about garbage collection. In the book, the author implements a bytecode interpreter using C, and builds a regular garbage collector. In my implementation, values are represented in rust as an enum whose variants contain their actual representation like so
enum Value {
Number(f64),
Boolean(bool),
Nil,
String(String),
...
}
My stack is represented as a Vec<Value>. I want to store variables local to functions directly on the stack. This way, I can pop them whenever a function finishes executing. The only longer-lived variables will be global ones, which will be stored on a separate data structure. Since Rust automatically frees memory for heap-allocated values when their owner goes out of scope, does that mean I'll get some sort of "GC" for free? My intuition is that whenever I pop things from a function frame their memory will get cleaned up automatically and therefore the only variables I would actually need to worry about cleaning would be the global variables. And it does not seem wrong to me in first sight to never clean global variables since they could be used anywhere in the program. Is my intuition correct or will I actually have to implement some sort of GC myself in this language? How do dynamic languages implemented in Rust usually handle memory allocation?
Academically, “garbage collection” is sometimes used to refer to any automatic memory cleanup strategy. But when people say “GCed language” they usually mean one whose implementations have a garbage collector capable of deallocating cycles of objects — A points to B points to A. You don't get that for free in Rust, because it requires an analysis of the entire graph of pointers between many heap-allocated objects, but Rust only provides you local analysis ("drop this T when the Rc<T> has zero pointers") which will only ever deallocate things that are not in cycles.
If you want to implement this kind of garbage collector in Rust you will end up needing to design your structures to be compatible with it (to be able to ask "what pointers do you contain" of everything on the heap) and use some unsafe code to allocate and deallocate things (just like Rc has unsafe code to allocate and deallocate things).
Manual unsafe isn’t strictly required. You could structure the heap as a HashMap<usize, Box<dyn Any>>, for instance, and store ids instead of pointers. There’s probably also some clever way to write cycle-breaking code on top of Rc, but I don’t know what it would look like.
That makes a lot of sense, I didn't think about reference cycles.
Couldn't I just call the drop method on cyclic references I decided to collect ? I'm thinking if A and B both point to each other and I determined somehow that they are not needed anymore, wouldn't drop(A) then drop(B) clean up both their resources?
If something is shared — via Rc or Arc, then drop(rc_to_a) just drops one of the shared pointers, that isn't the pointer that's part of the cycle, so it doesn't affect the cycle.
You would have to introduce a means to break the cycle, such as making both of the objects store Option<Rc<TheOtherThing>> so that you can set the Option to None, and the garbage collector would need to know to do that to the particular object.
This is the challenge of building a “GCed language” — you need a common protocol for detecting and cleaning up cycles that applies to all objects (except those that you know statically cannot participate in a cycle; e.g. a String cannot since it contains only text).
FWIW, here's a quick-and-dirty stop-the-world GC implementation to get you started. Anything you want to store on the heap needs to impl Trace. You can store it there via Heap::alloc(), which returns a HeapRef that you can use to access the value again.
When it comes time to collect garbage, call Heap::collect(), mark() all of the heap references on the stack, and then finish() the collection.
(untested, but compiles)
use std::collections::{HashMap,BTreeSet};
use std::any::{Any,type_name};
use std::marker::PhantomData as PhD;
use std::cell::RefCell;
pub trait AnyUpcast {
fn as_any(&self)->&(dyn 'static+Any);
fn type_name(&self)->&'static str;
}
impl<T:Any> AnyUpcast for T {
fn as_any(&self)->&(dyn 'static+Any) { self }
fn type_name(&self)->&'static str {type_name::<T>()}
}
pub trait Trace: AnyUpcast+'static {
/// Calls `vac.mark()` on any `HeapRef` reachable from `self`
fn trace<'a>(&self, vac:&Vacuum<'a>);
}
#[derive(Copy,Clone,Eq,PartialEq,Debug)]
pub struct HeapRef<T> {
id: usize,
ty: PhD<fn(&Heap)->&T>,
}
#[derive(Default)]
pub struct Heap {
data: HashMap<usize, Box<dyn Trace>>,
next_id: usize
}
impl<T:Any> std::ops::Index<HeapRef<T>> for Heap {
type Output = T;
fn index(&self, id:HeapRef<T>)->&T {
let any:&(dyn Trace) = &**self.data.get(&id.id)
.expect(&format!("Unknown id {}", id.id));
any.as_any().downcast_ref()
.expect(&format!("Expected type {}, found type {}",
type_name::<T>(), any.type_name()))
}
}
impl Heap {
pub fn alloc<T:Trace>(&mut self, val:T)->HeapRef<T> {
let id = self.next_id;
self.next_id += 1;
self.data.insert(id, Box::new(val));
HeapRef { id, ty: PhD }
}
pub fn collect(&mut self)->Vacuum<'_> {
Vacuum { heap:self, pending: RefCell::new(BTreeSet::new()) }
}
}
pub struct Vacuum<'a> {
heap: &'a mut Heap,
pending: RefCell<BTreeSet<usize>>,
}
impl<'a> Vacuum<'a> {
pub fn mark<T>(&self, id: HeapRef<T>) {
self.pending.borrow_mut().insert(id.id);
}
pub fn finish(self) {
let mut marked = BTreeSet::new();
while let Some(id) = self.pending.borrow_mut().pop_first() {
if marked.insert(id) {
self.heap.data.get(&id).unwrap().trace(&self);
}
}
self.heap.data.retain(|&id, _| marked.contains(&id));
}
}
This is one of those things that is very easy to get the simple thing done, and extremely hard to get done right (handling concurrency especially). Hopefully at least some of the above is useful, though!
I’ll second this. GCs have a tendency to get horribly complex very quickly. Even more so than most software projects, the lower you can make your requirements, the more likely you will be able to succeed. Unless the GC is the main focus of your project, you should probably leave out any GC feature that you can work around not having.