Is there a better way to represent an abstract syntax tree?

Hello, I'm working through Crafting Interpreters and I am reimplementing the interpreter described in the book in Rust.

I got to the point where I need to define the type for an abstract syntax tree and the compiler is (rightfully) telling me that my original attempt was creating a recursive type with infinite size.

The error message has been helpful enough suggesting to use Box to prevent the issues, and in fact the following enum compiler just fine.

pub enum Expr{
  Unary(Operator, Box<Expr>),
  Binary(Box<Expr>, Operator, Box<Expr>),
  Grouping(Box<Expr>),
  Literal(Literal)
}

I am wondering if that's the best representation I can get to or if there is something I am missing to implement it in a better way.

4 Likes

That's basically how rustc's own internals look. Depending on your use cases when working with the AST you might want to swap in reference counted types rather than Box, but that's basically it.

4 Likes

Great! Thanks for your reply!

1 Like

Here are a couple of other options:

Instead of Binary(Box<Expr>, Operator, Box<Expr>) you can do something like

enum Expr {
    ...
    Binary(Box<BinaryExpr>)
}

struct BinaryExpr {
    op: Operator
    lhs: Expr, rhs: Expr,
}

This allows you to implement specific methods for BinaryExprs, and also allows to keep the size of the Expr itself small. I've used this representation here: https://github.com/matklad/miniml/blob/master/ast/src/exprs.rs

Another representation, which is used in the context of IDEs, when you want to have a lossless representation of the source code, as opposed to the abstract syntax tree, is a two-layer one.

On the first layer you have an untyped syntax tree, which basically marks ranges in the source text:

struct NodeType(u32);

struct Node {
  type: NodeType,  
  text_range: (usize, usize),
  children: Vec<Node>
}

on top of that, you have a typed representation, which delegates to the syntax tree

struct BinaryExpression {
    node: Node
}

impl BinaryExpression {
    fn lhs(&self) -> Expression { Expression { node:  self.children[0] } }
}
5 Likes

I like this idea. I'll give it a try. Thanks!

I'm 5 years late, but for posterity, another way is to use Arenas (like bumpalo) or allocate directly on stack frames using a layout like this:

enum Expr<'a>{
    Unary(Operator, &'a Expr<'a>),
    Binary(&'a Expr<'a>, Operator, &'a Expr<'a>)
    Literal(Literal)
}
// Arena Allocation (pass arena around to use these expressions)
let arena = bumpalo::Bump::new();
let literal: &mut Expr = arena.alloc(Literal("true".into()));
let negation_expr = arena.alloc(Expr::Unary(Operation::Negate, literal));

// Stack Allocation (expressions are dropped at end of stack frame, so recursion is needed to use these effectively)
let stack_literal = &Literal("false".into());
let stack_expr = Expr::Unary(Operation::Negate, literal);

Just make sure to pass around the arena to all your functions that need to allocate and keep in mind that nothing is deallocated until you drop the arena.
Using this strategy is really useful if you are doing many, many, operations where you don't need to deallocate a bunch of stuff. It saves so much time in heap allocation because it preallocates the heap. In fact, the rust compiler uses arenas in many cases Memory Management in Rustc - Guide to Rustc Development. The only downside is that you need to annotate everything with lifetime parameters which can be annoying.

2 Likes

Thank you for writing up that reply, but FYI we don't typically revive old threads on this forum (e.g. OP hasn't been active for 3+ years).

(I'll close this thread.)

1 Like