Best way to represent AST?

I am implementing an interpreter like [Structure and Interpretation of Computer Programs, 2e: Chapter 4] with a notion of syntax encoded in an AST. As I've defined it, the syntax involves a lot of heap lookup:

struct Expression {
  Plus(Box<Expression>, Box<Expression>),
  Minus(Box<Expression>, Box<Expression>),
  Lambda(...)
  ...
}

This means that when I am trying to access an in e.g. a loop, there will be many references to the heap. Is it possible to do this more efficiently?

Have you measured the impact and determined that the heap access is a problem?

In any case, this blog post provides some ideas, but the code is not as idiomatic I'd say: Elegant and performant recursion in Rust :: [ recursion.wtf ].

1 Like

You don't have to box each individual field. You can eg. define a BinaryExpr struct that contains two Expressions directly, and only box that when necessary.

But in general, you can't build arbitrary, dynamic, recursive trees without indirection (and managing ownership properly likely means that this kind of indirection will always involve heap allocation).

The most expensive parts of compilers usually aren't the AST traversal. Reasonably-sized code bases (a couple 1000s or tens of 1000s of lines) can be parsed and traversed in a few dozen milliseconds on modern hardware. The AST won't be the bottleneck. (Just like traversing the DOM is usually not the bottleneck while rendering a web site, even though HTML pages usually have much more DOM nodes in general than AST nodes in a program).

Slowness kicks in when you start performing sophisticated analyses on the code and/or want to optimize it. For example, lexical and syntactic analysis of Rust is a tiny fraction of the compiler's running time. Most of the time is spent in type checking by rustc and then optimizing the emitted IR in LLVM. Oh, and yeah, linking, if that weren't surprising enough.

6 Likes

But when is necessary? Suppose the syntax is just Add(A, B), Multiply(A, B), and Leaf Numbers. Won't that require Add to be Add(Box, Box)? How can you do it otherwise?

That is exactly what I described in the first paragraph of my previous post.

struct BinOp {
    lhs: Expr,
    rhs: Expr,
}

enum Expr {
    Add(Box<BinOp>),
    Mul(Box<BinOp>),
    Num(f64),
}
2 Likes

This halves the number of references, but is there an easy way to significantly reduce it? If there are many expressions of different length, it seems optimal to allocate blocks of size 2^N to balance lookup cost against allocation space (never implementing more than 2x memory required).

Is there an ergonomic way to do this in Rust?

Doing pointer arithmetic into a single block is not more expensive than doing pointer arithmetic into individually-allocated nodes. If you want to amortize the cost of allocations, at the expense of having to keep the whole tree owned at all times, then use an arena.

If you care about interpretation performance, you could add a pass where you convert the parse tree to byte code.

1 Like

you shift the indirection/recursion one layer higher or lower, depending on your perspective. its like how you can define a tree node structure in different manners, e.g.:

enum Node<T> {
    Branch {
        children: Vec<Box<Tree>>,
    },
    Leaf {
        value: T
    }
}
struct Tree<T> {
    root: Node<T>,
}

vs

enum Tree<T> {
    Null,
    Branch {
        root: Box<Branch>
    },
    Leaf {
        value: T
    },
}
struct Node<T> {
    children: Vec<Tree<T>>,
}

this kind of transformation can be applied to mutually recursive types.

1 Like

Let's see how the rustc handle it. It is a practical compiler project, right?

What the heck is the P?

Ok, cool.

9 Likes

It's probably not worth worrying about at this point in time. Or at all, really.

Don't forget that most managed languages (Java, JavaScript, C#, etc.) use heap-allocated objects for essentially every single object they create, yet their performance is good enough on modern CPUs to implement things like the TypeScript compiler and the C# compiler in.

The best[1] AST design I've come across so far is un-typed syntax trees in the style of rust-analyzer.

It's a bit hard to describe, so I'll link to some references...

Matklad, the original author of rust-analyzer created an amazing tutorial on implementing a production-grade resilient parser. 10/10 would recommend.

Here is a theoretical explanation of how un-typed syntax trees work:

If you are more visually inclined, here is the first stream he made when explaining how rust-analyzer deals with syntax. That, and the next 3 or 4 videos are all focused on different parts of the syntax tree representation, parser, and parser test suite.


  1. In my opinion, anyway. Things I care about are

    • Simplicity and maintainability of the parser
    • Resilience and the ability to report multiple syntax errors
    • Retaining all information from the source code so you can do things like programmatically read comments
    • Structural sharing or reference-counting for nodes (makes cloning trivial, which lets you avoid contaminating code with lifetime annotations)
    ↩ī¸Ž
4 Likes

Sorry I meant, how can you do it other than the ways we both said? We're both using O(...) the same number of boxes. It would be great to dump a simple formula on the stack and then work with it there rather than referencing the heap a lot.

These answers have all been incredibly helpful, thank you.

It's irrelevant if it's heap or stack, what matters is cache locality. Which is why people mentioned arenas.

4 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.