How to structure my code for AST processing

I'm looking into rewriting a compiler in Rust at work. Currently it's in JS and we're having performance issues.
I have some experience with Rust, but I'm by no means an expert.
I have a lexer and parser working, and now I'm on the next part, where we'll be mutating the AST (resolving references, injecting default values, etc.).

My data structure so far: I have a vector of enums for my nodes, and links between nodes are represented as indices into this vector.

struct AstEntity {
	parent: usize,
	children: Vec<usize>,
	//... more specific fields
}

struct AstProperty {
	parent: usize,
	value: usize,
	//... more specific fields
}

enum Node {
	Entity(AstEntity),
	Property(AstProperty),
	//... more node types here
}

nodes: Vec<Node>

The problem now is that I need to walk this vector, mutating nodes as I go: inserting nodes, adding and removing links, etc.
This obviously causes issues with the borrow checker. As a relative newbie I'm tempted to reach for RefCell here, something like nodes: Vec<RefCell<Node>>, or maybe even nodes: RefCell<Vec<RefCell<Node>>>, but maybe there is a better way?
Any tips for working with data structures like this?

It looks like all of your nodes can be made Copy, since all memory resources are represented via indices anyway. In that case you don't need to hold live references into the backing vector for any extended amount of time, and no borrow-checker issues arise. Is that a workable solution for your use case?
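For illustration, a minimal sketch of that pattern (node types are simplified stand-ins for yours; note that a Vec<usize> of children would block Copy, so this sketch uses a single first_child index instead):

```rust
// With Copy nodes, read a node out by value, mutate the copy, write it back.
// No borrow of `nodes` is held across the mutation, so the borrow checker is happy.

#[derive(Copy, Clone)]
struct AstEntity {
    parent: usize,
    first_child: usize, // Vec<usize> is not Copy, so child lists need another scheme
}

#[derive(Copy, Clone)]
enum Node {
    Entity(AstEntity),
    Leaf(u32),
}

fn reparent(nodes: &mut Vec<Node>, idx: usize, new_parent: usize) {
    let mut node = nodes[idx]; // copy out; the borrow ends immediately
    if let Node::Entity(ref mut e) = node {
        e.parent = new_parent;
    }
    nodes[idx] = node; // write the updated copy back
}
```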

Alternatively, why don't you use heap allocations for your nodes (Box, Rc)? If you're writing a compiler, you'll likely need transforming passes anyway, which would mean your Vec either becomes non-contiguous (i.e. many elements will be empty, e.g. None), or each pass would need to create a new vector of nodes. It's quite possible that just using Boxed nodes with a performant memory allocator, like jemalloc, would be enough for your use case. I believe it's good enough for rustc.
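A rough sketch of what the owned-pointer version could look like (field names are illustrative): nodes own their children directly, so tree walks are plain recursion with no index bookkeeping.

```rust
// An AST where nodes own their children via heap allocation instead of
// indexing into a backing vector.

enum Node {
    Entity { children: Vec<Box<Node>> },
    Property { value: String },
}

// Walking the tree is ordinary recursion over owned data.
fn count_nodes(node: &Node) -> usize {
    match node {
        Node::Entity { children } => {
            1 + children.iter().map(|c| count_nodes(c)).sum::<usize>()
        }
        Node::Property { .. } => 1,
    }
}
```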

Even if you decide to go with an arena-based approach (basically storing everything in a vector), you may consider using a proper arena allocator like bumpalo, instead of explicitly messing with indices. Alternatively, you could use a crate which implements a graph data structure, like petgraph (I don't know whether it specifically would fit your use case).
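If it helps, here's a hand-rolled sketch of the arena idea (crates like bumpalo or typed arenas do this properly): the arena owns all nodes and hands out a small typed handle, so call sites never juggle raw usize indices.

```rust
// Minimal typed arena sketch: a newtype handle instead of bare usize.

#[derive(Copy, Clone, PartialEq, Debug)]
struct NodeId(usize);

struct Arena<T> {
    items: Vec<T>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { items: Vec::new() }
    }

    // Allocation is just a push; the returned handle is stable.
    fn alloc(&mut self, item: T) -> NodeId {
        self.items.push(item);
        NodeId(self.items.len() - 1)
    }

    fn get(&self, id: NodeId) -> &T {
        &self.items[id.0]
    }

    fn get_mut(&mut self, id: NodeId) -> &mut T {
        &mut self.items[id.0]
    }
}
```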


Making the nodes Copy is probably the cleanest solution, but I worry about the performance. Some of the nodes have a lot of text in them (which I didn't show, unfortunately), and copying that would probably slow us down. I might convert String to Rc<str>; then I can make the nodes Clone instead. I'll see if that works. My lexer/parser is currently 23x faster than the JS code, and I would like to keep that sort of speed going forward. We have quite large documents (the biggest are 31K lines currently) and speed is very important for us.

One of our features that we use a lot, and that is very slow in the JS version, is that we have nodes that reference other nodes. In JS we have to locate the target node, then deep copy it. I was hoping to use the indices for this, since then resolving a reference is just updating a number. But if I put all my nodes inside Rc then it's still just a simple Clone to resolve a reference. And then the nodes can own their own children, and I don't need the array. Worth exploring.
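To make that concrete, here's a sketch of what I have in mind (types are illustrative): with Rc-backed nodes, "copying" a referenced subtree is just a refcount bump, not a deep copy.

```rust
use std::rc::Rc;

// Nodes shared via Rc: resolving a node reference clones the pointer,
// which is O(1) and shares the subtree instead of deep-copying it.

enum Node {
    Entity { children: Vec<Rc<Node>> },
    Property { value: String },
}

fn resolve(target: &Rc<Node>) -> Rc<Node> {
    Rc::clone(target) // cheap: increments the reference count
}
```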

About my array becoming non-contiguous: that's a good point. Not sure it's a problem yet (we mostly add to and extend the tree), but worth keeping in mind.

I'll explore jemalloc and using Box and Rc more. It seems like a better approach. If I stay with the array, I'll check out a proper arena-based allocator. I'm still very much in the exploring phase, so all options are still open.

If your nodes contain Strings then you cannot make them Copy: String is not Copy, so deriving Copy on such a node will result in a compile error.


31K lines is downright tiny. Pretty much any parser you can write in Rust will be fast enough for that. How many LoC do you need to parse per second? syn uses copious allocations and a very simple lexer, yet it can still parse a couple of MLoC per second.

Your terminology is a bit odd. "Resolving a reference" usually means finding the object that the reference points to. It's a read-only operation (unless you cache something) which doesn't require updating any indices or pointers, or doing any deep copies.

Note that an object inside Rc is immutable (more specifically, you can only get &T from an Rc<T>). This means that if you want to mutate your nodes, which you likely do, you'd need some sort of interior mutability, e.g. Rc<RefCell<T>>. But I wonder, do you really need Rc? Nodes in a parse tree are typically referenced only once, so using Box<T> is usually enough. Rc can be useful if you for some reason want to share parts of the tree between different trees, but that seems a bit of a niche case.
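A tiny sketch of what that interior mutability looks like in practice (the node type here is illustrative):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Rc alone only hands out &T, so a shared node that must be mutated
// needs RefCell for runtime-checked mutable access.

struct Node {
    value: i32,
}

fn bump_value(node: &Rc<RefCell<Node>>) {
    node.borrow_mut().value += 1; // panics if another borrow is live
}
```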

Rc may be more useful for storing stuff like literals and identifiers, which are more commonly referenced from multiple places. It could act as simple string interning, with a HashSet<Rc<str>> used to avoid creating multiple allocations for the same identifier.
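A sketch of that simple interning idea (names are illustrative): reuse an existing Rc<str> when the same identifier text is seen again, so equal identifiers share one allocation.

```rust
use std::collections::HashSet;
use std::rc::Rc;

struct Interner {
    set: HashSet<Rc<str>>,
}

impl Interner {
    fn new() -> Self {
        Interner { set: HashSet::new() }
    }

    // Lookup by &str works because Rc<str> implements Borrow<str>.
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.set.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.set.insert(Rc::clone(&rc));
        rc
    }
}
```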

If you can keep the entire file in memory until you're done working with the parse tree, you can consider representing the textual objects (identifiers, string literals etc) as &'source str, where 'source is the lifetime of the source text. That would mean that creating and accessing text is fast, and the nodes can be kept Copy. You'd need to keep the source text around, but that's usually not a burden. You can even consider putting it in thread_local! storage, if you can afford to use a thread per input file.
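For illustration (field names are made up), a node borrowing its text from the source stays Copy and allocates nothing per token:

```rust
// Token text borrowed from the source: the node is just a fat pointer
// into the original buffer, so it remains Copy.

#[derive(Copy, Clone)]
struct Ident<'source> {
    text: &'source str,
}

// Toy lexing step: take the leading alphanumeric run as an identifier.
fn first_ident<'s>(source: &'s str) -> Ident<'s> {
    let end = source
        .find(|c: char| !c.is_alphanumeric())
        .unwrap_or(source.len());
    Ident { text: &source[..end] }
}
```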


Have a look at compiler/src/compiler/util/shared_array.rs at master · hydroper-jet/compiler · GitHub

I have a mutable SharedArray that acts like a Vec and can be constructed via the shared_array! literal (it uses #![feature(decl_macro)]).

I've used it only in my semantic model though.