Should I be using multiple ownership for my Symbol Table?

So I have been working on my compiler for my compilers class. We have just gotten done with name analysis.

The way I have implemented the SymbolTable is using a list of hashmaps. Each hashmap represents a scope. When I enter a scope in the AST, I append a new hashmap to the list. When I encounter a declaration, I put a new entry in the current scope hashmap, with the key being the ID, and the value being a Symbol struct which contains the name and type given to that declaration. When I encounter the use of an ID, I look it up in the current scope's symbol table, and if it's not there, look in the second to the last scope in the list, etc. all the way until the first hashmap in the list, the global scope. If I find the symbol, I give the ID node in the AST a pointer to it. Finally, when I exit a scope, I remove the last hashmap from the list but I don't delete all of the Symbols that it contained, because every ID node in the AST which referred to a Symbol will still need to have access to that Symbol.

Currently, my symbol table and my ID nodes share ownership of a Symbol struct via Rc. But I was reading the comments on this reddit post and now I'm wondering, is this good? My reasoning for using Rc in the first place is that obviously the hashmap can't be the single owner of the Symbol it contains, because I may need the Symbol to persist after the hashmap gets popped. Also, it feels kind of natural to use multiple ownership in this case, since I may have multiple ID nodes that will need to reference the same Symbol; that Symbol can only be freed when there are no ID nodes referencing it; and I have no idea how many (if any) IDs are going to reference a symbol until all of the AST nodes belonging to a scope are actually visited.

One alternative implementation I could think of is to have the ID node associated with the declaration be the single owner of a Symbol, and the ID nodes associated with uses be borrowers of that Symbol. I think this would be valid since I don't ever plan on dropping any of my AST nodes unless I'm dropping the entire tree. However, I'm pretty sure this would require me to make some enum like:

enum OwnedOrBorrowed<'a> {
  Owned(Symbol),
  Borrowed(&'a Symbol),
}

For reference, here is the repo. The relevant files are src/name/symbol.rs and src/ast.rs.

1 Like

If I understand correctly, you would run into complications with references because you'll be trying to alter your table as you progress, but will also be trying to store references to things that live inside the table (or at least live inside the table at some point). The compiler won't be able to prove you don't alter or drop the Symbols you reference, and even if it could, creating a &mut path that can reach the referenced data invalidates the reference.

References are typically more for short-term borrows than long-term structures.

I think an Rc (or Arc) is fine here. An alternative would be to have some sort of program-wide storage of Symbols and to store a unique index or key everywhere else.

Incidentally, the enum you describe is std::borrow::Cow.

3 Likes

What's in Symbol and what exactly is an ID? It identifies what? Depending on how cheap they are to clone, you could just clone them.

I won't be altering the table as I progress. Once an entry gets added to the table representing the current scope, it will stay there and will not change until the current scope is exited and the table is dropped. Nor do the Symbol structs themselves need to be mutable at all.

IDNode:

pub struct IDNode<'a> {
    pub name: &'a str,
    pub symbol: Option<Rc<Symbol<'a>>>,
    pub pair: pest::iterators::Pair<'a, Rule>,
}

Symbol:

pub struct Symbol<'a> {
    pub name: &'a str,
    pub typ: SymbolType,
}

SymbolType:

pub enum SymbolType {
    // Every type in the AST, plus a special function variant which
    // encapsulates the argument types and return type
    Int,
    Short,
    Bool,
    Str,
    Void,
    Ptr(Box<SymbolType>),
    Fn { args: Vec<SymbolType>, ret: Box<SymbolType> },
}

An IDNode represents an identifier in the programming language. It contains a string slice which has the actual identifier used, and a reference to its Symbol. For example, if I had a program:

// This is C--, the programming language my class is writing a compiler for
int main() {
  string a;
  a = "hello";
  write a;
  return 0;
}

Then the AST generated for this program would have four IDNodes in it: one for the function declaration main, one for the variable declaration a, and two for the uses of a. When my compiler performs name analysis, it will encounter the declaration of a and add an entry to the symbol table for the current scope. The key of this entry is the name ("a" in this case), and the value is a Symbol struct which will contain the name for convenience as well as the type (Str in this case). Declaring the same variable in the same scope multiple times is an error in C--, therefore entries aren't ever mutated in a hash map once they are initially added.

When a = "hello"; and write a; are encountered, I need to check if a has been declared first, and then I need to look at what the type of a is so that I can see if these are valid operations being performed on it. I do this by looking in the symbol table for a Symbol at key "a", and if I find one, I give the respective IDNodes a reference to this Symbol. This will help me to perform type analysis, so that when I walk the AST doing type analysis and encounter an IDNode being used in an expression somewhere, I can look at the IDNode -> Symbol -> SymbolType and see if it's being used in a valid way given the type.

I imagine that I will also need to refer to these Symbols later when I start generating intermediate codes, so that every use of a given identifier is tied to the same Symbol which is tied to some location in memory.

When I said "ID" in my first post, I meant an identifier in a C-- program, like main or a. When I said "ID node", I meant the struct that my compiler actually uses to represent these tokens: an IDNode. Sorry for the confusion.

Does this make more sense?

Yes. In this case, you could probably get away with just copying everything. If you are worried about copying the Vec in SymbolType::Fn, you could intern (memoize) types.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.