How to structure this program for my use cases?


#1

Hi there,

I am currently working on a toy-compiler to learn about Rust.
Aaaaand the problems with the borrow- and lifetime-checker have already started! :smiley:

GitHub Repo: https://github.com/Robbepop/cionc/tree/master/src/parser
In my case I have a Lexer that streams its output as Tokens.
Later I am planning to implement a Parser that it transforming these Tokens into an abstract syntax tree etc. just like a default compiler would work these days.

Problems I am currently facing:

  • Tokens shall store &str with important information, e.g. an Identifier should store here its name and a number should store its textual value. The &str shall be a reference to a String object stored within the so-called StringTable which is a thin wrapper around a HashMap<String,String> (better would be HashSet but this currently doesn’t work in the standard). The problem is that I can not have references outside of the StringTable to its contents (in the Tokens) and mutating the StringTable simultaneously and I do need that in my current design.

  • The second problem is that I wanted to have a central management unit that is called the CompileContext which stores several helper utilities requires in different stages of compilation like the StringTable, an ErrorHandler or the SymbolTable etc. Lexer and Parser would then store a reference to this CompileContext entity and would have access to all of its members. This however, is critical since both, Lexer and Parser, may co-exist and thus may borrow things from the CompileContext at the same time.

My final question is: How can I fix these problem in the way Rust is meant to work?

Regards,
Rob


#2

This cannot work in a generic way because modifying an object is allowed to invalidate interior pointers. For instance, std's HashMap will move the objects that it stores whenever it rehashes; this won’t affect the current String which stores data at fixed addresses on the heap, but if you used a version of String that stored values shorter than three pointers inline, then it would invalidate references.

If you want to create a large number of objects and have them referenceable until the end of the compilation cycle, you can use an arena structure (example: typed_arena); by adding strings to the arena, you get references with the same lifetime as the arena itself, which you can then store in the symbol table.

“God objects” are often a design smell and this could indicate a factoring problem (how do the lexer and parser call each other? There’s probably someone else here with more experience factoring compilers in Rust than me). That said, if you need mutable state accessible by many different paths, RefCell is for you. Store your global mutable state in a RefCell and then make shared borrows from every object that needs it.


#3

Hey, thank you very much for your response.

“but if you used a version of String that stored values shorter than three pointers inline, then it would invalidate references.” - oh! I didn’t know that Rust uses short-string optimization. This is awesome thought! :smiley:

However, am I right that using HashMap<String, Box> should solve my problems for iterator invalidation if I have references to my values within the HashMap<String, Box> from the outside?
I can not imagine that the typed_arena can really help me since the reason for the StringTable is to just have every equally comparable string only once. So if I store a “Hello” in my StringTable and want to store another instance of a “Hello” into it, then I get the old “Hello” returned as reference instead.

In my current implementation I am already using the RefCell as wrapper for the StringTable. However, I am getting confusing error messages based on lifetimes when calling the “get_string_table()” method:

lexer.rs:113:17: 113:35 error: cannot infer an appropriate lifetime for autoref due to conflucting requirements

This error isn’t very helpful for me to solve the problem in my code though … :frowning:


#4

The built-in String doesn’t, but you could implement or import your own that does, and the compiler has to be prepared for this.

My idea was that you would keep the HashSet but not store the physical strings in it. Instead, store the strings in an arena, and then use a HashSet<&'arena str>; you can then intern a string by doing a lookup (since references in Rust implement Eq and Hash by value equality, not pointer equality), and inserting into both the arena and the hashset on miss.

Alternatively, dispense with the arena and just have references into your original string. If this is for identifiers, they’re probably already in the original string somewhere.

Alternatively alternatively, dispense with borrowing and just use String or Rc<String> in your token type. It might not be as fast to run (or it might be — premature optimization is the root of all evil after all), but it’d let you finish the project faster. :smile:

Agreed. I still haven’t gotten the hang of lifetime error messages myself. Adding more annotations to force a specific error helps sometimes, but not often.


#5

Hm, guess I have to figure out how to work with the type_arena and the HashSet in the mentioned way, then.
I hope the API for both are allowing for such a use case.

My current codes at github ( https://github.com/Robbepop/cionc/tree/master/src/parser ) in lexer.rs (line 107) as well as in compile_context.rs (line 24) are still errornous and I don’t get why.

The compiler states:
error: cannot infer an appropriate lifetime for autoref due to conflicting requirements [E0495]

I simply don’t understand what this error means at all and can’t even find useful information about it on the web.

Maybe you can help me here?

Thanks!


#6

Maybe get it working first using Rc<String> ? At least you’d have far fewer lifetime parameters to track that way.


#7

I could do that but this doesn’t learn me why the current implementation doesn’t work and I will tap into the same trap in the future again.

The String class doesn’t implement the Copy trait that I use for Token which is a nice convenience thing that I do not want to lose. I could create a String wrapper class but I hope I do not have to.^^

The Rc within the Token could for sure solve the Problem of lifetimes and I will try that out once I am sure about what causes the current implementation to malfunction. =)


#8

You cannot create a wrapper class around String to make it Copy. Types which are Copy are not invalidated when they are moved. If you could copy String, you’d have multiple owners of String’s heap allocated data, which would result in use after frees and double frees.

The problems you are running into seem to me to be because you are trying to do things that Rust fundamentally disallows (because they are potentially unsafe). What you’ve written is very complicated, its not common to see programs with this many lifetime annotations, and though I haven’t understood your program fully enough to understand the errors you’re currently getting, I suspect that its very likely you’re trying to do something impossible.

I think going with the shared ownership approach of using Rcs is a good idea. Then, once the program works, you can find ways to refactor out the Rcs in a more gradual way, rather than trying to tackle this much complexity at once.


#9

I found out that I have used several lifetimes incorrectly which finally led to those wrong and confusing errors.
However, I have still a lifetime problem with the same chunk of code in lexer.rs at line 113.
Github-Link: https://github.com/Robbepop/cionc/tree/master/src/parser

The current error:
error: borrowed value does not live long enough
pointing to “self.context.get_string_table()…”

Further error-info:
note: reference must be valid for the lifetime 'ctx as defined on the block at 107:46
note: … but borrowed value is only valid for the block at 107:46

self.context.get_string_table() returns a RefMut<'self-lifetime, StringTable> which I then use the get_or_insert(...) method on. I don’t know exactly what lifetime is errornous here but it seems to be something with the RefMut I guess.

I will go for the Rc<String> solution (which is only semi-optimal) when everything else fails and when I finally know what causes the errors here. :slight_smile:

Why do I do this? I simply don’t want to be afraid of Rust’s lifetimes and try to avoid them instead of understanding them.