Looking for advice on writing a compiler

Hello, Rustaceans! I’m having two major problems.

  1. I only have half an idea what I am doing.
  2. I’ve picked a massive project to work on.

I’m fairly new to Rust. I’ve read the official book, and written enough to feel comfortable with the language. Past that, I do have several years of programming experience. However, all of it comes from school, so nothing professional, large-scale, or necessarily well structured.

With that level of experience, naturally my instinct was to jump in the deep end.

My goal is to write a SugarCube compiler and language server. Just saying that implies a certain level of complexity, and on top of that SugarCube is a matryoshka language. It’s for writing interactive fiction that can be published on the web, and it does so by building on top of the existing web languages rather than replacing them. That means normal HTML, JavaScript, and CSS, as well as an additional markup syntax and story macros.

The following is completely valid code:

<<set $foo to 10>>
<<set $bar = 15>>
<script>
    console.log("Passage loaded!");
</script>

<div>
    <<if $foo > $bar>>
        <div class="was-true">
            $foo ''>'' $bar
        </div>
    <<else>>
        <div class="was-false">
            $foo ''<'' $bar
        </div>
    <</if>>
</div>

Ignoring a smattering of <br>, after being interpreted it roughly outputs:

<script>
    console.log("Passage loaded!");
</script>

<div>
    <div class="was-false">
        10 <strong>&lt;</strong> 15
    </div>
</div>

As far as I can tell, all of this means that there is no off-the-shelf solution I can use.

Due to the overall complexity and a desire for quality error handling, I’m cautious about using a parser combinator library like nom or a lexer generator like logos. At the same time, it might not be as simple as stringing together libraries to handle the individual parts.
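To make the hand-written route concrete, here is a minimal sketch of a first lexing pass that only separates `<<...>>` macros from everything else and keeps byte spans for diagnostics. The names `Span`, `Token`, `LexError`, and `lex` are made up for illustration, and the closing-delimiter search is deliberately naive: it would misread a `>>` inside a string or a shift expression, so treat it as a starting point rather than a correct SugarCube lexer.

```rust
/// A byte range into the source, kept on every token so diagnostics can point at it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Token<'a> {
    /// Anything outside `<<...>>`: HTML, markup, plain text.
    Text(&'a str, Span),
    /// The raw contents between `<<` and `>>`, not parsed any further here.
    Macro(&'a str, Span),
}

#[derive(Debug, PartialEq, Eq)]
pub struct LexError {
    pub message: String,
    pub span: Span,
}

/// Splits a passage into text and macro tokens.
/// Assumption: the first `>>` after `<<` closes the macro (no string or operator awareness).
pub fn lex(src: &str) -> Result<Vec<Token<'_>>, LexError> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let Some(offset) = src[pos..].find("<<") else {
            // No more macros: the remainder is plain text.
            tokens.push(Token::Text(&src[pos..], Span { start: pos, end: src.len() }));
            break;
        };
        let open = pos + offset;
        if open > pos {
            tokens.push(Token::Text(&src[pos..open], Span { start: pos, end: open }));
        }
        let body_start = open + 2;
        match src[body_start..].find(">>") {
            Some(len) => {
                let close = body_start + len;
                // The macro span covers the delimiters as well as the body.
                let span = Span { start: open, end: close + 2 };
                tokens.push(Token::Macro(&src[body_start..close], span));
                pos = close + 2;
            }
            None => {
                return Err(LexError {
                    message: "unterminated macro: expected `>>`".to_string(),
                    span: Span { start: open, end: src.len() },
                });
            }
        }
    }
    Ok(tokens)
}
```

Calling `lex` on a passage gives either the token list or an error whose span starts at the offending `<<`, which is the kind of information a language server would need for a diagnostic.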

A library like html5ever wouldn’t be able to handle the macro syntax, and extending it to handle SugarCube would basically mean rewriting the entire core of the parser. I could do a pass beforehand that escapes anything HTML can’t handle. However, that still needs some level of parsing, and I expect it to perform poorly.

On the JavaScript front I have some options, but I’m not happy with them. The libraries that exist right now look to be geared towards just transforming JS in some way. Because of that, they do things like completely ignoring comments and whitespace, which I think would be useful to track.
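For illustration, tracking comments and whitespace could look like the lossless tokenizer sketched below, where every byte of the input ends up in some token and concatenating the tokens reproduces the source. `Kind` and `lex_lossless` are hypothetical names, and this is nowhere near a real JavaScript lexer: it knows nothing about strings, regex literals, or template literals, so a `//` inside a string would be misread as a comment.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Whitespace,
    LineComment,
    BlockComment,
    /// Everything else, left as an opaque run for a later pass.
    Code,
}

/// Splits `src` into tokens without dropping anything.
/// Assumption: comment markers never appear inside strings or regex literals.
pub fn lex_lossless(src: &str) -> Vec<(Kind, &str)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let rest = &src[pos..];
        let (kind, len) = if rest.starts_with("//") {
            // A line comment runs up to, but not including, the newline.
            (Kind::LineComment, rest.find('\n').unwrap_or(rest.len()))
        } else if rest.starts_with("/*") {
            // An unterminated block comment swallows the rest of the input.
            let len = rest[2..].find("*/").map_or(rest.len(), |i| i + 4);
            (Kind::BlockComment, len)
        } else if rest.starts_with(char::is_whitespace) {
            let len = rest
                .find(|c: char| !c.is_whitespace())
                .unwrap_or(rest.len());
            (Kind::Whitespace, len)
        } else {
            // Code runs until the next whitespace or comment opener.
            let len = rest
                .char_indices()
                .find(|&(i, c)| {
                    c.is_whitespace()
                        || rest[i..].starts_with("//")
                        || rest[i..].starts_with("/*")
                })
                .map_or(rest.len(), |(i, _)| i);
            (Kind::Code, len)
        };
        out.push((kind, &rest[..len]));
        pos += len;
    }
    out
}
```

Because nothing is thrown away, a formatter or language server built on top of this could map any position back to the original text, comments included.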

I haven’t looked too deeply at CSS at this point, but it may be the only place I can just grab a library and run.

Have I forgotten to take some aspect into consideration? Is there a silver bullet I missed? Am I being too picky? If I am writing all of this by hand, how the heck do I properly do string handling with potentially arbitrary file encodings? Is it as simple as dropping in encoding_rs, or is there some complexity hidden there?
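To make the encoding question concrete, here is a standard-library-only sketch that decodes a source file to a `String` once, up front, so the rest of the compiler only ever sees UTF-8. It is an assumption-laden fallback, not a replacement for a real encoding library: it only recognises UTF-8 (with or without a BOM) and BOM-marked UTF-16, and legacy encodings such as Windows-1252 are out of scope. `DecodeError` and `decode_source` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    /// Byte offset (after any BOM) up to which the input was valid UTF-8.
    InvalidUtf8 { valid_up_to: usize },
    /// Odd byte count or an unpaired surrogate.
    InvalidUtf16,
}

/// Decodes raw file bytes into a `String`, sniffing the byte order mark.
/// Assumption: input without a BOM is UTF-8.
pub fn decode_source(bytes: &[u8]) -> Result<String, DecodeError> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return decode_utf8(&bytes[3..]);
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return decode_utf16(&bytes[2..], u16::from_le_bytes);
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return decode_utf16(&bytes[2..], u16::from_be_bytes);
    }
    decode_utf8(bytes)
}

fn decode_utf8(bytes: &[u8]) -> Result<String, DecodeError> {
    std::str::from_utf8(bytes)
        .map(|s| s.to_string())
        .map_err(|e| DecodeError::InvalidUtf8 { valid_up_to: e.valid_up_to() })
}

fn decode_utf16(bytes: &[u8], to_u16: fn([u8; 2]) -> u16) -> Result<String, DecodeError> {
    if bytes.len() % 2 != 0 {
        return Err(DecodeError::InvalidUtf16);
    }
    // Reassemble 16-bit code units in the byte order indicated by the BOM.
    let units = bytes
        .chunks_exact(2)
        .map(|pair| to_u16([pair[0], pair[1]]));
    std::char::decode_utf16(units)
        .collect::<Result<String, _>>()
        .map_err(|_| DecodeError::InvalidUtf16)
}
```

The design choice worth noting is the boundary: decoding happens once at file load, and every span in the lexer then refers to offsets in the decoded `String`, not in the original bytes.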

Past those questions, I’m open to any tips on being a better programmer in general.
