Copy-free parser questions


#1

I’m working on a localization system called Fluent. Part of Fluent is our localization DSL called FTL.

While working on the FTL parser, I have so far used an approach that defines the AST with String properties and copies data from the source string into the AST.

One of the ideas we have is to design the parser to be copy-free and just use &str slices of the original string in the AST.

The issue I ran into while working on it is that there are cases where the copying parser alters the data as it reads it. Two examples are comments and escape characters in strings.

Example:

# This is
# a multiline comment
# in FTL
key = Value { $placeable } but this is a regular character \{ 

In the current parser, the result AST will look more or less like this:

Resource {
  body: [
    Comment {
        content: "This is\na multiline comment\nin FTL"
    },
    Message {
        id: "key",
        value: [
            TextElement { content: "Value "},
            Placeable {},
            TextElement { content: " but this is a regular character {"}
        ]
    }
  ]
}

In the copy-free parser, I’m not sure how to deal with those two cases. Should I store the comment with # characters and “parse” the content of the comment again when reading?

What’s the canonical approach to such cases?


#2

The way some parsers deal with this is to give a range of (start, end) offsets into the data being parsed, together with a tag. This is how https://github.com/pest-parser/pest works, for example.
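To make the idea concrete, here is a minimal sketch of tagged offset ranges. The `Tag`, `Span` and `tag_lines` names are made up for illustration and are not pest's actual API, and the line-based tagging (a comment line is anything starting with `# `) is a simplifying assumption rather than real FTL grammar.

```rust
/// Hypothetical node kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag {
    CommentLine,
    Other,
}

/// A tagged byte range into the source. No string data is copied.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    tag: Tag,
    start: usize,
    end: usize,
}

impl Span {
    /// Resolve the span against the source it was produced from.
    fn text<'s>(&self, source: &'s str) -> &'s str {
        &source[self.start..self.end]
    }
}

/// Tag every line. Comment spans exclude the leading `# ` and the newline.
fn tag_lines(source: &str) -> Vec<Span> {
    let mut spans = Vec::new();
    let mut offset = 0;
    for line in source.split_inclusive('\n') {
        let body = line.trim_end_matches('\n');
        let span = match body.strip_prefix("# ") {
            Some(rest) => Span {
                tag: Tag::CommentLine,
                start: offset + 2,
                end: offset + 2 + rest.len(),
            },
            None => Span {
                tag: Tag::Other,
                start: offset,
                end: offset + body.len(),
            },
        };
        spans.push(span);
        offset += line.len();
    }
    spans
}
```

The tree only stores offsets, so whoever reads a node needs the original source at hand to turn a span back into text.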


#3

Do you mean that an AST of the Comment content would be a list of slices for each line, like this:

Comment {
  content: [(start, end), (start, end), (start, end)]
}

and the AST for the Message TextElement would be a list of slices omitting the escape character?
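A rough sketch of what such a list-of-slices AST could look like, using borrowed `&str` chunks instead of raw offsets. The `chunks` field, `parse_comment`, `parse_text` and `materialize` are hypothetical names, and the only escape handled here is `\{`, which is an assumption made to keep the example short.

```rust
/// Hypothetical copy-free nodes: each holds slices borrowed from the source.
#[derive(Debug, PartialEq)]
struct Comment<'s> {
    /// One slice per comment line, with the leading `# ` skipped.
    content: Vec<&'s str>,
}

#[derive(Debug, PartialEq)]
struct TextElement<'s> {
    /// Consecutive pieces of text; the backslash of `\{` falls between pieces.
    chunks: Vec<&'s str>,
}

impl TextElement<'_> {
    /// Copy only when a caller actually needs one contiguous string.
    fn materialize(&self) -> String {
        self.chunks.concat()
    }
}

/// Collect the leading run of `# ` lines as slices of `source`.
fn parse_comment(source: &str) -> Comment<'_> {
    let content = source
        .lines()
        .map_while(|line| line.strip_prefix("# "))
        .collect();
    Comment { content }
}

/// Split `raw` around every `\{`, keeping the `{` but not the backslash.
fn parse_text(raw: &str) -> TextElement<'_> {
    let bytes = raw.as_bytes();
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && i + 1 < bytes.len() && bytes[i + 1] == b'{' {
            if start < i {
                chunks.push(&raw[start..i]);
            }
            start = i + 1; // the next chunk begins at the escaped `{`
            i += 2;
        } else {
            i += 1;
        }
    }
    if start < raw.len() {
        chunks.push(&raw[start..]);
    }
    TextElement { chunks }
}
```

The cost is that consumers see a list of pieces rather than one string, so the joining (and the copy) is only deferred until someone calls something like `materialize`.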


#4

One interesting zero-copy approach to parsing is to produce a concrete syntax tree instead of an AST, and do conversion to AST as a separate pass. This design is described a bit here: https://github.com/matklad/rfcs/blob/libsyntax2.0/text/0000-libsyntax2.0.md#untyped-tree.

Why do you want a copy-free parser for Fluent, though? I would expect that a carefully designed AST could be more compact and efficient to process, and it looks like you don’t need IDE capabilities (which are the main motivation behind the libsyntax2 design).

It seems to me that what you actually want is an efficient data structure: avoiding copies can sometimes lead to efficiency, but not always. For Fluent, I would think something like this makes sense?

struct FluentFile {
    // A single string holding a concatenation of all literals, which gives a single allocation
    // and interning. 
    strings: String,
    // An AST represented as a struct of arrays, with `u32` indices.
    messages: Vec<Message>,
    values: Vec<Values>,
    text_elements: Vec<TextElement>,
}
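To flesh that out a little, here is a small sketch of how such a layout might be filled and read back. Everything beyond the `strings`, `messages` and `text_elements` fields is an assumption for illustration: `StrRef`, `append_str`, `add_message` and `message_text` are hypothetical names, the `values` array is left out for brevity, and `append_str` does not deduplicate, so it is not real interning.

```rust
/// Byte range into `FluentFile::strings`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct StrRef {
    start: u32,
    end: u32,
}

/// A message refers to a contiguous run of entries in `text_elements`.
#[derive(Debug, Clone, Copy)]
struct Message {
    id: StrRef,
    first_element: u32,
    element_count: u32,
}

#[derive(Debug, Clone, Copy)]
struct TextElement {
    content: StrRef,
}

#[derive(Debug, Default)]
struct FluentFile {
    strings: String,
    messages: Vec<Message>,
    text_elements: Vec<TextElement>,
}

impl FluentFile {
    /// Append `s` to the shared buffer and return its range (no dedup here).
    fn append_str(&mut self, s: &str) -> StrRef {
        let start = self.strings.len() as u32;
        self.strings.push_str(s);
        StrRef { start, end: self.strings.len() as u32 }
    }

    fn resolve(&self, r: StrRef) -> &str {
        &self.strings[r.start as usize..r.end as usize]
    }

    /// Store a message and its already-unescaped text; returns its index.
    fn add_message(&mut self, id: &str, elements: &[&str]) -> u32 {
        let id = self.append_str(id);
        let first_element = self.text_elements.len() as u32;
        for element in elements {
            let content = self.append_str(element);
            self.text_elements.push(TextElement { content });
        }
        let index = self.messages.len() as u32;
        let element_count = elements.len() as u32;
        self.messages.push(Message { id, first_element, element_count });
        index
    }

    /// Concatenate the text elements of message `index`.
    fn message_text(&self, index: u32) -> String {
        let m = self.messages[index as usize];
        let range = m.first_element as usize..(m.first_element + m.element_count) as usize;
        self.text_elements[range]
            .iter()
            .map(|t| self.resolve(t.content))
            .collect()
    }
}
```

Since the text is copied into `strings` once, escapes can be resolved during that copy, and the nodes themselves stay small, fixed-size and free of lifetimes.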

#5

I’m on my phone so I can’t give a detailed explanation, but if you look at the pest README, the part after the line “This produces the following output:” should give you an idea of how it works.


#6

You could use Cow to have &str for simple strings without escaped chars, and String where the string had to be changed.
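A minimal sketch of the Cow variant, assuming for simplicity that a backslash escapes whatever single character follows it (real FTL escape rules are more specific) and that a trailing lone backslash is simply dropped. The `unescape` function name is made up for this example.

```rust
use std::borrow::Cow;

/// Borrow when nothing needs changing, allocate only when an escape is present.
fn unescape(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        // Fast path: the AST can point straight into the source.
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Drop the backslash and keep the character it escapes, if any.
            if let Some(escaped) = chars.next() {
                out.push(escaped);
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

An AST field typed `Cow<'s, str>` derefs to `&str` either way, so readers of the AST do not need to care which case they got.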

Another option is to allow source to be modified in-place. In case of the comment and escapes you’d shift other characters left as you skip syntax-related chars:

# This is
a multiline comment
in FTLXX
key = Value { $placeable } but this is a regular character { X
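For the escape case, the shifting could look roughly like the sketch below. It works on a mutable byte buffer, treats a backslash as escaping the single byte after it (an assumption, not the real FTL rules), and overwrites the leftover tail with `X` to mirror the example. `unescape_in_place` is a hypothetical name.

```rust
/// Remove escaping backslashes by shifting later bytes left, in place.
/// Returns the length of the compacted text; the tail is filled with `X`.
fn unescape_in_place(buf: &mut [u8]) -> usize {
    let mut read = 0;
    let mut write = 0;
    while read < buf.len() {
        if buf[read] == b'\\' && read + 1 < buf.len() {
            read += 1; // skip the backslash, keep the byte after it
        }
        buf[write] = buf[read];
        write += 1;
        read += 1;
    }
    // Everything past `write` is stale data left over from the shift.
    buf[write..].fill(b'X');
    write
}
```

Only ASCII backslashes are removed, so a buffer that held valid UTF-8 before the call still holds valid UTF-8 afterwards. The caller does have to own the source mutably and keep track of the new, shorter length.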