Copy-free parser questions

zbraniecki · January 19, 2018, 5:00am

I'm working on a localization system called Fluent. Part of Fluent is our localization DSL called FTL.

When working on the FTL parser, I have so far used an approach that defines the AST with String properties and copies data from the source string into the AST.

One idea we have is to design the parser to be copy-free and just store &str slices of the original string in the AST.

The issue I ran into is that in some cases the copying parser alters the data as it reads it. Two examples are comments and escape characters in strings.

Example:

# This is
# a multiline comment
# in FTL
key = Value { $placeable } but this is a regular character \{

In the current parser, the result AST will look more or less like this:

Resource {
  body: [
    Comment {
        content: "This is\na multiline comment\nin FTL"
    },
    Message {
        id: "key",
        value: [
            TextElement { content: "Value "},
            Placeable {},
            TextElement { content: " but this is a regular character {"}
        ]
    }
  ]
}

In the copy-free parser, I'm not sure how to deal with those two cases. Should I store the comment with # characters and "parse" the content of the comment again when reading?

What's the canonical approach to such cases?

emoon · January 19, 2018, 5:58am

The way some parsers deal with this is to store a tag together with a range of (start, end) offsets into the data being parsed. This is how https://github.com/pest-parser/pest works, for example.
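A minimal sketch of the offset idea, applied to the comment from the first post. This is not pest's API; `Span`, `Tag`, `Node`, `comment_lines` and `comment_text` are hypothetical names made up for illustration. Each comment line becomes a tagged byte range into the source, and the joined text is only built when someone asks for it.

```rust
/// Byte range into the original source; no text is copied.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Hypothetical tags for the pieces a parser might record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    CommentLine,
    Text,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Node {
    pub tag: Tag,
    pub span: Span,
}

impl Span {
    /// Borrow the text this span covers from the source string.
    pub fn text<'s>(&self, source: &'s str) -> &'s str {
        &source[self.start..self.end]
    }
}

/// Record one span per comment line, excluding the leading "# " marker
/// and the line ending.
pub fn comment_lines(source: &str) -> Vec<Node> {
    let mut nodes = Vec::new();
    let mut offset = 0;
    for line in source.split_inclusive('\n') {
        if let Some(rest) = line.strip_prefix('#') {
            let rest = rest.strip_prefix(' ').unwrap_or(rest);
            let start = offset + (line.len() - rest.len());
            let content = rest.trim_end_matches(&['\n', '\r'][..]);
            let span = Span { start, end: start + content.len() };
            nodes.push(Node { tag: Tag::CommentLine, span });
        }
        offset += line.len();
    }
    nodes
}

/// Join the comment lines on demand; this is the only step that allocates text.
pub fn comment_text(source: &str, nodes: &[Node]) -> String {
    nodes
        .iter()
        .map(|n| n.span.text(source))
        .collect::<Vec<_>>()
        .join("\n")
}
```

The trade-off is that consumers need the source string at hand whenever they want to read a node's text.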

zbraniecki · January 19, 2018, 6:36am

Do you mean that an AST of the Comment content would be a list of slices for each line, like this:

Comment {
  content: [(start, end), (start, end), (start, end)]
}

and the AST for the Message TextElement would be a list of slices omitting the escape character?

matklad · January 19, 2018, 8:49am

One interesting zero-copy approach to parsing is to produce a concrete syntax tree instead of an AST, and do conversion to AST as a separate pass. This design is described a bit here: https://github.com/matklad/rfcs/blob/libsyntax2.0/text/0000-libsyntax2.0.md#untyped-tree.

Why do you want a copy-free parser for fluent though? I would expect that a carefully designed AST could be more compact and efficient to process, and looks like you don't need IDE capabilities (which are the main motivation behind libsyntax2 design).

It seems to me that what you actually want is an efficient data structure: avoiding copies can sometimes lead to efficiency, but not always. For Fluent, I would think something like this makes sense?

struct FluentFile {
    // A single string holding a concatenation of all literals, which gives a single allocation
    // and interning. 
    strings: String,
    // An AST represented as a struct of arrays, with `u32` indices.
    messages: Vec<Message>,
    values: Vec<Values>,
    text_elements: Vec<TextElement>,
}
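To make the layout above a bit more concrete, here is a rough sketch of how it could be filled and read. Everything beyond the field names in the struct above is an assumption: `StrRef`, `append`, `resolve`, `add_message` and the accessors are hypothetical, the `values` vector is left out to keep it short, and `append` does not deduplicate, so it is not real interning.

```rust
/// Range into `FluentFile::strings`, stored as `u32` to keep nodes small.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StrRef {
    start: u32,
    len: u32,
}

#[derive(Debug, Clone, Copy)]
pub struct TextElement {
    content: StrRef,
}

#[derive(Debug, Clone, Copy)]
pub struct Message {
    id: StrRef,
    // Range of indices into `FluentFile::text_elements`.
    first_element: u32,
    element_count: u32,
}

#[derive(Debug, Default)]
pub struct FluentFile {
    // All literals concatenated into a single allocation.
    strings: String,
    messages: Vec<Message>,
    text_elements: Vec<TextElement>,
}

impl FluentFile {
    /// Append a literal to the shared buffer (no deduplication in this sketch).
    fn append(&mut self, s: &str) -> StrRef {
        let start = self.strings.len() as u32;
        self.strings.push_str(s);
        StrRef { start, len: s.len() as u32 }
    }

    /// Turn a reference back into a borrowed string slice.
    pub fn resolve(&self, r: StrRef) -> &str {
        let start = r.start as usize;
        &self.strings[start..start + r.len as usize]
    }

    /// Store a message made of plain text parts and return its index.
    pub fn add_message(&mut self, id: &str, parts: &[&str]) -> usize {
        let id = self.append(id);
        let first_element = self.text_elements.len() as u32;
        for part in parts {
            let content = self.append(part);
            self.text_elements.push(TextElement { content });
        }
        let element_count = parts.len() as u32;
        self.messages.push(Message { id, first_element, element_count });
        self.messages.len() - 1
    }

    pub fn message_id(&self, index: usize) -> &str {
        self.resolve(self.messages[index].id)
    }

    pub fn message_text(&self, index: usize) -> Vec<&str> {
        let m = self.messages[index];
        let first = m.first_element as usize;
        self.text_elements[first..first + m.element_count as usize]
            .iter()
            .map(|t| self.resolve(t.content))
            .collect()
    }
}
```

Because unescaped text is copied into `strings` once, the escape and comment cases stop being a problem, at the cost of one extra buffer per file.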

emoon · January 19, 2018, 8:51am

I'm on my phone so I can't give a detailed explanation, but if you look at the pest readme, the part after the line "This produces the following output:" should give you an idea of how it works.

kornel · January 19, 2018, 11:15am

You could use Cow to have &str for simple strings without escaped chars, and String where the string had to be changed.
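A small sketch of the Cow idea, assuming a simplified escape rule where a backslash makes the next character literal (the real FTL rules may differ). `unescape` is a hypothetical helper name. Text without a backslash is returned as a borrowed slice, and an owned String is only built when something has to be removed.

```rust
use std::borrow::Cow;

/// Return the text with backslash escapes resolved.
/// Assumed rule: `\x` stands for a literal `x`.
pub fn unescape(raw: &str) -> Cow<'_, str> {
    // Fast path: nothing to change, so borrow from the source.
    if !raw.contains('\\') {
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Drop the backslash and keep the escaped character.
            // A trailing backslash with nothing after it is kept as-is.
            match chars.next() {
                Some(next) => out.push(next),
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

An AST field typed as `Cow<'s, str>` would then only pay for an allocation in the elements that actually contain escapes.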

Another option is to allow the source to be modified in place. In the case of the comment and the escapes, you'd shift the remaining characters left as you skip syntax-related chars (X marks the leftover bytes):

# This is
a multiline comment
in FTLXX
key = Value { $placeable } but this is a regular character { X
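Roughly, for the escape case, the shifting could look like the sketch below. `unescape_in_place` is a hypothetical name, and the escape rule (a backslash makes the next byte literal) is a simplifying assumption. It works on bytes, which is safe for UTF-8 input because a backslash byte never occurs inside a multi-byte character.

```rust
/// Remove backslashes by shifting the following bytes left.
/// Returns the new logical length; bytes past it are leftover junk.
pub fn unescape_in_place(buf: &mut [u8]) -> usize {
    let mut read = 0;
    let mut write = 0;
    while read < buf.len() {
        // Skip a backslash if there is a byte after it to keep.
        if buf[read] == b'\\' && read + 1 < buf.len() {
            read += 1;
        }
        buf[write] = buf[read];
        read += 1;
        write += 1;
    }
    write
}
```

The caller has to own the buffer mutably and keep track of the returned length, so this only fits when the original source text is not needed afterwards.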
