Rust grammar in OCaml?

Is there some machine-readable format of the Rust grammar that can easily be transformed into OCaml data structures?

XY problem: I am currently writing an OCaml program that generates Rust .rs code. Right now, this involves lots of string concatenation, which is hacky. I am looking for a solution where I build an in-memory OCaml object that represents the Rust source file, then do AST -> .rs.
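For concreteness, something like the following minimal sketch is what I have in mind. The types and names here are purely illustrative (not from any existing library) and cover only a tiny slice of Rust:

```ocaml
(* Minimal sketch of the "build an AST, then print it" idea.
   The [ty] and [item] types below are hypothetical and cover only
   a tiny, illustrative subset of Rust syntax. *)
type ty =
  | TPath of string                 (* e.g. "u32", "String" *)
  | TRef of ty                      (* &T *)
  | TGeneric of string * ty list    (* e.g. Vec<T> *)

type item =
  | Fn of string * (string * ty) list * ty option * string
  (* Fn (name, params, return type, body kept as raw text for brevity) *)

let rec print_ty = function
  | TPath p -> p
  | TRef t -> "&" ^ print_ty t
  | TGeneric (name, args) ->
      name ^ "<" ^ String.concat ", " (List.map print_ty args) ^ ">"

let print_item = function
  | Fn (name, params, ret, body) ->
      let params =
        String.concat ", "
          (List.map (fun (p, t) -> p ^ ": " ^ print_ty t) params)
      in
      let ret = match ret with None -> "" | Some t -> " -> " ^ print_ty t in
      Printf.sprintf "fn %s(%s)%s {\n    %s\n}\n" name params ret body

let () =
  print_string
    (print_item
       (Fn ("sum", ["xs", TRef (TGeneric ("Vec", [TPath "u32"]))],
            Some (TPath "u32"), "xs.iter().sum()")))
```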

Thanks!

There is no machine-readable grammar for Rust. There used to be an initiative to write a formal grammar for Rust, but nothing has happened with it for years. Maybe you could create a JSON file from OCaml and then have a Rust program parse this JSON file and produce Rust source code using syn? AFAIK OCaml also has a C FFI interface, so you could use that to load a Rust cdylib if you want to avoid a separate program.
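For example, a rough sketch of what the OCaml side of that JSON hand-off could look like, assuming the yojson library is available; the schema here is made up, you would design your own and have the Rust side interpret it:

```ocaml
(* Describe the desired items in OCaml, serialize them to JSON, and let a
   separate Rust program turn that description into source text.
   Assumes the yojson library; the schema is invented for illustration. *)
let fn_as_json ~name ~params ~ret ~body : Yojson.Safe.t =
  `Assoc
    [ ("kind", `String "fn");
      ("name", `String name);
      ("params", `List (List.map (fun (p, t) ->
           `Assoc [("name", `String p); ("ty", `String t)]) params));
      ("ret", `String ret);
      ("body", `String body) ]

let () =
  let j =
    fn_as_json ~name:"sum" ~params:["xs", "&[u32]"] ~ret:"u32"
      ~body:"xs.iter().sum()"
  in
  print_endline (Yojson.Safe.pretty_to_string j)
```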

If Rust is so complex and context-sensitive, could a DCG grammar help?
Maybe someone knows how to write a DCG (perhaps with backtracking) in OCaml or Rust?

I believe the Rust grammar is LL(2) except for one or two places requiring backtracking.

It is not. That is also not why there is no grammar; the reason is that the Rust compiler contains a hand-written parser, so a formal grammar was never needed.

In general, Rust's syntax is very regular and not at all hard to parse; most notably, it avoids the famous context sensitivity that would require intertwining the lexer and the type checker in other C-style languages.

Some very minor edge cases exist, but they are trivial to resolve. For example, disambiguating && (short-circuiting and) vs. & & (reference-to-reference), or >> (integer right shift) vs. > > (closing angle brackets for generics), works by having the lexer give out a single & and store the other one when it encounters && but the parser is looking for a &.
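Since the original question is about OCaml, here is a toy OCaml sketch of that kind of on-demand splitting; the token names are made up and this only illustrates the mechanism, not rustc's actual code:

```ocaml
(* Toy illustration of splitting glued tokens on demand. *)
type token = Gt | Shr | Amp | AndAnd

(* When a single [Gt] is needed (e.g. to close a generic argument list)
   but the next token is the glued [Shr], consume it and push a [Gt] back. *)
let eat_gt (stream : token list ref) : bool =
  match !stream with
  | Gt :: rest -> stream := rest; true
  | Shr :: rest -> stream := Gt :: rest; true    (* split >> into > > *)
  | _ -> false

(* Same trick for & vs. &&. *)
let eat_amp (stream : token list ref) : bool =
  match !stream with
  | Amp :: rest -> stream := rest; true
  | AndAnd :: rest -> stream := Amp :: rest; true  (* split && into & & *)
  | _ -> false
```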

That is how proc macros see it. However, the parser actually gets >> unconditionally as a single token and splits it when it is in a context where > is expected and >> is not allowed.

The lexer always splits all multi-character tokens (see the proc_macro docs) and adds a Spacing field. The parser then uses this to parse into an AST. I would be very surprised if rustc did not use the lexer for regular code.

To the original question: there is syn.json, which is arguably a machine-readable format describing syn's AST node types. I don't know if it constitutes a "grammar", but maybe the parts you need can be generated from that.
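If you want to poke at syn.json from OCaml, something along these lines could work (using yojson; the field names "types" and "ident" are a guess at the schema, so check the actual file and the syn-codegen documentation):

```ocaml
(* Quick look at syn.json from OCaml, using yojson. The field names below
   are an assumption about the file's schema; adjust to the real layout. *)
let () =
  let open Yojson.Safe.Util in
  let doc = Yojson.Safe.from_file "syn.json" in
  doc |> member "types" |> to_list
  |> List.iter (fun node ->
         print_endline (node |> member "ident" |> to_string))
```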

rust/compiler/rustc_ast/src/token.rs at db3c3942ea846c541dd6c34c80fe8470b8a228b1 · rust-lang/rust · GitHub is where the splitting happens in the parser.

rust/compiler/rustc_parse/src/lexer/tokentrees.rs at db3c3942ea846c541dd6c34c80fe8470b8a228b1 · rust-lang/rust · GitHub is where the merging of raw tokens from rustc_lexer into combined tokens like >> happens.
