Rust grammar in OCaml?

Is there some machine-readable format of the Rust grammar that can be easily transformed into OCaml data structures?

XY problem: I am currently writing an OCaml program that generates Rust `.rs` code. Right now, this involves lots of string concatenation, which is hacky. I am looking for a solution where I build an in-memory OCaml object that represents the Rust source file, then do AST -> `.rs`.


There is no machine-readable grammar for Rust. There used to be an initiative to write a formal grammar for Rust, but nothing has happened with it for years. Maybe you could emit a JSON file from OCaml and then have a Rust program parse that JSON and produce Rust source code using syn? AFAIK OCaml also has a C FFI, so you could use that to load a Rust cdylib if you want to avoid a separate program.
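To make the bridge idea concrete, here is a minimal, std-only Rust sketch. A real implementation would use serde_json to parse the input and syn/quote (or prettyplease) to build and print the output; the line-based input format below is purely hypothetical and stands in for JSON to keep the example self-contained.

```rust
// Toy "code generator" end of the bridge: an OCaml program would write a
// machine-readable description of the desired items, and this Rust program
// would turn it into Rust source. The input format here ("fn NAME EXPR",
// one per line) is made up for illustration; real code would parse JSON
// with serde_json and build the output with syn/quote.
fn generate(spec: &str) -> String {
    let mut out = String::new();
    for line in spec.lines() {
        let mut parts = line.split_whitespace();
        if let (Some("fn"), Some(name), Some(expr)) =
            (parts.next(), parts.next(), parts.next())
        {
            out.push_str(&format!("fn {name}() -> i32 {{ {expr} }}\n"));
        }
    }
    out
}

fn main() {
    let spec = "fn answer 42\nfn zero 0";
    print!("{}", generate(spec));
    // Prints:
    // fn answer() -> i32 { 42 }
    // fn zero() -> i32 { 0 }
}
```

The advantage over string concatenation on the OCaml side is that the Rust end can validate the generated code (e.g. by round-tripping it through syn) before writing it out.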


If Rust is so complex and context-sensitive, could a DCG (definite clause grammar) help?
Does anyone know how to write a DCG (perhaps with backtracking) in OCaml or Rust?

I believe the Rust grammar is LL(2), except for one or two places requiring backtracking.


It is not. And that's not why there's no grammar; the reason is that the Rust compiler contains a hand-written parser, so a formal grammar was never needed.

In general, Rust's syntax is very regular and not at all hard to parse; most notably, it avoids the famous context sensitivity of other C-style languages, which requires intertwining the lexer with later stages (the "lexer hack").

Some very minor edge cases exist, but they are trivial to resolve, e.g. disambiguating `&&` (short-circuiting AND) vs. `& &` (reference to a reference), or `>>` (integer right shift) vs. `> >` (two closing angle brackets in nested generics). This works by the lexer handing out a single `&` and storing the second one when it encounters a `&&` while the parser is looking for a `&`.


That is how proc macros see it. However, the parser actually receives `>>` unconditionally as a single token and splits it when it is in a context where `>` is expected and `>>` is not allowed.
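The splitting described above can be sketched as follows. This is an illustrative, std-only model, not rustc's actual code; the token and method names are made up.

```rust
// Sketch of ">>" splitting: the parser receives ">>" as one token and
// splits it whenever it needs a lone ">", e.g. while closing nested
// generic argument lists. Names are illustrative, not rustc's.

#[derive(Debug, PartialEq, Clone)]
enum Tok {
    Ident(&'static str),
    Lt,  // "<"
    Gt,  // ">"
    Shr, // ">>"
}

struct Parser {
    toks: Vec<Tok>,
    pos: usize,
}

impl Parser {
    // Consume one ">". If the current token is ">>", split it: consume
    // one half and leave a plain ">" behind for the next call.
    fn expect_gt(&mut self) -> bool {
        match self.toks.get(self.pos) {
            Some(Tok::Gt) => {
                self.pos += 1;
                true
            }
            Some(Tok::Shr) => {
                self.toks[self.pos] = Tok::Gt;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    // Tokens for `Vec<Vec<i32>>`: the two closing brackets arrive
    // from the lexer as a single `>>` token.
    let mut p = Parser {
        toks: vec![
            Tok::Ident("Vec"), Tok::Lt,
            Tok::Ident("Vec"), Tok::Lt,
            Tok::Ident("i32"), Tok::Shr,
        ],
        pos: 5, // positioned at the ">>"
    };
    assert!(p.expect_gt()); // closes the inner `Vec<i32>`
    assert!(p.expect_gt()); // closes the outer `Vec<...>`
    println!("both closing brackets consumed");
}
```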


The lexer always splits all multi-character tokens (see the `proc_macro` docs) and adds a `Spacing` field. The parser then uses this to build an AST. I would be very surprised if rustc did not use the lexer for regular code.

To the original question: there is `syn.json`, which is arguably a machine-readable format containing the AST node types. I don't know whether it constitutes a "grammar", but maybe the parts you need can be generated from it.


rust/compiler/rustc_ast/src/ at db3c3942ea846c541dd6c34c80fe8470b8a228b1 (rust-lang/rust on GitHub) is where the splitting happens in the parser.

rust/compiler/rustc_parse/src/lexer/ at db3c3942ea846c541dd6c34c80fe8470b8a228b1 (rust-lang/rust on GitHub) is where the merging of raw tokens from rustc_lexer into combined tokens like `>>` happens.
