Matching Rust code via macro_rules!

It seems to me that macro_rules! (aka macros by example) could have been designed to make matching normal Rust code easier than it currently is.

Looking at the grammar for enum, for example: if you want to accept and parse an enum definition passed as the argument(s) to a macro_rules! macro (mine is still in progress, I'll post the result here), you'll soon run into some obvious, quite show-stopping limitations:

  1. Can't express matching "THIS or THAT" without also allowing both or neither to match. So you can write $( THIS )? $( THAT )?, which matches: THIS, THAT, THIS THAT, and neither of them (see the first sketch after this list).
    Imagine matching a GenericParam, which in the Reference's EBNF-ish grammar is:
    OuterAttribute* ( LifetimeParam | TypeParam | ConstParam )
    So you can't express that you want to match exactly 1 of those 3 (LifetimeParam, TypeParam or ConstParam); you have to resort to wrapping all 3 in $( )?, so matching 0, 2 or 3 of them at once becomes possible (not just exactly 1 of them, as you actually wanted), and then hope that the invalid combinations fail to compile once transcribed (i.e. delegate the invalidly matched cases to the compiler). That's an ok-ish workaround, let's say, but it introduces the possibility of hitting limitation 2. below, which can prevent even this workaround from being used in the matcher.
  2. The Follow-set Ambiguity Restrictions make it harder to match normal Rust code, given limitation 1. With those 3 blocks each wrapped in $( )?, the end of the 2nd block, which is
    TypeParam : IDENTIFIER ( `:` TypeParamBounds? )? ( `=` Type )?, matches something like $( = $that_type:ty )?, and the next block,
    ConstParam : `const` IDENTIFIER `:` Type ( `=` Block | IDENTIFIER | `-`? LITERAL )?, means you'll be matching
    $( const $some_id:ident ... )?, but now you get a compile error like
    `$that_type:ty` may be followed by `const`, which is not allowed for `ty` fragments, which happens because each of those 3 blocks ( LifetimeParam | TypeParam | ConstParam ) had to be wrapped in $( )?, so all 3 can appear at the same time (and the compiler has to assume that case)! I have such a broken example (work in progress) here.
  3. Another, less severe, issue: if you want to match
    EnumItems : EnumItem ( `,` EnumItem )* `,`?
    so that a lone comma is not accepted, you can't write
    $( EnumItem ),* $(,)? because that matches a lone `,` on its own; instead you have to write:
    $EnumItem_1of2:ident $( , $EnumItem_2of2:ident )* $(,)?
    which means the EnumItem matcher block (which I reduced to a plain ident for this example) is duplicated, and the metavariables must be named differently in each of the two copies; here's an example in the playground of how that might look, even if you then want to merge the two into one via internal rules (see also the second sketch after this list).
  4. Line 102 in this requires me to recursively match, inside the matcher, the very thing the matcher is for: GenericParams (in this case). I'm sure it's possible somehow (I just don't know how at the moment), but for some reason I didn't expect to hit this.
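
To make point 1 concrete, here is a minimal, self-contained sketch (macro and token names invented just for this example) of the $( )? workaround and what it over-accepts:

macro_rules! this_or_that_opt {
    // Each alternative sits in its own optional repetition, so besides the two
    // intended inputs the matcher also accepts "both" and "neither".
    ( $( this )? $( that )? ) => {
        println!("matched");
    };
}

fn main() {
    this_or_that_opt!(this);       // intended
    this_or_that_opt!(that);       // intended
    this_or_that_opt!(this that);  // accepted too, unintentionally
    this_or_that_opt!();           // accepted too, unintentionally
}

And here is a runnable version of the point-3 workaround, with EnumItem reduced to a bare ident (again, the macro name is made up):

macro_rules! count_items {
    // The first item is matched outside the repetition so that a lone `,` cannot
    // match on its own; note the two differently named metavariables.
    ( $first:ident $( , $rest:ident )* $(,)? ) => {
        [ stringify!($first) $( , stringify!($rest) )* ].len()
    };
}

fn main() {
    assert_eq!(count_items!(A, B, C), 3);
    assert_eq!(count_items!(A, B, C,), 3); // trailing comma still allowed
    // count_items!(,); // rejected: a lone comma no longer matches
}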

(there are more limitations, but I won't spam about them at the moment)

So if either limitation 1. or 2. didn't exist, it would be easier to match normal Rust code.

Apparently there's some convoluted way to do it, maybe(?). I'm not sure, because I find it difficult to reason about and understand: here.

Now, is the next-gen macro thing (aka declarative macros 2.0) able to handle this properly? I haven't yet gotten to it in my reading.

And why make it so difficult to match Rust code? (Am I doing it wrong?)
There should be, I say, tests in the rustc repo that try to match full Rust code, e.g. a whole enum definition (based on its grammar as referenced), via macro_rules!, just to assure everyone that the macro system is up to the task, especially since it already tries so hard to ensure it's parsing valid Rust code; why else would the restrictions in 2. be imposed?

Can't?

macro_rules! this_or_that {
    (this) => { println!("this"); };
    (that) => { println!("that"); };
}

fn main() {
    this_or_that!(this);
    this_or_that!(that);
}

Those restrictions are not arbitrary. They are not designed to actively make your life harder. They are probably the least restrictive reasonable set of rules that still allow you to get most stuff done, without shooting yourself in the foot with context-dependent syntax.
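
Applied to the GenericParam case above, the same idea might look roughly like this (a simplified sketch with invented names, nowhere near the full grammar): one rule per alternative, so nothing has to be wrapped in $( )? and the follow-set clash never arises.

macro_rules! generic_param {
    // LifetimeParam (outer attributes and lifetime bounds omitted)
    ( $lt:lifetime ) => {
        concat!("LifetimeParam: ", stringify!($lt))
    };
    // ConstParam, simplified (no default); kept before the TypeParam rule,
    // because `ident` also matches keywords such as `const`
    ( const $id:ident : $ty:ty ) => {
        concat!("ConstParam: ", stringify!($id))
    };
    // TypeParam, simplified to one optional `path` bound and an optional default
    ( $id:ident $( : $bound:path )? $( = $default:ty )? ) => {
        concat!("TypeParam: ", stringify!($id))
    };
}

fn main() {
    println!("{}", generic_param!('a));
    println!("{}", generic_param!(const N: usize));
    println!("{}", generic_param!(T: std::fmt::Debug = String));
}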


TL;DR: You are using the wrong tool for the job. If you want to parse Rust code of arbitrary complexity, just write a procedural macro.

You have a good point; that was too generic and strong a word there. I guess I meant: I can't express that inside other matcher code without having to duplicate the matcher (and transcriber?) blocks that come before/after it, or something like that. Or maybe I haven't properly grasped it yet.

I can't use procedural macros. It seemed like macro_rules! would have / should have worked.

I mean, look at this grammar: Trait and lifetime bounds - The Rust Reference
and I get a compile error (playground):
`path` may be followed by `+`, which is not allowed for `path` fragments: not allowed after `path` fragments
But is that true? That EBNF grammar disagrees:

Syntax
TypeParamBounds :
   TypeParamBound ( `+` TypeParamBound )* `+`?

TypeParamBound :
      Lifetime | TraitBound

TraitBound :
      `?`? ForLifetimes? TypePath
   | `(` `?`? ForLifetimes? TypePath `)`

in macro_rules! it says path: a TypePath style path (here),
so according to that grammar for valid Rust code, a matched path can totally be followed by a `+`, because a TraitBound ends with a TypePath and the next TypeParamBound can start with `+`.
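
(For reference, what I'm trying boils down to roughly the following; this is a reduced reconstruction, not the exact playground code, and uncommenting the definition reproduces the follow-set error quoted above.)

// macro_rules! type_param_bounds {
//     ( $first:path $( + $rest:path )* ) => { /* ... */ };
// }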

Why?

You are confusing "not allowed to appear in Rust code" with "not allowed to appear in a macro matcher". Sometimes a pattern being allowed in the language is exactly the reason for a related matcher to be disallowed, because it would lead to ambiguity.

For example, you can't match $ty:ty<$parm:ident>, precisely because <ident> is a valid suffix for a type (when it's generic), so the parser would have no way to know which one you meant and when to stop matching.
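
The usual way around that particular case is to put a token from ty's follow set between the two fragments, so the parser knows where the type ends. A tiny sketch (macro name invented):

macro_rules! pair {
    // `=>` is in the follow set of `ty`, so this matcher is accepted
    // and the parser knows exactly where the type stops.
    ( $t:ty => $p:ident ) => {
        println!("type = {}, param = {}", stringify!($t), stringify!($p));
    };
}

fn main() {
    pair!(Vec<u8> => x);
}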

1 Like

I'd be happy to use them if they didn't have the restriction of needing to live in their own separate crate. If the restriction were, for example, that proc macros need to be in a separate module, then all good, but a different crate?! Can't do (well, it's the same kind of "can't" as in the THIS/THAT issue, I guess: trying to avoid some kind of maintenance mess).

right, that makes some sense, thanks!

I'm trying to see how that applies to TypePath (aka path in macro_rules!) being followed by a `+`, but I can't immediately find that a `+` is a valid part of a path; maybe I'd have to explore all the inner rules... Or, at worst, it's a bug, though given what you said that seems unlikely.

You can parse a lot using the incremental TT muncher approach.

https://danielkeep.github.io/tlborm/book/pat-incremental-tt-munchers.html
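
For instance, a minimal muncher along these lines (names invented, heavily simplified: it only counts `+`-separated bounds rather than really parsing them) walks the input one token tree at a time and never runs into the `path`-followed-by-`+` restriction:

macro_rules! count_bounds {
    // A leading `+` separates two bounds: bump the count and keep munching.
    ( @munch $n:expr; + $( $rest:tt )+ ) => {
        count_bounds!(@munch $n + 1usize; $( $rest )+)
    };
    // Any other leading token tree belongs to the current bound: skip it.
    ( @munch $n:expr; $skip:tt $( $rest:tt )* ) => {
        count_bounds!(@munch $n; $( $rest )*)
    };
    // Nothing left to munch: the accumulated count is the result.
    ( @munch $n:expr; ) => { $n };
    // Entry point (must come after the internal @munch rules): there is at
    // least one bound, so start the count at 1.
    ( $( $t:tt )+ ) => { count_bounds!(@munch 1usize; $( $t )+) };
}

fn main() {
    assert_eq!(count_bounds!(std::fmt::Debug + Send + 'static), 3);
    assert_eq!(count_bounds!(for<'a> Fn(&'a str) -> bool), 1);
}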

As for avoiding a maintenance mess - I'd say go with proc macros. Using syn, you are much less likely to create something hard to maintain than by doing non-trivial parsing with macro_rules!. There are downsides to proc macros, but I don't think this is one of them.
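
For comparison, the proc-macro version of "parse a whole enum" is mostly delegation to syn. A rough sketch, assuming a separate crate with proc-macro = true, syn 2 with the "full" feature, and quote (the macro name is made up):

// lib.rs of the (hypothetical) proc-macro crate
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, ItemEnum};

// Function-like macro: hand it an enum definition, get back the number of
// variants as a usize expression. syn does all the parsing, so none of the
// macro_rules! alternation or follow-set limitations apply.
#[proc_macro]
pub fn variant_count(input: TokenStream) -> TokenStream {
    let item = parse_macro_input!(input as ItemEnum);
    let n = item.variants.len();
    quote!( #n ).into()
}

// Usage from another crate:
// const N: usize = variant_count!(enum Direction { North, South, East, West });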

3 Likes

Probably better to link the actively maintained version at https://veykril.github.io/tlborm/decl-macros/patterns/tt-muncher.html.

And yes, you can. I know from experience you can, at the least, parse type declarations in macro_rules!.
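
For example, here is a deliberately simplified sketch (invented name; no generics, fields or discriminants) that accepts an enum definition and just re-emits it unchanged:

macro_rules! passthrough_enum {
    (
        $( #[$meta:meta] )*
        $vis:vis enum $name:ident {
            $( $variant:ident ),* $(,)?
        }
    ) => {
        $( #[$meta] )*
        $vis enum $name {
            $( $variant ),*
        }
    };
}

passthrough_enum! {
    #[derive(Debug)]
    pub enum Direction {
        North,
        South,
        East,
        West,
    }
}

fn main() {
    println!("{:?}", Direction::North);
}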

3 Likes