Why is the invocation of this declarative marcro an error?

macro_rules! separate_each_ident {
    ($(id:ident ::)* $id2:ident)=>{}
}
fn main(){
    separate_each_ident!(A::B::C);
}

The invocation won't be compiled, the compiler reports

error: unexpected end of macro invocation

In my mind, $(id:ident ::)* means there are zero or more than one constructs that must be an identifier followed by the operator ::, in this example, they are A::, B::, and C should be matched with $id2. Why does the compiler not parase the invocation in this way?

Moreover, consider this example:

macro_rules! another_macro {
	($($m:ident)::* $id2:ident) => {
	};
}
fn main(){
    another_macro!(A::B::C E);
}

The compiler will report the error:

local ambiguity when calling macro another_macro: multiple parsing options: built-in NTs ident ('m') or ident ('id2').

However, as indicated by $($m:ident)::*, which means there are zero or more than one identifiers that are separated by the operator ::, in this example, A, B, C are separated by ::, and hence E matches $id2. There shouldn't exist other possibilities. Why does the compiler report the ambiguity errror?

1 Like

I suspect the compiler error you got in your first example was because of a typo more than a misunderstanding on your part : v) It should be:

 macro_rules! separate_each_ident {
-    ($(id:ident ::)* $id2:ident)=>{}
+    ($($id:ident ::)* $id2:ident)=>{}
 }
 fn main(){
     separate_each_ident!(A::B::C);
 }

Whenever you declare a token, it needs to be of the form $name:type, even if it's inside of a container like $(...)*. The reason for this is you might want to match multiple tokens per repetition.

However, even after fixing this, you'll run into the same error you got in your 2nd example!

In my mind, $(id:ident ::)* means there are zero or more than one constructs that must be an identifier followed by the operator ::, in this example, they are A::, B::, and C should be matched with $id2. Why does the compiler not parase the invocation in this way?

Your comment is actually completely correct! $($id:ident::)* will match zero or more identifiers followed by :: and id2:ident will match a trailing identifier. Both true. However, macros rules must be fully unambiguous. They aren't a simple regex matcher that takes the first match that works.

And looking at the string A::B::C there are 3 ways the macro parser could match this:

  1. It matches the repetition twice (A:: and B::), and the trailing ident once (C)
  2. It matches the repetition once (A::), and the trailing ident once (B), then fails on ::C
  3. It matches the repetition zero times, and the trailing ident once (A), then fails to match ::B::C

Now, to us humans, obviously #1 is the only sane way to parse this thing, since the others will fail for your string; but the parser works one token at a time! So when it starts parsing the string and all it sees is A, it can't know which path to take (unlike us humans who can read with infinite lookahead).


So, how to fix this?
Easy, just switch the order of your macro!

 macro_rules! separate_each_ident {
    ($id:ident $(:: $ids:ident)*)=>{}
 }

Here there is no ambiguity, because the parser always has to match an ident at the beginning, and then the only rule left is to match the repetition. Whereas in the original macro, the parser always had a choice to make: "do I keep trying to match this repetition? Or do I skip it and start matching tokens after it?"

4 Likes

The last two choices imply that the compiler does not match $($id:ident ::)* as much as possible, i.e. the longest sequence that can form id::, if we think in this way(i.e. the way the compile think), then the following would have more choices too:

macro_rules! try_match{
    ($($id:ident ::)*)=>{}
}
fn main(){
   try_match!(A::B::C::);
}

By the above logic, the choices can be:

  1. It matches the repetition once (A::), and fails on B::C::
  2. It matches the repetition twice (A:: and B::), and fails on C::
  3. It matches the repetition three times (A::, B::, and C::), and succeed

I drew up some graphs to show how the macro will attempt to parse it's provided inputs.
Note that these aren't completely technically correct, but they're good enough for explanation!
If you really want to get into the nitty gritty, looking up "Context Free Grammars". But be warned, it can turn into a pretty deep rabbit hole... anyways


On the left is macro you just mentioned (which is totally fine), on the right is the original macro you posted (which was ambiguous).

The red circles represent the parser's current state. They're just random letters with no meaning.
The macro parser starts in start, then checks what the next token is to decide which state to move to next. So, for the left macro, typing My:: will cause the following to happen:

  1. Macro starts in start and sees an ident token is next ("My"), so it moves to state A.
  2. It sees a :: token is next, so it moves back to the start state.

The reason why the right graph is ambiguous is because if the parser starts in start then sees an ident token, there are 2 paths it can take! One for id and one for id2. And it can't know which is more correct.

  • If it moves to C this corresponds to skipping the repeated tokens.
  • If it moves to A this corresponds to going into the repeated tokens.

I don't know the reason, but that indeed is the case, repetition is not greedy. the following is an issue for the exact same problem, you can also read the related issues if you are want to dig deeper:

1 Like

In the right-hand case, for tokens X::Y::Z do you mean at the start of parsing, the compiler can consider the identifier X to go the way along either $id:ident or $id2:ident?

Yep! That's exactly what the problem is.

If you look at the error message you got originally, this is what it's attempting to explain:
local ambiguity when calling macro another_macro: multiple parsing options: built-in NTs ident ('m') or ident ('id2').
It says there are "multiple parsing options" (ie: multiple paths it could take).
And those 2 paths are for ident(m) and ident(id2), but in my example I renamed m to id.

It's not an easy error message to decipher, but that's what it's saying.

However, doesn't $($id:ident ::)* appear first that precedes $id2:ident? It doesn't make sense to match an identifier that would first match the second $id2:ident instead of matching it to the first declared pattern. In my mind, it should be as much as possible to match the first pattern because it appears before the second one. In your explanation, the compiler seems to directly ignore the first one to match the token with the second one, which results in the unnecessary ambiguity.

However, doesn't $($id:ident ::)* appear first that precedes $id2:ident?

Nope, context free grammars don't have a notion of precedence.
Theoretically, swapping the order of these two things shouldn't make a difference.

In your explanation, the compiler seems to directly ignore the first one to match the token with the second one, which results in the unnecessary ambiguity.

Not having precedence might seem really stupid, but there's no way around it.
Even with precedence the ambiguity would still exist, it would just be hidden.

Let's say that macros did use this kind of order-based precedence, and look at how it would parse A. Just A, nothing else.

Looking at $(id:ident ::)* $id2:ident, the parser sees two paths it can take, id and id2.
But with our new rules, it gives id precedence. But now we have a problem, because the parser expects to see a ::, but it isn't there!

Adding a precedence rule would technically fix the ambiguity, but it also makes id2 useless.
The macro parser would never match it, because whenever it sees an ident, it will give id precedence.

These ambiguities aren't the result of "no precedence", but the result of the parser only being able to work one token at a time.

2 Likes

Looking at $(id:ident ::)* $id2:ident, the parser sees two paths it can take, id and id2.

I am curious about why the ambiguity can be eliminated when we insert a , comma between the $(id:ident ::)* and $id2:ident? That is, the compiler now can see what will match id and what will match id2.

In this case, compiler can't match id2 before it successfully matches comma, it's that simple.

macro_rules! try_match{
    ($($id:ident ::)+ 𑄭)=>{}
}
fn main(){
   try_match!(A:: 𑄭  );
}

Note 𑄭(\u{1112d}) is not an identifier because an identifier must start with XID_Start why this example is not correct?

I guess that's the exact reason - there is no lexable construct (note that macro input must be lexed in any case) starting with this symbol. This example is incorrect even if you comment out macro invocation - macro definition already triggers the error.

So, back to the macro issue, do you mean a construct keeps matching the meta-variable until that construct can't be the kind of that meta-variable?

macro_rules! separate_each_ident {
    ($($id:ident ::)* $id2:ident)=>{}
}
fn main(){
    separate_each_ident!(A::B::C);
}

In this example, after matching A:: with $id:ident ::, when the compiler encounter B, it knows it's an identifier, so B can either match $id:ident or $id2:ident even though B followed by :: that can be more suitable to match $id:ident ::?

Macros doesn't know, what the current token (or, strictly speaking, the current token tree, i.e. either a single token or a parenthesized group) is followed by. It knows only this specific token tree, nothing more.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.