Tricky macro problem: Accept a sequence of possibly empty items (e.g. 1,2,,3) AND handle trailing commas properly

Hi,

In the library that I am working on, I would like to store sequences of sequences of things compactly in memory by prepending each subsequence with its length. (In reality the things are symbols, and I use a union type for the element, and everything is properly encapsulated, but for simplicity integers will do here.)

So, for example, in the program below, the macro invocation seq_of_seqs![-1 -2, , -3] evaluates to

[
    2,  // Number of elements in first subsequence
    -1, // First element of first subsequence
    -2, // Second element of first subsequence
    0,  // Number of elements in second subsequence
    1,  // Number of elements in third subsequence
    -3, // First and only element of third subsequence
]

Here is the full code

// Based on
// https://veykril.github.io/tlborm/decl-macros/building-blocks/counting.html#bit-twiddling
#[doc(hidden)]
#[macro_export]
macro_rules! _count_exprs {
    () => { 0 };
    ($single:expr) => { 1 };
    ($odd:expr, $($a:expr, $b:expr),*) => { ($crate::_count_exprs!($($a),*) << 1) | 1 };
    ($($a:expr, $even:expr),*) => { $crate::_count_exprs!($($a),*) << 1 };
}

#[macro_export]
macro_rules! seq_of_seqs {
    () => {{
        [0; 0]
    }};
    ($($($number:literal )*),* ) => {{
        [
            $(
                $crate::_count_exprs!($($number),*)
                $(, $number )*
            ),*
        ]
    }};
}

fn main() {
    dbg!(seq_of_seqs![-1 -2, , -3]);
}

This works, but the problem is that in accordance with Rust conventions, I would like to ignore trailing commas. I.e. I would like seq_of_seqs![-1 -2, , -3,] to evaluate to the same thing as above, but since I also need to support empty subsequences, seq_of_seqs![-1 -2, , -3,,] should evaluate to

[
    2,  // Number of elements in first subsequence
    -1, // First element of first subsequence
    -2, // Second element of first subsequence
    0,  // Number of elements in second subsequence
    1,  // Number of elements in third subsequence
    -3, // First and only element of third subsequence
    0,  // Number of elements in fourth subsequence
]

Note that this is conceptually coherent. (Incidentally, I discovered that this is how JavsScript handles array literals.)

I know how to either

  • accept empty subsequences without handling trailing commas properly (that’s the example above),

  • or accept trailing commas when no empty subsequences are allowed.

But despite a lot of trying and searching, I have not found a solution to have both. It’s probably not helpful if I list all the things that I have tried (like explicitly matching sequences of commas with nothing in-between), because none of them have worked.

I would be grateful for any ideas or suggestions.

You could try this (playground):

The first, hidden macro can be simplified (unless I misunderstood what it had to do). For the second one, you have to separate the cases by moving the comma inside the repetition, which forces a last comma in the pattern. Then you must take into account the case without ending comma (fortunately, you can express it with the macro itself).

What's nice is that you can use a pattern like $(x,)* and instantiate it on the right side with $(x),*, which will create a nice comma-separated list, without a trailing comma, even if there was one in the source. :slight_smile:

#[doc(hidden)]
#[macro_export]
macro_rules! _count_exprs {
    () => { 0 };
    ($_:expr $(, $a:expr)*) => { 1 + $crate::_count_exprs!($($a),*) };
}

#[macro_export]
macro_rules! seq_of_seqs {
    () => {{
        [0; 0]
    }};
    ($($($number:literal)*,)*) => {{
        [
            $(
                $crate::_count_exprs!($($number),*)
                $(, $number )*
            ),*
        ]
    }};
    ($($($number:literal)*),*) => { $crate::seq_of_seqs!($($($number)*),*,) }
}
1 Like

The simpler version will crash the compiler for long inputs. The clever version solves this problem, see the link in the comment preceding it.

Moving the comma inside the repetition was one of the things that I tried, but I did not manage to come up with the second part of the solution. Many thanks, this is exactly what I was looking for!

Good point, however here that’s not really needed, it seems. If we move the comma into the repetition on the right side, it continues to work just as well.

1 Like

Interesting hack to improve the recursion limit! I find it less readable, however, though declarative macros offer poor readability in general.

If you need many long run sequences in a file, I wonder if a procedural macro wouldn't be better for the performances of the compiler and of the code awareness in an editor. There's an initial cost, and it requires an extra crate for the code, but it may pay off.

I'm talking about the pattern. If you change the 2nd pattern of seq_of_seqs from ($($($number:literal)*,)*) to ($($($number:literal)*),*), you'll get

  • seq_of_seqs![,] == [0, 0] instead of [0]
  • seq_of_seqs![,,] == [0, 0, 0] instead of [0, 0]

and so on, with a single trailing comma producing a final 0.

I believe that moving the comma inside the parentheses in the code instantiation (on the right side) won't work, either, at least in seq_of_seqs. In the 2nd pattern, you'll insert an extra comma between the run count and the values, which is syntaxically incorrect, and in the last one, you'll generate an extra comma at the end, which will produce an extra 0 in some cases.

I expect that in practice there won’t be more than 10 elements in a sequence. But I can imagine a use case where another macro gets defined that internally uses the above one with much longer sequences. Overall, I prefer not to have artificial limits in my code, especially if the cost is as low as here: a well-documented complication of the internals of a hidden macro.

I might eventually have to deal with sequences of symbolic expressions instead of sequences of sequences of symbols. In that case I’ll probably want to represent the expressions using RPN, and I’d need a parens-expression-to-RPN compiler macro. It is well possible that I’ll have to fall back on procedural macros... I already have the book!

Hmm, I still do not see your point, but let’s not dwell on this, unless you care.

What I meant is that your version of seq_of_seqs can be rewritten as

#[macro_export]
macro_rules! seq_of_seqs {
    () => {{
        [0; 0]
    }};
    ($($($number:literal)*,)*) => {{
        [
            $(
                $crate::_count_exprs!($($number),*)
                $(, $number )*
            ,)*
        ]
    }};
    ($($($number:literal)*),*) => { $crate::seq_of_seqs!($($($number)*),*,) }
}

(the only change is replacing ),* by ,)* in the outer repetition of the second arm’s template). This will have the consequence that the macro will evaluate to an array literal with a trailing comma, but since Rust ignores these, I do not see any observable consequence.

Of course we cannot do the same replacement in the template of the third arm, because we would end up with double trailing commas which would change the meaning. But then in the third arm the outer repetition pattern is the same as the outer repetition template (namely both use $(...),*), while I understood your point as that it could be useful to have $(... ),* in the pattern and $(... ,)* in the template, or vice versa. I think that’s true, and it’s an interesting observation, but I do not see where it would be necessary here.

Note that you want expressions, you can also use commas to separate them and semicolons to separate the runs. I think it's a small change.

That book is very good; I've read it not long ago. :slight_smile:

It was just to clarify. I first explained that the pattern ($($($number:literal)*,)*) could be instantiated as ($($number),*), which eliminates a trailing comma. That's what solves your original problem, so I thought it was worth mentioning.

You replied it didn't matter and the comma could go inside on the right side. I see now you were talking about something else entirely, but so it was worth the clarification in case you need to modify the macro.

The thing to remember and which confused me at first with those macros is:

  • With ),*, the comma is managed automatically in the pattern (trailing commas are accepted but optional) or the instantiation (it won't produce a trailing comma).
  • With ,)*, the comma is mandatory—it's managed by the programmer: your example above will produce a trailing comma, and my pattern requires a trailing comma.

I could, but this is supposed to mimic mathematic notation for the library’s users (the sequences represent tensor products), so having a space as separator reads nicer, e.g. indices![a b, c]

Also, when I said that I might need expressions, I didn’t mean Rust expressions (i.e. the expr fragment type of macro_rules!), but rather symbolic expressions that might involve parens as opposed to the simple sequences of symbols that I have now. Perhaps something like indices![(a b)^5, c].

1 Like

Isn't it confusing to handle the trailing comma? Now if you want the final dimension to be empty you need to write dbg!(seq_of_seqs![-1 -2, , -3,,]);.

Thanks for your interest!

You are right that it could be confusing, but

  • the alternative of making all commas meanigful would have the consequence that seq_of_seqs![-1 -2,] would be different from seq_of_seqs![-1 -2], which would be even more confusing in a Rust context,

  • empty dimensions won’t be used much. They are there for completeness, like empty strings, or zero-dimensional tensors,

  • I have never written a single line of JavaScript, but I discovered that they use the same convention for their arrays (see link above), so this solution is not totally unheard of...

That's fair. It's confusing either way, which makes me wonder whether the lack of grouping is going to end up looking good.

I happen to have been working on some typed tensor indexing code and went down the rabbit hole of writing macros to implement destructuring, unzip, map and zipping of arrays in const contexts, and then how to write macros for my macros so it's easier to implement callbacks. I'm curious how things turn out for you!

Thanks for the pointers. Is your library public? (Mine will be, once there is more to show.)

No I'm still in the rabbit hole. I want to use tensors as the unit of computation in my micrograd implementation eventually. The typed multi-dimensional array view is here: micrograd-rs/src/view.rs at main · mickvangelderen/micrograd-rs · GitHub. The WIP for indexing is here Add strided_core · mickvangelderen/micrograd-rs@8339467 · GitHub. Very rough still and working on the tools to write this code instead of the code itself a.t.m.