How to "unquote" a string literal

FedericoStra · September 16, 2022, 10:31pm

Hello!

TL;DR

What's the best way to convert "r#\"foo\\\"#" to "foo\\"?

I'm quite new to procedural macros (I started coding one today) and I have a question which is probably silly and obvious but I can't seem to find a way to solve it easily.

What I'd like to do is to "unquote" string literals, i.e. obtain their contents and not their string representation. Let me explain.

Let's say that I want to code a function-like procedural macro that takes a string literal and substitutes it with a new string literal obtained from it, for instance by adding * both at the beginning and at the end. I want it to accept both string literals and raw string literals.

Here are a few examples of desired expansions:

my_macro!("foo")      // => "*foo*"
my_macro!(r"foo")     // => r"*foo*"
my_macro!(r#"foo"#)   // => r#"*foo*"#
my_macro!("foo\\")    // => "*foo\\*"
my_macro!(r"foo\")    // => r"*foo\*"
my_macro!(r#"foo\"#)  // => r#"*foo\*"#

Let me add as a further constraint that I'd prefer to use only proc_macro and not other dependencies such as proc_macro2, syn, quote...

Disregarding any error management, the naive solution seems to be the following code:

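use proc_macro::{TokenStream, TokenTree};

#[proc_macro]
pub fn my_macro(tokens: TokenStream) -> TokenStream {
    // Take the first token and require it to be a literal.
    let literal = match tokens.into_iter().next() {
        Some(TokenTree::Literal(l)) => l,
        _ => panic!(),
    };
    // `Display` yields the literal as written in the source, quotes included.
    let new_string = format!("*{}*", literal);
    let new_literal = proc_macro::Literal::string(&new_string);
    TokenTree::Literal(new_literal).into()
}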

From the documentation of proc_macro::Literal, the only way to obtain a representation of the specific literal seems to be to use its impl Display for Literal, either via formatting (as I did in the example) or via impl ToString for Literal.

However, the string that we get with these methods is not the contents of the literal that was passed in, but rather a string representation of the literal itself as it appears in the source code. That is, if "foo" was present in the source code, then the string that we get is "\"foo\"". More explicitly, running the expansions above leads to:

my_macro!("foo\\")    // => "*\"foo\\\\\"*"
my_macro!(r"foo\")    // => "*r\"foo\\\"*"
my_macro!(r#"foo\"#)  // => "*r#\"foo\\\"#*"
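Since the desired expansions listed earlier keep each literal's own quoting style, one workaround that avoids unescaping altogether is to edit the source representation textually. The following is only a sketch under that assumption: `wrap_with_stars` is a hypothetical helper, it accepts only plain and raw string literals without suffixes, and it has not been checked against every literal form the compiler accepts.

```rust
/// Hypothetical helper: inserts `*` just inside the quotes of a string
/// literal's source text, leaving the `r`/`#` prefix and any escapes untouched.
/// Returns `None` for anything that is not a plain or raw string literal.
fn wrap_with_stars(repr: &str) -> Option<String> {
    let open = repr.find('"')?;
    let close = repr.rfind('"')?;
    if open == close {
        // Only one quote character, e.g. the char literal '"'.
        return None;
    }
    let (prefix, suffix) = (&repr[..open], &repr[close + 1..]);
    // Plain literal: nothing before or after the quotes.
    let is_plain = prefix.is_empty() && suffix.is_empty();
    // Raw literal: `r`, then n `#`s, and the same n `#`s after the closing quote.
    let is_raw =
        prefix.strip_prefix('r') == Some(suffix) && suffix.bytes().all(|b| b == b'#');
    if !(is_plain || is_raw) {
        return None;
    }
    let body = &repr[open + 1..close];
    Some(format!("{prefix}\"*{body}*\"{suffix}"))
}
```

Inside the macro, the result could then be turned back into tokens with `wrap_with_stars(&literal.to_string()).unwrap().parse::<TokenStream>().unwrap()`, since `TokenStream` implements `FromStr`.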

If I instrument the procedural macro with some debug printing,

#[proc_macro]
pub fn my_macro(tokens: TokenStream) -> TokenStream {
    let literal = match tokens.into_iter().next() {
        Some(TokenTree::Literal(l)) => l,
        _ => panic!(),
    };
    eprintln!("{:#?}", literal);
    eprintln!("{s:?}  --- which represents -->  {s}", s = literal.to_string());
    let new_string = format!("*{}*", literal);
    let new_literal = proc_macro::Literal::string(&new_string);
    TokenTree::Literal(new_literal).into()
}

during the macro expansion I can see that the proc_macro::Literal has some interesting private fields:

kind: it is Str for a simple string literal ("..."), or StrRaw(n) for a raw string literal (r#"..."#) with n number signs (#) on each side;
symbol: this seems to represent the actual characters in the original source that are contained between the quotation marks (if the source is "foo\\", the symbol is "foo\\\\");

however I cannot find a way to access either of the two, which presumably could be helpful.

I might be missing something obvious, but it seems to me that the proc_macro's public API lacks some sort of functionality to access the contents of the literals that it parses.

Finally...

Question

What is the best way to obtain the contents of a string literal that is passed to a procedural macro, without re-implementing from scratch the parsing of string literals?

semicoleon · September 16, 2022, 10:47pm

syn appears to just use the Display impl on Literal and parses the output

You can see the parsing starts here

pub fn value(&self) -> String {
    let repr = self.repr.token.to_string();
    let (value, _suffix) = value::parse_lit_str(&repr);
    String::from(value)
}

And continues in the value module later in the file.

I think you're probably going to have to do something similar, which I agree is unfortunate.
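To give an idea of what "something similar" might involve, here is a minimal sketch that recovers the value from the output of `Literal::to_string()`. It is not syn's implementation: `unquote` is a hypothetical helper, and it deliberately gives up on `\x..`, `\u{..}`, line continuations, byte strings and suffixes, which a complete parser would need to handle.

```rust
/// Hypothetical helper: recovers the value of a string literal from its
/// source representation. Returns `None` for anything this sketch does
/// not understand.
fn unquote(repr: &str) -> Option<String> {
    if let Some(rest) = repr.strip_prefix('r') {
        // Raw string: r"..." or r#"..."# with the same number of `#` on each side.
        let after_hashes = rest.trim_start_matches('#');
        let hashes = &rest[..rest.len() - after_hashes.len()];
        let body = after_hashes
            .strip_prefix('"')?
            .strip_suffix(hashes)?
            .strip_suffix('"')?;
        // No escapes exist in raw strings, so the body is the value.
        return Some(body.to_owned());
    }
    // Ordinary string: strip the quotes, then decode the simple escapes.
    let body = repr.strip_prefix('"')?.strip_suffix('"')?;
    let mut value = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            value.push(c);
            continue;
        }
        value.push(match chars.next()? {
            'n' => '\n',
            't' => '\t',
            'r' => '\r',
            '0' => '\0',
            '\\' => '\\',
            '"' => '"',
            '\'' => '\'',
            // `\x..`, `\u{..}` and line continuations are left out of this sketch.
            _ => return None,
        });
    }
    Some(value)
}
```

With this, both `unquote(r##"r#"foo\"#"##)` and `unquote(r#""foo\\""#)` give the four characters `foo\`, which is the conversion asked for in the TL;DR.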

H2CO3 · September 17, 2022, 4:53am

I don't understand why, though. syn was made exactly so that you don't have to reimplement this functionality yourself. You should really just use it for parsing non-trivial proc-macro input.

simonbuchan · September 17, 2022, 5:22am

Well it adds several seconds to a clean build that generally can't be parallelised, so if it's pretty trivial to avoid using, sure...

H2CO3 · September 17, 2022, 5:24am

Parsing Rust string literals is full of alternatives and edge cases, so I wouldn't classify it as "trivial". At least you could copy over the parsing code from syn, but I definitely don't think you should rewrite it manually.

FedericoStra · September 17, 2022, 11:15am

Ok, at least I didn't miss anything then.

So, just to be sure, are you saying that proc_macro alone does not expose in its public API any way at all to access the value represented by a literal, and users have to parse the literal themselves, either by hand or relying on syn for instance?

This was my impression while exploring the proc_macro documentation, but it seemed too strange to believe.

FedericoStra · September 17, 2022, 11:19am

The reason is not that I'm crazy

I was trying to contribute to a crate whose author expressly asked to avoid syn if possible. I personally don't have any problem with dumping a quintillion crates in [dependencies] if it makes coding easier (and especially more correct).

I 100% think I should not re-implement string literal parsing myself.

After a bit more search I came across the crate litrs which does precisely that, while claiming to be a much lighter dependency than syn.

H2CO3 · September 17, 2022, 11:35am

Ah, I see. That's a good reason indeed.

semicoleon · September 17, 2022, 4:53pm

That does appear to be the case, yes

It makes some sense given how proc macros work, and it gives macros the maximum amount of information. But it sure would be nice to be able to just get the "value" of the literal!

VorfeedCanal · September 17, 2022, 6:19pm

I always find such cases fascinating. Just why do people expect the compiler to do a lot of work which clearly belongs to later stages?

I understand that it's a less extreme case than some other people's expectations (some even expected functions defined in the file to be available to proc_macros at expansion time).

But even then… at some point after the construction of the AST, but before the compiler begins constructing its semantic understanding of the program, it will expand all macros… the actual value of some literal is clearly part of the semantic understanding of the program, so why would it suddenly arrive at a different stage of processing?

kpreid · September 17, 2022, 8:10pm

The algorithm to find the end of a string literal is nearly identical to the algorithm to fully parse it to its value. Thus, needing to traverse it a second time with an independent implementation of the algorithm can seem like bad engineering.

That's why, in this case, I think.

VorfeedCanal · September 17, 2022, 8:54pm

I don't see how if you see \ then add two bytes, if you see " then stop, otherwise add one byte can be compared to full parse of literal. That's for normal literals. Raw literals have even less similarity: turning then into a string is trivial while turning them into a string is very easy.
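The scanning rule for ordinary literals described here can be sketched roughly as follows. `string_literal_len` is a hypothetical name, and the function only finds where the literal ends; it does not validate or decode any escape.

```rust
/// Hypothetical helper: given source text that starts with `"`, returns the
/// byte length of the string literal, both quotes included.
fn string_literal_len(src: &str) -> Option<usize> {
    let bytes = src.as_bytes();
    if bytes.first() != Some(&b'"') {
        return None;
    }
    let mut i = 1;
    while i < bytes.len() {
        match bytes[i] {
            // A backslash hides whatever byte follows it: skip both.
            b'\\' => i += 2,
            // An unescaped quote ends the literal.
            b'"' => return Some(i + 1),
            // Anything else is ordinary content.
            _ => i += 1,
        }
    }
    // Ran off the end of the input without finding the closing quote.
    None
}
```

Working on bytes is fine here because `\` and `"` are ASCII and never occur inside a multi-byte UTF-8 sequence.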

Also: proc macro were deliberately restricted to ensure they wouldn't expose too much of rustc internals. Adding something which belongs to a different layer to this interface would have looked strange.

Of course not everything about Rust is logical and I guess I wouldn't have been too surprised to see such function… but I would have classified that as “a strange wart like Range which is Iterator” and not as something natural and obvious.

FedericoStra · September 17, 2022, 10:52pm

As I mentioned, this is my first experience with procedural macros in Rust (but not macros in general). I was probably influenced by Julia, where one has

julia> Base.dump(:([3, 3.14, "pi"]))
Expr
  head: Symbol vect
  args: Array{Any}((3,))
    1: Int64 3
    2: Float64 3.14
    3: String "pi"

julia> :([3, 3.14, "pi"]).args[3]
"pi"

julia> :([3, 3.14, "pi"]).args[3] |> typeof
String

Also, I don't think I'm the first crazy person on Earth with this expectation: recently I managed to find this question on StackOverflow (which incidentally introduced me to the crate litrs as a lighter replacement of syn for this specific task).

I believe it's very legitimate to expect the compiler to expose the way it's going to interpret the literals. It is the one in charge of determining what things mean, and it'd be better if this functionality were implemented only once for everybody to use, instead of having to rely on several independent implementations possibly different from each other and, more seriously, from the compiler's one.

Anyway, it's futile to debate whether the current arrangement of things is more correct or not. Just don't be so shocked if people have various other reasonable expectations.

On a side note, may I ask what's strange/wrong about Range being Iterator?

Cerber-Ursi · September 18, 2022, 3:46am

Something's off with this phrase - looks like you've repeated the same thing twice, was this intentional or a typo?

simonbuchan · September 18, 2022, 8:09am

I am slightly surprised proc_macro doesn't have at least the match-based macro syntax kinds available, if only for skipping / transcribing and not introspection. Having to parse all possible type syntax in trait bounds just because you're trying to skip past the generic parameter defaults or where clause for some struct is a bit annoying. It's especially annoying for item attribute macros, because rust just parsed this all so it could figure out what to pass to me!

VorfeedCanal · September 18, 2022, 8:59am

Typo. Cutting a raw string literal out of a sequence of bytes is not trivial, but if you know where it begins and ends, turning it into a string is trivial.
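As a rough illustration of the "cutting" step for raw literals, the sketch below counts the opening `#`s and then searches for a quote followed by the same number of `#`s. `raw_string_literal_len` is a hypothetical name, and the sketch ignores details such as byte-string prefixes.

```rust
/// Hypothetical helper: given source text that starts with a raw string
/// literal (`r"..."`, `r#"..."#`, ...), returns the literal's byte length.
fn raw_string_literal_len(src: &str) -> Option<usize> {
    let rest = src.strip_prefix('r')?;
    // Count the `#`s between `r` and the opening quote.
    let after_hashes = rest.trim_start_matches('#');
    let n = rest.len() - after_hashes.len();
    let body = after_hashes.strip_prefix('"')?;
    // The literal ends at the first quote followed by exactly the same `#`s.
    let terminator = format!("\"{}", "#".repeat(n));
    let end = body.find(&terminator)?;
    // `r` + hashes + opening quote + content + terminator
    Some(1 + n + 1 + end + terminator.len())
}
```

Once the length is known, the value is simply the slice between the opening quote and the terminator, with no escape processing at all.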

Before Rust 1.0 there were no proc_macros. Instead Rust had syntax extensions which were tied to the compiler internals. These were much easier to create, but because they used unstable compiler internals they couldn't promise API stability.

When proc_macros were created, the API was reduced to the bare minimum but promised to stay stable. The decision to move higher-level processing into the separate syn crate is perfectly obvious in that context.

It's one of the few obvious mistakes in Rust design. Range implements the Iterator trait instead of IntoIterator. That means that you couldn't do something like this:

You need to use clone because alternative is even worse.
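The snippet this refers to is not shown in the thread as captured here. A minimal sketch of the situation, with the hypothetical function `sum_twice`, might look like this:

```rust
use std::ops::Range;

/// Hypothetical example: iterates over the same range twice.
fn sum_twice(range: Range<u32>) -> u32 {
    let mut total = 0;
    // `Range` is itself an `Iterator` and is not `Copy`, so a plain
    // `for i in range` here would move it and make the second loop fail
    // to compile. Cloning keeps the original available.
    for i in range.clone() {
        total += i;
    }
    for i in range {
        total += i;
    }
    total
}
```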

This is not the end of the world, and compared to the warts of many other languages this is a mild issue, but it's still a bad design, as we now understand it.

Yandros · September 18, 2022, 11:38am

For reference:

FedericoStra · September 18, 2022, 2:56pm

That makes a ton of sense indeed.

I don't want to be overly critical of proc_macro. After all, I believe that I'm going to love 95% of Rust proc-macros programming just as I love 95% of regular Rust programming. But, just for the sake of it, let me play devil's advocate against the proc_macro API in an exaggerated way.

Which guarantees do I have that syn::LitFloat::parse("3.14") and litrs::FloatLit::parse("3.14") produce the exact same value as it is interpreted by rustc itself?

Why is this basic functionality of interpreting the value represented by some source code not exposed by the compiler of a language that features meta-programming?

If the proc_macro interface is limited to a textual manipulation of the source without disclosing the nature of it, it might as well just give us a string of the whole text that is passed as an argument to the macro and we pipe it through some monstrous sed script.

Given how intrepid and resolute Rust is about memory efficiency, another thing I don't understand about the proc_macro API is why the only way to access the textual representation of the literals is by calling to_string (which allocates a String) instead of having a method which returns a &str or Cow<'_, str> referencing the source code text.

This aspect of the API gives the impression of having been hastily put together while trying to avoid fighting the borrow checker at all costs. I'm curious to know whether these allocations have any impact on the compilation time of macro heavy code.

H2CO3 · September 18, 2022, 4:37pm

It's not a mistake, it's intentional. It's done so that using it with iterator adaptors is lighter-weight. Eg. you can do (x..y).map(…) without an intermediate into_iter(). Of course, this makes iterating twice more difficult, but this is a trade-off, and a choice was made. It's not at all "obvious" that making ranges IntoIterator instead of Iterator would have been better.
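A small sketch of the lighter-weight usage meant here, with the hypothetical function `squares`:

```rust
/// Hypothetical example: squares of every number in `x..y`.
fn squares(x: u32, y: u32) -> Vec<u32> {
    // No `.into_iter()` needed: `Range<u32>` is itself an `Iterator`,
    // so adaptors such as `map` can be called on it directly.
    (x..y).map(|n| n * n).collect()
}
```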

VorfeedCanal · September 18, 2022, 4:58pm

There are no guarantees and that's precisely the point.

To be able to change internals at some later point, obviously.

You can always use the exact version of syn or litrs that you want. You are not supposed to depend on some fixed version of the Rust compiler.

This way proc macro can not be properly hygienic.

I wasn't there when proc macro design was made, but I know it was purposefully made as simple and limited as possible to ensure they could be supported for years without the need to freeze compiler internals.

Nothing stops one from adding map function without turning Range into iterator. And yes, it's obvious someone was thinking about whether it's better to have Range as iterator or not.

Judging from the number of questions, it's now obvious that the decision was bad, but I can easily see why it looked like a good idea when it was made.
