How to capture all whitespace in proc_macro argument

Hello,
I'm trying to write some procedural macros that take Z80 assembly code as input.
I've been able to write such a macro when I provide the Z80 code in a string.
For example:

let listing = parse_assembly!(" ld a, 0 
     ld a, 0");

However, I run into difficulties when I write the Z80 code directly inside the macro invocation:

parse_assembly!( ld a, 0 
     ld a, 0    );

In this case, I am able to turn these tokens into a string (with TokenStream::to_string()) in order to process it. However, this is not the string I expect. What I obtain is:

 ld a, 0 ld a, 0

The line break has disappeared. Sadly, my Z80 parser needs this line break to be kept.

I do not really know where to look in order to generate the right string, with the line break included.

Is anybody able to guide me here?
Thanks

First you need to check which tokens the TokenStream contains;
it is possible that to_string makes some simplifications. What is the output of println!("{:?}", token_stream)?

I feel stupid for not having tested that first...
Anyway here is the result:

TokenStream [Ident { ident: "ld", span: #0 bytes(2104..2106) }, Ident { ident: "a", span: #0 bytes(2107..2108) }, Punct { ch: ',', spacing: Alone, span: #0 bytes(2108..2109) }, Literal { lit: Lit { kind: Integer, symbol: "0", suffix: None }, span: Span { lo: BytePos(2110), hi: BytePos(2111), ctxt: #0 } }, Ident { ident: "ld", span: #0 bytes(2118..2120) }, Ident { ident: "a", span: #0 bytes(2121..2122) }, Punct { ch: ',', spacing: Alone, span: #0 bytes(2122..2123) }, Literal { lit: Lit { kind: Integer, symbol: "0", suffix: None }, span: Span { lo: BytePos(2124), hi: BytePos(2125), ctxt: #0 } }]

So the whitespace does not appear explicitly.
However, thanks to this extract

span: Span { lo: BytePos(2110), hi: BytePos(2111), ctxt: #0 } }, Ident { ident: "ld", span: #0 bytes(2118..2120)

I see that there are 7 unused bytes between the two tokens (2111 to 2118). I know they are: space, line break, space, space, space, space, space.
Now I need to find a way to access them in order to check that. However, TokenStream does not seem to expose these bytes.
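The gap arithmetic above can be checked mechanically. The following is only a minimal sketch: the byte ranges are copied by hand from the Debug output, and the names gaps_between and ranges_from_debug_output are hypothetical helpers, not part of proc_macro.

```rust
/// Byte range of one token as printed in the Debug output (end is exclusive).
type ByteRange = (usize, usize);

/// Returns the number of source bytes skipped between each pair of
/// consecutive tokens.
fn gaps_between(tokens: &[ByteRange]) -> Vec<usize> {
    tokens
        .windows(2)
        .map(|pair| pair[1].0.saturating_sub(pair[0].1))
        .collect()
}

/// The eight byte ranges copied by hand from the Debug output above.
fn ranges_from_debug_output() -> Vec<ByteRange> {
    vec![
        (2104, 2106), // ld
        (2107, 2108), // a
        (2108, 2109), // ,
        (2110, 2111), // 0
        (2118, 2120), // ld
        (2121, 2122), // a
        (2122, 2123), // ,
        (2124, 2125), // 0
    ]
}
```

Calling gaps_between(&ranges_from_debug_output()) gives the gap sizes in order; the fourth entry is the 7-byte gap that holds the line break. This only tells how many bytes were skipped, not what they contain.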

I missed that you pass a string. In that case you can use syn to parse the input as a LitStr, and then call LitStr::value to get the content of the string.

Edit

Maybe you need to pass the string as a raw string.

Thanks for your answer. I think I was not clear enough, so let me copy and paste the two tests that are supposed to have the same behavior:

#[test]
fn test_macro_parse_assembly_several_instructions_b() {
    let listing = parse_assembly!(" ld a, 0 
     ld a, 0");

    assert_eq!(listing.len(), 2);
    assert_eq!(listing[0], Token::OpCode(Mnemonic::Ld, Some(DataAccess::Register8(Register8::A)), Some(DataAccess::Expression(0.into()))));
    assert_eq!(listing[1], Token::OpCode(Mnemonic::Ld, Some(DataAccess::Register8(Register8::A)), Some(DataAccess::Expression(0.into()))));

}

and

#[test]
/// does not pass yet :(
fn test_macro_parse_assembly_several_instructions_e() {
    let listing = parse_assembly!( ld a, 0 
     ld a, 0    );

    assert_eq!(listing.len(), 2);
    assert_eq!(listing[0], Token::OpCode(Mnemonic::Ld, Some(DataAccess::Register8(Register8::A)), Some(DataAccess::Expression(0.into()))));
    assert_eq!(listing[1], Token::OpCode(Mnemonic::Ld, Some(DataAccess::Register8(Register8::A)), Some(DataAccess::Expression(0.into()))));

}

I want to design my macro so that the first usage behaves like the second one. Maybe Rust forbids that (I guess it does).

Regarding your answer, I have tried to use:
let input: syn::LitStr = parse_macro_input!(item);
instead of
let str_listing = item.to_string();
where item is the input TokenStream of the macro.

With the old way (to_string()), I was able to rebuild a string, but it lacked the line breaks and was therefore wrong.
With the new way (LitStr), I get a compilation error saying that a literal is expected.

So the LitStr way is still not what I need. But it is better than my first try, and I'll keep it for now, as it forces the compiler to raise an error early instead of producing a wrong string.

TL;DR: you need to pass a string literal if whitespace is significant to you.
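As a small illustration of why the string literal works: the text of a literal reaches the macro untouched, line break included. The sketch below uses only std; instruction_lines is a hypothetical stand-in for the first step of a line-oriented parser, not the actual Z80 parser from the question.

```rust
/// Splits a listing into trimmed, non-empty lines, one per instruction.
/// (Hypothetical stand-in for the first step of a line-oriented parser.)
fn instruction_lines(listing: &str) -> Vec<&str> {
    listing
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// The listing from the question, written as a string literal:
/// the line break is part of the string value.
const LISTING: &str = " ld a, 0
     ld a, 0";

/// What TokenStream::to_string() produced for the unquoted form:
/// the line break is gone.
const FLATTENED: &str = " ld a, 0 ld a, 0";
```

With these definitions, instruction_lines(LISTING) yields two instructions, whereas instruction_lines(FLATTENED) yields a single line that a line-oriented parser cannot split reliably.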

Indeed, procedural macros work after the lexer pass, i.e., on a tokenized input; and this tokenization discards all [1] the whitespace information.

If you really want to support this feature, however, you can hack your way into achieving this, but it is gonna be painful:

  1. As you have noticed, tokens keep the span information to help generate better error messages (cf. the famous ^^^^ pointing to the problematic code)

    error: expected expression, found keyword `else`
     --> src/main.rs:3:14
      |
    3 |     for _ in else {}
      |              ^^^^ expected expression
    
  2. With this span information (mainly byte offsets + associated source file) you could open and read that source file to get, again, the raw code given to the macro.

    This is incredibly dirty and hacky, because:

    • it will get everything, including comments,

    • it may be affected by a race condition on the contents of the source file,

    • the fact that procedural macros have "unlimited" access to the filesystem is an unfortunate side-effect of their current implementation, and could very well stop being the case in the future (there is an idea of encapsulating the code of the macro being run within some virtual machine, such as WASI),

    • code formatting could alter the input to the macro.
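To make step 2 concrete, here is a minimal sketch of the slicing part only. It assumes that the content of the source file and the byte offsets of the first and last tokens are already available; how to obtain them from a Span inside a real procedural macro is deliberately left out, and raw_macro_input is a hypothetical helper name.

```rust
/// Returns the raw source text covered by the tokens, from the start of the
/// first token to the end of the last one, whitespace and comments included.
///
/// `source` is the full content of the source file and `spans` holds the
/// (start, end) byte offsets of the tokens, in order, with `end` exclusive.
fn raw_macro_input<'a>(source: &'a str, spans: &[(usize, usize)]) -> Option<&'a str> {
    let start = spans.first()?.0;
    let end = spans.last()?.1;
    // `str::get` returns None instead of panicking when the offsets are out
    // of range or not on a character boundary, for example if the file on
    // disk no longer matches what the compiler read.
    source.get(start..end)
}
```

For a file containing the invocation from the question, the returned slice would start at the first ld and end at the last 0, with the line break and the indentation in between. All the caveats listed above still apply.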


[1] It actually keeps a tiny bit of information, for punctuation only: enough to distinguish, for instance, => from =·> (where · stands for a space), although =·> and =··> are still identical.

Thanks for this answer.
So I'll stick to the string literal.
