Grammar rules in pest (a PEG parser generator)


#1

Can someone shed some light on the grammar rules of pest (a PEG parser generator), or on what I’m doing wrong in this simple example? I am trying to translate some bison/flex rules. One of them is:

NUMBER [-+]?([0-9]+|(([0-9]+\.[0-9]*)|(\.[0-9]+)))([eE][-+]?[0-9]+)?

Maybe I’m too tired to see my mistake, but I translated that to:

impl_rdp! {                                                                                             
    grammar! {                                                                                          
        // IDENT [a-zA-Z_][a-zA-Z_0-9]*                                                                 
        ident =  { ['a'..'z'] | ['A'..'Z'] | ["_"] ~                                                    
                     (['a'..'z'] | ['A'..'Z'] | ["_"] | ['0'..'9'])* }                                  
        // NUMBER [-+]?                                                                                 
        //        ([0-9]+|                                                                              
        //         (                                                                                    
        //          ([0-9]+\.[0-9]*)|                                                                   
        //          (\.[0-9]+)                                                                          
        //         )                                                                                    
        //        )                                                                                     
        //        ([eE][-+]?[0-9]+)?                                                                    
        number = {                                                                                      
            (["-"] | ["+"])? ~                                                                          
                (['0'..'9']+ |                                                                          
                 (                                                                                      
                     (['0'..'9']+ ~ ["."] ~ ['0'..'9']*) |                                              
                     (["."] ~ ['0'..'9']+)                                                              
                 )                                                                                      
                ) ~                                                                                     
                (["e"] | ["E"] ~ (["-"] | ["+"])? ~ ['0'..'9']+)?                                       
        }                                                                                               
    }                                                                                                   
}                                                                                                       

The full example code is here:

In the example I read the string to parse from a file. The rule matches -.00123456789 but not, for example, -0.0123456789:

./target/release/examples/pest_test -i assets/scenes/pest_test.pbrt
FILE = assets/scenes/pest_test.pbrt
[Token { rule: number, start: 0, end: 13 }]

vs.

./target/release/examples/pest_test -i assets/scenes/pest_test.pbrt
FILE = assets/scenes/pest_test.pbrt
thread 'main' panicked at 'assertion failed: parser.end()', examples/pest_test.rs:89
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Are the brackets a problem? The examples I found were pretty simple and worked. Do I have to split into several rules?


#2

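I think the problem is the order of the alternatives in number. For the input -0.01... the first branch, ['0'..'9']+, is chosen; it matches only the leading 0 and does not include the decimal point. As in most regex implementations, | is not required to try all branches and find the longest match: it takes the first branch that succeeds, and the rest of the input is then left unparsed. Reordering the branches so that the ones containing a decimal point come first should work.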
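To make the ordered-choice behaviour concrete, here is a minimal hand-written sketch in plain Rust. It does not use pest at all; the helper names (digits1, dot_frac, number_digits_first, and so on) are made up for illustration, and Option::or_else stands in for the | operator. Each function returns the position where the match ended, or None.

```rust
/// One or more ASCII digits starting at `pos`; returns the end position.
fn digits1(s: &[u8], pos: usize) -> Option<usize> {
    let mut end = pos;
    while end < s.len() && s[end].is_ascii_digit() {
        end += 1;
    }
    if end > pos { Some(end) } else { None }
}

/// Zero or more ASCII digits; never fails.
fn digits0(s: &[u8], pos: usize) -> usize {
    digits1(s, pos).unwrap_or(pos)
}

/// Exactly the byte `c` at `pos`.
fn byte(s: &[u8], pos: usize, c: u8) -> Option<usize> {
    if s.get(pos) == Some(&c) { Some(pos + 1) } else { None }
}

/// Optional '+' or '-'; never fails.
fn optional_sign(s: &[u8], pos: usize) -> usize {
    match s.get(pos).copied() {
        Some(b'+') | Some(b'-') => pos + 1,
        _ => pos,
    }
}

/// digits+ "." digits*
fn int_dot_frac(s: &[u8], pos: usize) -> Option<usize> {
    let p = digits1(s, pos)?;
    let p = byte(s, p, b'.')?;
    Some(digits0(s, p))
}

/// "." digits+
fn dot_frac(s: &[u8], pos: usize) -> Option<usize> {
    let p = byte(s, pos, b'.')?;
    digits1(s, p)
}

/// Original order: plain digits are tried first, so "0.01" stops after "0".
fn number_digits_first(input: &str) -> Option<usize> {
    let s = input.as_bytes();
    let p = optional_sign(s, 0);
    digits1(s, p)
        .or_else(|| int_dot_frac(s, p))
        .or_else(|| dot_frac(s, p))
}

/// Reordered: branches containing a decimal point are tried first.
fn number_longest_first(input: &str) -> Option<usize> {
    let s = input.as_bytes();
    let p = optional_sign(s, 0);
    dot_frac(s, p)
        .or_else(|| int_dot_frac(s, p))
        .or_else(|| digits1(s, p))
}
```

With this sketch, number_digits_first("-0.0123456789") returns Some(2): the sign and the leading 0 are consumed and the rest of the input is left over, which is the situation in which an end-of-input check fails. number_longest_first returns Some(13) for the same input.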


#3

Hi @birkenfeld, thank you for your answer. Reordering works:

        // NUMBER [-+]?([0-9]+|(([0-9]+\.[0-9]*)|(\.[0-9]+)))([eE][-+]?[0-9]+)?                         
        number = {                                                                                      
            (["-"] | ["+"])? ~ // optional sign, followed by                                            
            (                                                                                           
                (                                                                                       
                    (["."] ~ ['0'..'9']+) // dot and digits                                             
                        | // or                                                                         
                    (['0'..'9']+ ~ ["."] ~ ['0'..'9']*) // digits, dot, and (optional digits)           
                )                                                                                       
                    | // or                                                                             
                ['0'..'9']+ // just digits                                                              
            ) ~ ( // followed by (optional)                                                             
                (["e"] | ["E"]) ~ // 'e' or 'E', followed by                                            
                (["-"] | ["+"])? ~ // optional sign, followed by                                        
                ['0'..'9']+ // digits                                                                   
            )?                                                                                          
        }                                                                                               

For the exponent “e” or “E” I had to use brackets as well; otherwise one of the two worked, but not the other.
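A likely explanation, assuming that ~ binds more tightly than | in this version of pest (which is consistent with the behaviour described, but not something confirmed in this thread): without brackets, ["e"] | ["E"] ~ sign ~ digits is read as ["e"] | (["E"] ~ sign ~ digits). The two groupings can be compared in a small hand-written sketch in plain Rust; it does not use pest, and all function names are made up for illustration.

```rust
/// One or more ASCII digits starting at `pos`; returns the end position.
fn digits1(s: &[u8], pos: usize) -> Option<usize> {
    let mut end = pos;
    while end < s.len() && s[end].is_ascii_digit() {
        end += 1;
    }
    if end > pos { Some(end) } else { None }
}

/// Exactly the byte `c` at `pos`.
fn byte(s: &[u8], pos: usize, c: u8) -> Option<usize> {
    if s.get(pos) == Some(&c) { Some(pos + 1) } else { None }
}

/// Optional sign followed by one or more digits.
fn signed_digits(s: &[u8], pos: usize) -> Option<usize> {
    let p = match s.get(pos).copied() {
        Some(b'+') | Some(b'-') => pos + 1,
        _ => pos,
    };
    digits1(s, p)
}

/// Assumed reading without brackets: "e" | ("E" ~ sign? ~ digits+).
/// A lone "e" already satisfies the first branch.
fn exponent_ungrouped(input: &str) -> Option<usize> {
    let s = input.as_bytes();
    byte(s, 0, b'e')
        .or_else(|| byte(s, 0, b'E').and_then(|p| signed_digits(s, p)))
}

/// With brackets: ("e" | "E") ~ sign? ~ digits+.
fn exponent_grouped(input: &str) -> Option<usize> {
    let s = input.as_bytes();
    let p = byte(s, 0, b'e').or_else(|| byte(s, 0, b'E'))?;
    signed_digits(s, p)
}
```

Under that assumption, exponent_ungrouped("E-5") consumes all three characters, while exponent_ungrouped("e-5") stops after the e and leaves -5 unparsed. exponent_grouped consumes all three characters in both cases.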