Rewriting C++ Parser in Rust. Problem with understanding traits

To learn Rust, I want to rewrite an old project: a very simple C++ parser for arithmetic expressions (only multiplication and addition). It works as intended in C++. It isn't meant to evaluate the expression, just to recognize whether parentheses are mandatory or not.
Example: (5+1) * 2 stays (5+1) * 2, but (5 * 1) * 1 becomes 5 * 1 * 1

I started off by trying to write the equivalent of the header files in Rust. (I don't want to use bindgen.)

There's also ast.h/cpp, parser.h/cpp, but I wanted to start with the tokenizer.

This is the old C++ tokenizer header:

#ifndef __TOKENIZER__
#define __TOKENIZER__

#include <iostream>
#include <string>
#include <vector>

using namespace std;

typedef enum {
    EOS,           // End of string
    ZERO,
    ONE,
    TWO,
    OPEN,
    CLOSE,
    PLUS,
    MULT
} Token_t;

string showTok(Token_t t);

// Elementary tokenize(r) class
class Tokenize {
    string s;
    int pos;
public:
    Tokenize(string s) {
        this->s = s;
        pos = 0;
    }

    // Scan through string, letter (symbol) by letter.
    Token_t next();
    vector<Token_t> scan();
    string show();

};


// Wrapper class, provide the (current) token.
class Tokenizer : Tokenize {
public:
    Token_t token;
    Tokenizer(string s) : Tokenize(s) { token = next(); }
    void nextToken() {
        token = next();
    }
};

#endif // __TOKENIZER__

And its .cpp file, only for context:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

#include "tokenizer.h"


string showTok(Token_t t) {
    switch(t) {
        case EOS:   return "EOS";
        case ZERO:  return "ZERO";
        case ONE:   return "ONE";
        case TWO:   return "TWO";
        case OPEN:  return "OPEN";
        case CLOSE: return "CLOSE";
        case PLUS:  return "PLUS";
        case MULT:  return "MULT";
    }
    
}

Token_t Tokenize::next() {
    if(s.length() <= pos)
        return EOS;

    while(1) {

        if(s.length() <= pos)
            return EOS;

        switch(s[pos]) {
            case '0': pos++;
                return ZERO;
            case '1': pos++;
                return ONE;
            case '2': pos++;
                return TWO;
            case '(': pos++;
                return OPEN;
            case ')': pos++;
                return CLOSE;
            case '+': pos++;
                return PLUS;
            case '*': pos++;
                return MULT;
            default:  
                pos++;
                break;
        }
    }
} // next


vector<Token_t> Tokenize::scan() {
    vector<Token_t> v;
    Token_t t;

    do {
        t = next();
        v.push_back(t);
    }
    while(t != EOS);

    return v;
} // scan


string Tokenize::show() {
    vector<Token_t> v = this->scan();
    string s;

    for(int i=0; i < v.size(); i++) {
        s += showTok(v[i]);
        if(i+1 < v.size())
            s += ";" ;         //delimiter
    }
    return s;
} // show

And here's my attempt to rewrite the header in Rust:

enum Token_t {
    EndOfString,
    Zero,
    OneExp,
    TwoExp,
    OpenParan,
    CloseParan,
    Plus,
    Mult,
}

//fn showTok(Token_t: Token)-> string;

struct Tokenize{
    str: string,
    pos: int
}

trait Tokenize{
    fn tokenize(&self)->f64{
        self.str = str;
        self.pos = pos;
    }

    fn next()->Tokenize;
    fn scan()->vector<Tokenize>;
    fn show()-> string;
}


//Wrapper class, provide current Token

trait Tokenizer : Tokenize{
    

}

My understanding is that the Rust equivalent of inheritance works with traits. Is using them even the correct way to start? And self should reference the struct's fields through the dot operator (.); am I using it correctly?

No. Don't try to use Rust traits to implement traditional inheritance because it'll lead you down the wrong path and you'll end up getting frustrated because it doesn't work as you expect.

A trait is purely for defining interfaces and intended so people can use polymorphism to switch between implementations. The concept of header files doesn't exist in Rust and using traits to declare your tokenizer type's API will over-complicate things. You can just write out the Tokenizer struct and its methods directly.

Rust lets you say "if you want to implement trait X, then you must also implement trait Y", with a good example being the Copy trait. If you want to implement Copy (roughly the equivalent of C++'s "trivially copyable") then you must also implement Clone (roughly equivalent to C++'s copy constructor)... This is sometimes referred to as trait inheritance because it means that anything implementing trait X will also have the methods of trait Y, kinda like when interfaces inherit from each other in C#.
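As a minimal sketch of that "X requires Y" relationship: the trait Named and the type PlusSign below are hypothetical names invented for illustration, while Display, Clone and Copy come from the standard library.

```rust
use std::fmt;

// Hypothetical trait: anything that implements `Named` must also implement
// `Display`. `Display` is the supertrait here.
trait Named: fmt::Display {
    fn name(&self) -> String;

    // A default method can rely on the supertrait being available.
    fn describe(&self) -> String {
        format!("{} is printed as {}", self.name(), self)
    }
}

// Hypothetical type used only to demonstrate the trait.
struct PlusSign;

impl fmt::Display for PlusSign {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "+")
    }
}

// This impl compiles only because `PlusSign` also implements `Display`.
impl Named for PlusSign {
    fn name(&self) -> String {
        "Plus".to_string()
    }
}

// `Copy` has `Clone` as its supertrait, so the two are usually derived together.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Plus,
    Mult,
}
```

Removing the Display impl for PlusSign should make the Named impl fail to compile, which is the whole effect of the supertrait bound. No data or behaviour is inherited from a parent type.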

Think of self like C++'s this, except it may be a shared reference to this (i.e. &self), unique reference (&mut self), or you may be taking this by value (self).

The difference between Rust and C++ is that the this argument is explicit. That's because Rust methods are just syntactic sugar for free functions.

That means in the following snippet...

struct Tokenizer {
  current_position: usize,
}

impl Tokenizer {
  fn tokenize(&mut self, text: &str) -> Vec<Token> {
    self.current_position = 0;
    ...
  }
}

... When you write my_tokenizer.tokenize(input_string), it's the equivalent of writing Tokenizer::tokenize(&mut my_tokenizer, input_string).

You will hardly ever see the latter form (often called "uniform function call syntax"), but it can be useful if you want to avoid ambiguity (e.g. your type implements a trait which has a method with the same name as one of the type's methods) or if you want to pass a method around as a function.
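To make the three receiver forms and the two call syntaxes concrete, here is a small sketch. The Cursor type and its methods are hypothetical and only mirror the current_position field from the snippet above.

```rust
// Hypothetical type standing in for the tokenizer's position tracking.
struct Cursor {
    current_position: usize,
}

impl Cursor {
    // `&mut self`: unique reference, needed because the field is modified.
    fn advance(&mut self, by: usize) -> usize {
        self.current_position += by;
        self.current_position
    }

    // `&self`: shared reference, read-only access.
    fn position(&self) -> usize {
        self.current_position
    }

    // `self`: takes the value itself; the caller cannot use it afterwards.
    fn into_position(self) -> usize {
        self.current_position
    }
}
```

With this in place, c.advance(2) and Cursor::advance(&mut c, 2) are the same call, and Cursor::position can be stored in a variable of type fn(&Cursor) -> usize and passed around like any other function.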


You don't need to, and shouldn't, write any traits for this. Just have an enum Token and a struct Tokenizer, and implement the next function as an inherent method on Tokenizer:

impl Tokenizer {
    pub fn next_token(&mut self) -> Option<Token> {
        // code goes here
    }
}

You can also implement the Iterator trait for Tokenizer. This will give you the equivalent of your scan function for free as

let tokens: Vec<Token> = Tokenizer::new("1 + 2 * 3").collect();
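A rough sketch of what that could look like follows. The field names, the new constructor and the variant names are assumptions, and the end of the input is represented by None rather than by a dedicated EOS token. As in the C++ version, unrecognized characters are skipped.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Zero,
    One,
    Two,
    Open,
    Close,
    Plus,
    Mult,
}

struct Tokenizer {
    text: String,
    pos: usize, // byte offset of the next unread symbol
}

impl Tokenizer {
    fn new(text: &str) -> Self {
        Tokenizer { text: text.to_string(), pos: 0 }
    }
}

impl Iterator for Tokenizer {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Every recognized symbol is a single ASCII byte, so walking the
        // bytes is enough; anything else is skipped.
        while let Some(&byte) = self.text.as_bytes().get(self.pos) {
            self.pos += 1;
            let token = match byte {
                b'0' => Token::Zero,
                b'1' => Token::One,
                b'2' => Token::Two,
                b'(' => Token::Open,
                b')' => Token::Close,
                b'+' => Token::Plus,
                b'*' => Token::Mult,
                _ => continue, // unknown symbol: ignore and keep scanning
            };
            return Some(token);
        }
        None // end of input
    }
}
```

With this sketch, collecting Tokenizer::new("1 + 2 * 3") gives One, Plus, Two, Mult. The 3 is dropped because the original token set only knows the digits 0 to 2.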

Rust splits "storing data" and "abstracting over behavior" into two language constructs (struct/enum and trait, respectively). This is unlike C++, which uses class for both. In Rust, you only need to write your own trait when you have an interface that's implemented by multiple concrete types (struct, enum, primitive types) and you want to write functions that work for any data that implements that interface. Needing to write your own trait is relatively rare: struct and enum should be your bread and butter.


Another thing about Rust is that it enforces UTF-8 boundaries in its Strings but still indexes by byte offset (so that indexing takes constant time). That means, for example, that if you slice into the middle of an emoji with suffix = &s[pos..], the code will panic. As such, you may want to:

  • Confirm you're working with an ASCII String ahead of time so you don't have to worry about it
    • Or just filter all non-ASCII out ahead of time since you seem to ignore unrecognized inputs
  • Use chars and be mindful of their UTF-8 length to maintain position
  • Use chars and discard parsed input as you go instead of maintaining position
  • Work with Vec<char> instead
  • Work with Vec<u8> instead (but this is often unergonomic to output)
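
A few of these options can be sketched with standard library calls only. The helper names suffix_at, ascii_only and offsets are made up for illustration.

```rust
/// Non-panicking variant of `&s[pos..]`: returns `None` when `pos` is out of
/// range or falls in the middle of a multi-byte character.
fn suffix_at(s: &str, pos: usize) -> Option<&str> {
    s.get(pos..)
}

/// Drops every non-ASCII character up front, so that byte offsets and
/// character positions coincide afterwards.
fn ascii_only(s: &str) -> String {
    s.chars().filter(|c| c.is_ascii()).collect()
}

/// Pairs every character with the byte offset at which it starts, which is
/// one way to stay mindful of UTF-8 lengths while keeping a position.
fn offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For an input such as "1+é2", the é occupies two bytes, so offset 3 is not a character boundary: suffix_at returns None there, whereas plain slicing at that offset would panic.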

Thank you for all the answers so far!

Regarding the Tokenizer and Tokenize classes that I have in C++: does that mean the Tokenize class/struct is unnecessary, since all the corresponding functions (next, scan, show) are implemented directly in the Tokenizer?
So in Rust the C++ source file and header are merged because signatures don't need to be declared separately? Would it then be technically possible to write all the source code of the ast, parser, and tokenizer .cpp and .h files in one single .rs file? It sounds like the separation is not mandatory (though maybe good for readability).

And I have another question for this part:

impl Tokenizer {
 fn tokenize(&mut self, text: &str) -> Vec<Token_t> {
      self.current_position = 0;
    }
}

I now understand why &mut is necessary, since Rust's unique ownership requires mut to make variables changeable. But doesn't it make the whole struct changeable? Why is it necessary to have text: &str as a parameter and not:

struct Tokenizer {
    current_position: usize,
    text: string
  }

impl Tokenizer {
    fn tokenize(&mut self) -> Vec<Token_t> {
        self.current_position = 0;
        self.text = "";
      }
}

You can bring algorithms and high-level organization over from C++, but I would be cautious about trying to translate classes and inheritance hierarchies over 1-to-1. They're not a great fit because Rust isn't really an object-oriented language. Data and behavior aren't as tightly coupled as they are in C++.

Rust works best if you think of nouns as structs or enums and verbs as functions or traits. You'll want to avoid the whole "nounifying verbs" trope that's so common in C++ and Java. Naming a struct "Tokenizer" is suspect. Tokenizing is an action. It's not a thing.

Breaking out of the OO mindset means not always trying to shoehorn code into a class/struct/trait/impl. I'd probably go with a free function. It can help to treat your code like a library that will be used by other people and think about what the ideal external API would be. If I were a library user this would be a nice API entry point:

fn tokenize(expr: &str) -> Vec<Token>;

The return type suggests that there should be a struct or enum for tokens:

enum Token { ... }

Notice how the verb "tokenize" is a function and the noun "token" is an enum.

Where's current_position, you ask? It would be a local variable inside of tokenize. It wouldn't really need to be exposed to callers.
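
A possible body for that free function is sketched below. The variant names are assumptions based on the C++ enum, and the end of the input is simply the end of the returned Vec instead of an EOS token.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Zero,
    One,
    Two,
    Open,
    Close,
    Plus,
    Mult,
}

fn tokenize(expr: &str) -> Vec<Token> {
    let bytes = expr.as_bytes();
    let mut tokens = Vec::new();
    // The position is an implementation detail, so it stays a local variable.
    let mut current_position = 0;

    while current_position < bytes.len() {
        let token = match bytes[current_position] {
            b'0' => Some(Token::Zero),
            b'1' => Some(Token::One),
            b'2' => Some(Token::Two),
            b'(' => Some(Token::Open),
            b')' => Some(Token::Close),
            b'+' => Some(Token::Plus),
            b'*' => Some(Token::Mult),
            _ => None, // unrecognized symbols are skipped, as in the C++ code
        };
        if let Some(token) = token {
            tokens.push(token);
        }
        current_position += 1;
    }
    tokens
}
```

Callers only ever see &str going in and Vec<Token> coming out, so the scanning strategy can change later without touching any calling code.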


This would be a perfectly good design for small expressions. I wouldn't complicate it with on-demand tokenization. However, if you did want to take a streaming approach then you could upgrade this by switching to the Iterator-based approach @cole-miller described.

struct TokenIterator {
    expr: String,
    position: usize,
}

impl Iterator for TokenIterator {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        ...
    }
}

tokenize can remain a free function:

fn tokenize(expr: &str) -> impl Iterator<Item = Token> {
    // TokenIterator owns its input, so copy the borrowed &str into a String.
    TokenIterator { expr: expr.to_string(), position: 0 }
}

It can even do a bit of encapsulation: if it returns impl Iterator then TokenIterator doesn't need to be exposed in the public API. It could be a private struct that users don't even see.
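
Here is one way the encapsulation could look. This is a variant of the idea rather than the exact struct above: the iterator borrows the input through std::str::Chars instead of owning a String, and the module name lexer is hypothetical.

```rust
mod lexer {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum Token {
        Zero,
        One,
        Two,
        Open,
        Close,
        Plus,
        Mult,
    }

    // Private: code outside this module cannot name this type.
    struct TokenIterator<'a> {
        chars: std::str::Chars<'a>,
    }

    impl Iterator for TokenIterator<'_> {
        type Item = Token;

        fn next(&mut self) -> Option<Token> {
            // Consume characters until one maps to a token or the input ends.
            self.chars.find_map(|c| match c {
                '0' => Some(Token::Zero),
                '1' => Some(Token::One),
                '2' => Some(Token::Two),
                '(' => Some(Token::Open),
                ')' => Some(Token::Close),
                '+' => Some(Token::Plus),
                '*' => Some(Token::Mult),
                _ => None,
            })
        }
    }

    // Public entry point: callers only learn that they get "some iterator
    // of tokens" that borrows from `expr`.
    pub fn tokenize(expr: &str) -> impl Iterator<Item = Token> + '_ {
        TokenIterator { chars: expr.chars() }
    }
}
```

Because the iterator works on chars and discards input as it goes, it also sidesteps the byte-offset concerns for non-ASCII input.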

Right, Rust doesn't split code between headers and implementation files. The compiler takes care of checking that references to an item in another module or another crate have the correct type signature. And yes, you can get rid of the Tokenize trait.
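
To illustrate the single-file question, here is a small sketch with two inline modules in one .rs file. The module contents are deliberately tiny and hypothetical (parens_balanced is not part of the original parser); the point is only that parser uses items from tokenizer without any separate declaration.

```rust
mod tokenizer {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum Token {
        Open,
        Close,
        Other,
    }

    pub fn tokenize(expr: &str) -> Vec<Token> {
        expr.chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| match c {
                '(' => Token::Open,
                ')' => Token::Close,
                _ => Token::Other,
            })
            .collect()
    }
}

mod parser {
    // No header needed: the compiler checks this use of `tokenize` against
    // its definition in the sibling module.
    use super::tokenizer::{tokenize, Token};

    // Hypothetical helper: true when every parenthesis is properly matched.
    pub fn parens_balanced(expr: &str) -> bool {
        let mut depth: i32 = 0;
        for token in tokenize(expr) {
            match token {
                Token::Open => depth += 1,
                Token::Close => depth -= 1,
                Token::Other => {}
            }
            if depth < 0 {
                return false; // a ')' appeared before its '('
            }
        }
        depth == 0
    }
}
```

Each mod block could later be moved into its own file (tokenizer.rs, parser.rs) and declared with mod tokenizer; without changing the code inside it.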

I would also suggest looking at the implementation of the Rust compiler's lexer.
