To learn more about Rust and brush up on my C skills, I have tasked myself with converting a C library into Rust. Since this is a fair amount of work, I wanted to get help and opinions from more experienced users early in the process, so I don't make common mistakes and can save myself some effort. The project itself isn't too complex in my eyes; it just has a few nasty pitfalls I'd like to avoid.
To avoid pasting a whole chunk of code into this post, I created a gist containing the original code and my two attempts at converting it. Since the original C code was split across multiple files, I merged it and added liberal comments with my understanding of the logic. The lexer_v1.rs file is my first take, translating it as verbatim as possible in all its completely unsafe glory. The lexer_v2.rs file is my iteration on it, replacing raw pointers like *mut u32 with &[u32] (as far as I understand, *mut and *const refer to the pointer, not the data, right?).
Now what do I want to know?
Well, first and foremost: have I done anything wrong that goes beyond simply mistranslating the code, even in places where the underlying idea was correct?
Second: is there anything I missed where I could make the code safer? The thing that jumps to mind is the two *_in_the_middle functions on lines 133 to 149 of the lexer_v2 file. There I take the pointer to a u32 array, cast it to a u8 array, index it by shifting the pointer, and then cast it back to u32. I know this walks the array in byte steps rather than dword steps, which is required here because we index the predefined token tables, but do I really need unsafe pointer casting?
If you need anything else I'd be happy to provide it
Not sure what you mean by this. *mut Foo is a pointer to a mutable Foo, and *const Foo is a pointer to an immutable Foo.
Indeed, that is incorrect. char in Rust is not u8, it's a full Unicode code point, so it's 4 bytes wide. But if you had used u8, that would have been equally incorrect (undefined behavior in fact): the pointer would then be unaligned, and it's not allowed to read from an unaligned pointer just like that.
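To make the two points above concrete, here is a minimal sketch (not from the gist) that checks the width of Rust's char and the alignment requirement that makes arbitrary byte-offset u32 reads unsound:

```rust
fn main() {
    // In Rust, `char` is a full Unicode scalar value, 4 bytes wide,
    // unlike C's 1-byte `char`.
    assert_eq!(std::mem::size_of::<char>(), 4);
    assert_eq!(std::mem::size_of::<u8>(), 1);

    // A `u32` read is only sound through a 4-byte-aligned pointer;
    // a pointer obtained by stepping a `*const u8` by one byte and
    // casting it back to `*const u32` generally violates this.
    assert_eq!(std::mem::align_of::<u32>(), 4);
    assert_eq!(std::mem::align_of::<u8>(), 1);
}
```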
I'm not sure why you need this kind of indexing, but it seems fishy. If you need an array that is indexable at the byte granularity, then use [u8], not [u32]. Normally, for a simple algorithm such as a lexer, you shouldn't need to use, and it's hard to justify, any unsafe.
I was thinking along the lines of *const Foo equalling Foo* const in C rather than const Foo*, which are of course a const pointer to a mutable object and a mutable pointer to a const object, respectively.
Yeah, you're right, that should have been a u8 and not a char. I completely forgot that in Rust a char is not simply ASCII.
That's pretty much along my line of thinking. The token tables are either uint32_t or uint16_t. From what I could gather of the original code, the author takes the current character from the source file and looks up (in the equiv_table) which token the character maps to (stored in the trans_table). These are then stored in the token buffer.
Why does he do that? I think it's because that way he can map multiple characters to a single token index and thus keep the tables smaller, but I'm not entirely sure. That's also something I want to uncover further while translating the code.
Oh! Another thing I couldn't find a solution for. In C I can do this: #define ulex_token(name) Tok_ ## name. Is there any equivalent in Rust? I'd guess macros, right?
In my experience, you absolutely need to have a good test coverage of the code you're translating. It's very easy to introduce regressions.
Use c2rust.com or citrus to convert the syntax automatically. Keep the C files and the ABI the same, so that you can translate one function at a time. Don't refactor things until they're all in Rust, and have tests.
You can do this with macros, yes (though note that concat!() only produces string literals; pasting identifiers together needs something like the paste crate or the unstable concat_idents!), but it's not idiomatic. That list of tokens should be an enum named Token, with variant names that don't repeat the Tok_ prefix. Rust has proper namespaces.
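A quick sketch of the enum approach (the variant names here are made up, since the real token list isn't in this thread):

```rust
// The enum's name provides the namespace, so the Tok_ prefix
// from the C code is unnecessary on the variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Error,
    Identifier,
    Keyword,
    Eof,
}

fn main() {
    // Where the C macro ulex_token(Error) expands to Tok_Error,
    // in Rust you just write the path directly:
    let t = Token::Error;
    assert_eq!(t, Token::Error);
    println!("{:?}", t);
}
```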
That doesn't help; the same thing is UB in C, too, so then the original code is also incorrect. You'll need to find an alternative instead of translating the incorrect code. There's read_unaligned in Rust, but that doesn't work, either, since the pointers you obtain by address manipulation don't point to single u32 or u16 objects.
Whenever I see pointer casts like this, one of my first questions is always, "does this consider platform endianness properly?" I'm not sure this does. Taking a &[u8] and using u32::from_le_bytes and friends would make this aspect far clearer, and also resolve the problem of unaligned reads.
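As a sketch of what that could look like (the function name and table contents are invented for illustration), reading a u32 out of a byte slice with the endianness spelled out needs no unsafe and has no alignment requirement:

```rust
use std::convert::TryInto;

// Read a little-endian u32 starting at an arbitrary byte offset.
// Panics if fewer than 4 bytes remain; no alignment requirement.
fn read_u32_le(bytes: &[u8], offset: usize) -> u32 {
    let chunk: [u8; 4] = bytes[offset..offset + 4]
        .try_into()
        .expect("slice has exactly 4 bytes");
    u32::from_le_bytes(chunk)
}

fn main() {
    let table: [u8; 8] = [0x78, 0x56, 0x34, 0x12, 0xff, 0x00, 0x00, 0x00];
    assert_eq!(read_u32_le(&table, 0), 0x1234_5678);
    // An odd, unaligned offset is fine here, unlike a raw pointer cast:
    assert_eq!(read_u32_le(&table, 1), 0xff12_3456);
}
```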
Regarding endianness: the original library will run on Windows only, not cross-platform, so endianness shouldn't be an issue.
Yeah, I'm aware of that. The original code doesn't have many tests, but if I take this further than just trying to get to know the language better, then writing more tests is one of my more important next steps.
Thanks for linking those. I'll definitely have a look at them!
Would it though? The following is just me talking off the top of my head (so tell me if I'm wrong and I'll shut up about it and find another solution), but here's how I interpreted the code. Assume that the lexed source code starts with the token string "scriptname foo". (I hadn't posted the tables because I thought them unnecessary for the code itself, but I might have been wrong. I have "cleaned"/preprocessed the tables so I can make some manual calculations; they can be found HERE.)
I went ahead and calculated the first iteration for a string, and I got proper values and indices (unless I made calculation mistakes). I think that's because inside the trans table the original developer did this for almost all elements: (0x80000000 | 4 << 16) | 0x0, where in this case 4 is the int value of Tok_Error, which is defined in the parser.c file as an anonymous enum. In combination with (token_buffer[i] >> 16) & 0xfff it seems he gets what he wants.
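If I'm reading the packing right, it can be replayed in a few lines; this is only my interpretation of the two expressions from the C code, not something taken from the gist:

```rust
fn main() {
    // Assumed layout per the post: high bit set as a flag, the token
    // id in bits 16..=27, and the low 16 bits left at zero here.
    const TOK_ERROR: u32 = 4; // Tok_Error from the anonymous enum in parser.c
    let entry: u32 = (0x8000_0000 | (TOK_ERROR << 16)) | 0x0;
    assert_eq!(entry, 0x8004_0000);

    // The extraction used in the C code: (token_buffer[i] >> 16) & 0xfff
    // masks away the flag bit and recovers the token id.
    let token_id = (entry >> 16) & 0xfff;
    assert_eq!(token_id, TOK_ERROR);
}
```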
I also found this comment in the parser.c file
This file implements a Papyrus parser. Lexical analysis is done by a fast generated state machine based lexer. The lexer does not produce separate tokens for keywords, instead a perfect hash is used to lookup keywords during parsing. The parser is a simple hand-written LL(k). The parser uses a stack-like data structure for building syntax lists before committing them into the syntax tree.
So as far as I can tell, his method is weird, but it works. Would I write the same code? Most likely not. All of you have pointed out that there are many pitfalls to consider. I have to think about this a bit more and try to really understand how and why he did what he did. But currently I'm leaning towards writing a completely new lexer to leverage Rust fully (or using one of the crates).
By "proper indices", do you mean that the pointers always ended up aligned correctly? If so, great, but the code doesn't actually enforce that (on the contrary, it suggests that the opposite is possible), so it's extremely brittle and in bad style even if technically correct. At the very least, if indexes are always meant to be 4-byte-aligned, then the table values should probably be divided by 4 and the calculated indexes should be used to index into the array of 32-bit integers as-is.
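A sketch of that suggestion (function and table names are invented): keep the table as &[u32] and convert the byte offset into an element index, instead of byte-stepping a cast pointer:

```rust
// If the byte offsets in the tables are always 4-byte-aligned, they
// can be divided by 4 and used to index a &[u32] directly, with no
// pointer casts and no unsafe. Bounds and alignment are both checked.
fn lookup(trans_table: &[u32], byte_offset: u32) -> u32 {
    debug_assert_eq!(byte_offset % 4, 0, "offset must be 4-byte-aligned");
    trans_table[(byte_offset / 4) as usize]
}

fn main() {
    let trans_table: [u32; 3] = [10, 20, 30];
    // Byte offset 8 corresponds to element index 2.
    assert_eq!(lookup(&trans_table, 8), 30);
}
```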