Having a hard time reading and understanding an unsafe block

I am looking at the code in the punkt crate https://crates.io/crates/punkt used for sentence segmentation.

The code has a few unsafe blocks all pretty much involving the same pattern (see here for reference rust-punkt/trainer.rs at master · ferristseng/rust-punkt · GitHub)

Here is what it looks like

&mut *(data as *const TrainingData as *mut TrainingData)

or in another place

&mut *(t as *const Token as *mut Token)

The code does provide a helpful comment as a clue to what's going on

// Annotating the tokens requires an unsafe block, but it won't modify any pointers,
// just will modify some flags on the tokens.

From reading the Rust book I can see why this code is unsafe, it is dereferencing raw pointers, but that's it.

But I am having a hard time figuring out what this pattern is doing or why it is needed.

I would appreciate it if anyone can shed some light on what's going on.

That code raises some red flags for me.

Essentially, what they are doing is taking the &mut TrainingData that was passed in as an argument and storing it as a &TrainingData in the ReclassifyIterator. This is perfectly fine because any &mut T can always be coerced into a &T.

Next, because of the way their ReclassifyIterator is structured, as they are iterating over it, they also want to add items to the TrainingData. This causes some issues because both the ReclassifyingIterator and the for-loop need access to the TrainingData and the naive implementation would run afoul of the borrow checker (ReclassifyingIterator has a &TrainingData, but in order to insert abbreviations, the for-loop needs a &mut TrainingData, and the borrow checker only lets you have mutliple &TrainingData references XOR one &mut TrainingData).

To hack around the situation they've put themselves in, they've made data a shared (immutable) reference and are then using pointer casting to "transmute away" the immutability so their data: &TrainingData variable is accessible as a &mut TrainingData.

I think the authors need to read this part of The Nomicon about transmutes:

  • Transmuting an & to &mut is Undefined Behavior. While certain usages may appear safe, note that the Rust optimizer is free to assume that a shared reference won't change through its lifetime and thus such transmutation will run afoul of those assumptions. So:
    • Transmuting an & to &mut is always Undefined Behavior.
    • No you can't do it.
    • No you're not special.

The correct solution would be to rewrite ReclassifyingIterator such that you can update it while iterating.

It's hard to suggest a fix without knowing all the context or reading the source code in-depth, but this might mean they don't implement the Iterator trait and take some sort of while let Some(token) = reclassify_iter.next() { let data = &mut reclassify_iter.data; ... } loop. You could also restructure things so that iterating doesn't require training data at all.

12 Likes

The operation on t is probably undefined behavior [1] since t is a &Token reference (item of a slice iterator). Try running some usage of this code through miri and see if it reports the problem.


  1. I haven't fully understood the type of data yet ↩︎

3 Likes

Thanks, this makes sense. I was going to write my own implementation based on the Python NLTK library and found this crate which was pretty old and unmaintained, but I thought it would be easier to follow the Rust implementation since I am writing Rust.

But looking at this maybe it would be easier to reimplement it on my own based on the Python one rather than follow what they are doing. I thought there was some hidden performance reason why they are doing this.

It might be worthwhile to make your own implementation that draws inspiration from NLTK and rust-punkt. That way you can create a library that matches how you think and is tailored to your specific use case. As a happy bonus, it also gives you a better understanding of how tokenization works, which is very useful when you are troubleshooting funny bugs or unintuitive results later on.

The rust-punkt crate is only about 2kloc of Rust, so you should be able to get a proof-of-concept version up and running over a weekend. The repo also contains loads of examples and tests you can reuse, so it'll be a great reference.

To be fair to the author, while the unsafe code in that repo is unsound and reasonably simple to fix, I can also understand the thought process that was going on when they wrote the code. Depending on how much Rust you've written before, you might find you unconsciously avoid situations where you "fight the borrow checker" and feel the need to reach for unsafe.

1 Like