Asking for a short review of my first crate

Dear Rustaceans,

I'm learning Rust, and I just published my first crate, a simple Bayesian spam classifier (under 150 lines, excluding comments and documentation).

https://crates.io/crates/bayespam

I would like a short review from an experienced Rustacean.
Are there things that burn your eyes, or that I could improve?

Then I'll update it to v0.2.0 on the way to v1.0.0.

Thank you very much for sharing your knowledge.

A quick review on my phone, pardon the typing and brevity...

Firstly, I'd be tempted to provide a probability rather than a Boolean prediction. Given that Bayesian statistics only provides probabilities inherently, it seems both more honest and more useful.
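A minimal sketch of what that could look like (the names and the threshold are mine, not the crate's; the score body is a placeholder): expose the raw probability and keep the Boolean as a thin wrapper.

```rust
pub struct Classifier;

impl Classifier {
    /// Estimated spam probability in [0.0, 1.0].
    /// (Placeholder body; the real crate computes a Bayesian score.)
    pub fn score(&self, _msg: &str) -> f32 {
        0.5
    }

    /// Thin convenience wrapper for callers who only want a yes/no answer.
    pub fn is_spam(&self, msg: &str) -> bool {
        self.score(msg) > 0.8
    }
}
```

That way callers can pick their own threshold, or rank messages by score.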

It's a bit odd having all the code in a single module other than lib.rs. Doesn't really hurt, but why?

Why have Model as well as Classifier types? It looks like your code could be clearer if a Model were eliminated.

Why eliminate Unicode characters?

Why is get_word_list a method? It would be cleaner as a free function, or as an associated function without the &self argument.

Why not use a struct rather than an array for your counters? Then your code will require less remembering that 0 is ham and 1 is spam or vice versa.

On line 147ish, using contains_key followed by get_mut with unwrap is an antipattern. Instead, just use get_mut with if let Some([ham_count, spam_count]). That avoids the extra check, spares the reader from worrying about the unwrap, and saves a few lines of code.
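A small self-contained sketch of that pattern (the [ham, spam] array layout mirrors the crate's current counters; the function name is mine):

```rust
use std::collections::HashMap;

// One lookup via get_mut(), with the Option handled by `if let`
// instead of contains_key() followed by get_mut().unwrap().
fn bump_ham(table: &mut HashMap<String, [u32; 2]>, word: &str) {
    if let Some([ham_count, _spam_count]) = table.get_mut(word) {
        *ham_count += 1;
    }
}
```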

On lines 175 and 182, use the product method: ratings.iter().map(|x| 1.0 - x).product()
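As a runnable illustration of both spots (the sample values and the function name are invented for the sketch):

```rust
// Loop-free rewrite of the two accumulations around lines 175 and 182.
fn combine(ratings: &[f32]) -> (f32, f32) {
    let product: f32 = ratings.iter().product();
    let alt_product: f32 = ratings.iter().map(|x| 1.0 - x).product();
    (product, alt_product)
}
```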

3 Likes

Dear @droundy,

Thank you so much. I really appreciate your review.

Your questions read more like suggestions than questions to me.

I will add all these improvements soon.

Your Model::new() method looks a bit strange to me... You're passing in a boolean flag which switches between two completely separate implementations. What about breaking it into two constructors (Model::empty() and Model::from_pre_trained())?

The constructor using pre-trained data should also return a Result instead of unwrap()-ing. That way callers can handle file or parsing errors gracefully instead of crashing.

It can also return serde_json::from_reader(json_file) directly instead of storing the Model in a temporary deserialized variable then creating a new Model which just copies all deserialized's fields across.
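A rough sketch of that split (std-only and heavily simplified: a whitespace word count stands in for the real serde_json::from_reader deserialization, and the field is invented for the example):

```rust
use std::io::{self, Read};

#[derive(Default)]
pub struct Model {
    token_count: usize, // simplified stand-in for the real model fields
}

impl Model {
    /// Fresh, empty model: replaces Model::new() with create_new = true.
    pub fn empty() -> Model {
        Model::default()
    }

    /// Pre-trained model: replaces Model::new() with create_new = false.
    /// Returning a Result lets callers handle I/O or parse errors instead
    /// of crashing; the real crate would just serde_json::from_reader here.
    pub fn from_pre_trained<R: Read>(reader: &mut R) -> io::Result<Model> {
        let mut raw = String::new();
        reader.read_to_string(&mut raw)?;
        Ok(Model { token_count: raw.split_whitespace().count() })
    }
}
```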

Likewise, Model::save() should return a Result (hint: see the docs on ?) instead of unwrapping.

The Classifier::new() constructor should probably accept an existing Model instead of invoking the Model::new() constructor internally. That way it doesn't matter where I get my Model from (google "dependency injection").
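The dependency-injection shape could look like this (Model is reduced to a stub for the sketch):

```rust
#[derive(Default)]
pub struct Model; // stub standing in for the crate's real Model

pub struct Classifier {
    model: Model,
}

impl Classifier {
    /// Take the model as an argument instead of building it internally:
    /// the caller decides whether it comes from a file, the network, etc.
    pub fn new(model: Model) -> Classifier {
        Classifier { model }
    }
}
```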

The Classifier::get_word_list() method can drop the return and trailing ;.

It feels weird accepting an is_spam boolean inside Classifier::train() because you're essentially doing the same thing as Model::new() with its create_new parameter. Maybe you want to split these into two methods (Classifier::train_spam() and Classifier::train_ham())?
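One way to sketch that split (the counter fields are invented placeholders for the real training logic):

```rust
#[derive(Default)]
pub struct Classifier {
    spam_seen: u32, // placeholder state for the sketch
    ham_seen: u32,
}

impl Classifier {
    /// Public API without the boolean flag.
    pub fn train_spam(&mut self, msg: &str) {
        self.train(msg, true);
    }

    pub fn train_ham(&mut self, msg: &str) {
        self.train(msg, false);
    }

    /// The flag survives only as a private implementation detail.
    fn train(&mut self, _msg: &str, is_spam: bool) {
        if is_spam {
            self.spam_seen += 1;
        } else {
            self.ham_seen += 1;
        }
    }
}
```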

Classifier::score()... wow that's a lot of math. Can you split it up and give each method useful names so we can understand what it's trying to accomplish?

I agree with @zenoxygen in that L147 should be replaced with if let Some([ham_count, spam_count]) = self.model.token_table.get_mut(&h). When reading Rust code unwrap()s jump out as a possible panic!() site and even though it's safe here, it adds unnecessary cognitive load (i.e. I'm lazy and you made me think :stuck_out_tongue:).

2 Likes

Use cargo clippy. Always, always.

I see a lot of C-ish code: a lot of returns, too many muts, and too many types written out (e.g. let mut alt_product: f32 = 1.0 should be let mut alt_product = 1.0).

A lot of your loops can be rewritten in a more functional style, e.g. let product: f32 = ratings.iter().product(); (ironically you need the type hint here :wink: )

Your method is_spam doesn't need &mut self (because score doesn't need &mut self either).

There are too many unwraps. Better/proper error handling is preferred (e.g. with the snafu crate).

The regex in get_word_list would be better wrapped in a lazy_static! macro (see the regex docs for the reasons).
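A std-only sketch of the underlying idea (lazy_static! or once_cell do the same for a real regex::Regex): build the expensive value once, on first use, then share it. The Pattern type here is a made-up stand-in for a compiled regex.

```rust
use std::sync::OnceLock;

/// Stand-in for a compiled regex; constructing it is the costly step
/// that get_word_list currently repeats on every call.
struct Pattern(String);

impl Pattern {
    fn compile(src: &str) -> Pattern {
        Pattern(src.to_string()) // a real Regex::new() would parse here
    }
}

/// Compiled on first access, cached for every later call.
fn token_pattern() -> &'static Pattern {
    static PATTERN: OnceLock<Pattern> = OnceLock::new();
    PATTERN.get_or_init(|| Pattern::compile(r"[a-z']+"))
}
```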

Your token_table should not contain an array with two u32, but instead a named struct, e.g.:

struct Counter {
    spam_counter: u32,
    ham_counter: u32,
}

struct Model {
    token_table: HashMap<String, Counter>
}

You then don't need the summary fields anymore (you can of course keep them for caching), but you have to be careful that the data always stays in sync. That's somewhat error-prone. Better to provide a "helper method" that gives you the spam and ham total counts by summing over your hashmap.
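Such helper methods could look like this, building on the Counter/Model structs above (the method names are my suggestion):

```rust
use std::collections::HashMap;

struct Counter {
    spam_counter: u32,
    ham_counter: u32,
}

struct Model {
    token_table: HashMap<String, Counter>,
}

impl Model {
    /// Derive the totals on demand so they can never drift out of
    /// sync with token_table.
    fn spam_total_count(&self) -> u32 {
        self.token_table.values().map(|c| c.spam_counter).sum()
    }

    fn ham_total_count(&self) -> u32 {
        self.token_table.values().map(|c| c.ham_counter).sum()
    }
}
```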

2 Likes

Dear @Michael-F-Bryan,

Thank you so much for your review, you have a lot of good advice.

I like the idea of dependency injection for model handling, and of splitting into two different methods for model loading and spam/ham training.

I will improve the readability of the Classifier::score() method and correct my misuse of return.

2 Likes

Dear @hellow,

Thank you for taking the time to share your precious knowledge.

It's very nice to discover clippy, even if unfortunately only now.

I definitely should rewrite the loops in a functional style and use a macro for the regex.

Like @droundy also suggested, using a structure for token_table will be much better.

And of course, also adding a method dedicated to spam/ham counting.

1 Like

@zenoxygen

I did something™ (and I hope I did not break anything)

1 Like

@hellow,

That's awesome, your changes are highly appreciated.

I love the pattern matching you added in score. And of course all the other things. I learned a lot. Let's merge this now.

Secretly, I'm sad not to have the pleasure of improving my code anymore. :stuck_out_tongue:

Thank you very much for your contribution.

Some more observations:

  • Publicly-exposed types should implement common traits if they can, e.g. it would be nice to #[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize)] for Classifier (and in turn, Model).
  • {Model,Classifier}::new() and {Model,Classifier}::save() would be better off taking an io::Read and an io::Write, respectively. You can't be sure your users want to use files on disk. Maybe they want to send the serialized model through HTTP or something. Honestly I wouldn't even put these methods on Classifier at all. Rather, use dependency inversion, and let the constructor of Classifier take a Model as its argument, and expose the Model type publicly.
  • get_word_list(): it's not idiomatic in Rust to name non-getter functions get_xyz(). You're doing a somewhat heavy computation in that method; call it something like load_word_list().
  • In the same function, it's unnecessary to recreate the regular expression every time you call it. Create it once in new(), put it in the Classifier instance, and then use it through self in load_word_list().
  • The score() function similarly re-loads the classifier upon every call. Consider using lazy_static! instead.
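A minimal sketch of the io::Read/io::Write idea (heavily simplified: a single u32 stands in for the real model data, which would be serialized as JSON):

```rust
use std::io::{self, Read, Write};

pub struct Classifier {
    token_count: u32, // simplified stand-in for the real model data
}

impl Classifier {
    /// Write to anything that implements io::Write: a File, a Vec<u8>,
    /// a TCP stream... the caller chooses.
    pub fn save<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        writer.write_all(self.token_count.to_string().as_bytes())
    }

    /// Read back from any io::Read source.
    pub fn load<R: Read>(reader: &mut R) -> io::Result<Classifier> {
        let mut raw = String::new();
        reader.read_to_string(&mut raw)?;
        let token_count = raw
            .trim()
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        Ok(Classifier { token_count })
    }
}
```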
1 Like

Dear @H2CO3,

Thank you so much for all these observations, it helps me a lot. :+1:

Dear @droundy, @Michael-F-Bryan, @hellow, @H2CO3,

Thanks to all your suggestions, I just released v0.2.0 (https://crates.io/crates/bayespam).

I worked on top of @hellow's contributions. Changes since my first implementation are:

  • Fix Classifier (now use two constructors new and new_from_pre_trained)
  • Fix Model (keeping only a constructor new_from_pre_trained)
  • Fix token_table (now use a structure Counter rather than an array)
  • Fix model training (now use two different methods train_spam and train_ham)
  • Fix message tokenization method (now called load_word_list)
  • Fix extra check of words in token_table (now use get_mut with if let Some(counter))
  • Fix score method loops (now use the product method)
  • Improve the readability and comprehension of the score method (split and create a new rate_words method)
  • Change identification method name (now identify instead of is_spam)
  • Delete useless returns, muts and types
  • Delete regular expression in load_word_list
  • Add two methods that give the spam and ham total counts (spam_total_count and ham_total_count)
  • Add proper error handling (not sure it's perfect but it works)
  • Update tests (now test_new, test_from_pre_trained)
  • Update Travis CI configuration (add clippy)
  • Update documentation and README

Next step before releasing v1.0.0 could be to add io::Read and io::Write for loading/saving a Classifier and a Model.

Looking at the documentation for Classifier, there's no reason to list the type of every argument. It's already written in the signature of the function. In the case below, you're even skipping important type information such as the fact that it takes a mutable reference to the file.

pub fn save(&self, file: &mut File, pretty: bool) -> Result<(), Error>

Save the Model into the given file.

  • file - File. The file to write to.
  • pretty - Boolean. Pretty-printed JSON or not.

I would probably have written it more in this manner:

pub fn save(&self, file: &mut File, pretty: bool) -> Result<(), Error>

Save the model to file as json. The json will be pretty printed if pretty is true.

1 Like

Dear @alice,

Thank you so much for your feedback.

You're right, it's fixed.

I'm still puzzled as to why Model and Classifier are different types.

1 Like

@zenoxygen

I see three places that look similar to this:

match Model::new_from_pre_trained(file) {
    Ok(pre_trained_model) => Ok(Classifier {
        model: pre_trained_model,
    }),
    Err(e) => Err(e),
}

This is exactly the reason for the question mark operator.
You can simplify these three places with it (probably clippy should lint this).

Ok(Classifier {
    model: Model::new_from_pre_trained(file)?,
})
1 Like

Even simpler:

Model::new_from_pre_trained(file)
    .map(|model| Classifier { model })
1 Like

Ok, I deleted Model, keeping only the Classifier structure.
I also simplified the code like you suggested.