Beginner level: Is my code idiomatic? (not urgent) :-)

Hi Rust community,

I'm learning Rust. Here is one of my earliest attempt at testing Rust for its potential to tackle NLP (natural language processing).

I believe that this short piece of code is very easy to read, but I'm not sure, it would be considered idiomatic though.

Any suggestion, comments, remark, advice is welcome.

Thanks in advance!

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // Opening file: ("100-0.txt" is a large text file:
    // http://www.gutenberg.org/files/100/100-0.txt)

    let path = "100-0.txt";
    let input = File::open(path).expect("- - Pb reading file: test.txt");
    let buffered = BufReader::new(input);

    // Storing words counts in Map:

    let mut m: HashMap<String, i32> = HashMap::new();

    for line in buffered.lines() {
        let ln = line.expect("- - Pb parsing line (from test.txt)");
        for word in ln.split_whitespace() {
            *m.entry(word.to_string()).or_default() += 1;
        }
    }

    let keys = m.keys();

    // Storing map in vect for sorting according to count:

    struct WordCount {
        word: String,
        count: i32,
    }

    let mut words_count_vect: Vec<WordCount> = vec![];

    for item in keys {
        words_count_vect.push(WordCount {
            word: item.to_string(),
            count: m[item],
        });
    }

    // Sorting vect:
    words_count_vect.sort_by(|a, b| b.count.cmp(&a.count));

    for items in &words_count_vect[0..10] {
        println!("{} \t{}", items.count, items.word);
    }
}

1 Like

I'd probably use an unsigned type for counts, but it's not a big concern.

Introducing struct WordCount in the middle of the function is a bit odd.

You can use iterators to shorten (and make more efficient) the conversion between HashMap and Vec:

    let mut words_count_vect: Vec<WordCount> = m.into_iter()
        .map(|(word, count)| WordCount { word, count })
        .collect();

You can also use iterators for the reporting:

    // for items in &words_count_vect[0..10] {
    // Maybe sometimes we have < 10 words:
    for items in words_count_vect.iter().take(10) {
        println!("{} \t{}", items.count, items.word);
    }
6 Likes

Thank you! :slight_smile:

Iterators are generally considered idiomatic, but I hasten to add that this doesn't mean you always need to use them. Readability is important too! Some complex iterator chains can be expressed as simple for loops.

Anyway, another use of iterators you might want to use is to convert your HashMap to a Vec:

Basically, use this trait method: std::iter::FromIterator - Rust which is implemented by Vec.

If you want the WordCount struct you might need to map (String, i32) to it. I'd also change i32 to a unsigned your. usize is normal for counting (len() returns usize)

Also thanks for posting, because I didn't realise you could do what I've just suggested. It seems like a really useful method though.

1 Like

This might be rewritten to

use std::cmp::Reverse;
words_count_vect.sort_unstable_by_key(|a| Reverse(a.count));
  • IMO _by_key is generally more clear than _by.
  • Since words_count_vect comes from an HashMap the order has no meaning anyway, so you might as well use the faster and non-allocating unstable sort.
4 Likes

While we're on that tangent :

vec.sort_(unstable)_by_key(|item| item.field);
for item in vec.iter().rev().take(10) { /* do foo */ }

would be an alternative to using std::cmp::Reverse, since iterating on a Vec front-to-back or back-to-front makes no difference.

2 Likes

Some minor stylistic issues:

  1. You can save a lot of recompilation time during testing by accepting a file name at the command line:

    let path = std::env::args().first().unwrap_or("100-0.txt");
    
  2. Shadowing variable names while converting the same logical thing between types is generally considered idiomatic:

    let input = File::open(path).expect("...");
    let input = BufReader::new(input);
    

    since BufReader takes ownership over the File, there's no risk of accidentally using it again, and in other cases, where it might not take ownership, the type system usually helps avoid accidentally using the wrong value.

  3. counts or counter would probably be a better name for the HashMap than m.

  4. I actually found myself writing this same pattern in a couple places yesterday, and found it handy to extract the functionality into a new Counter type, with an add() method for inserting objects into the collection. I based it on a BTreeMap instead of a HashMap, since I wanted to do reporting that depended on having the keys sorted, but the idea is the same. Whether this is better or worse depends a bit on your read:write ratio. How often will you use this data structure? If it's just once, keeping it inline makes sense. Twice, maybe create a basic Counter. Ten times? Flesh it out with FromIterator, and IntoIterator traits, methods to access min / max values, bucket by percentiles, etc, or look for an existing crate on crates.io.

  5. The WordCount struct feels like a detour just for sorting into your vec. I would just use a tuple for this.

  6. No need to explicitly declare the type of word_count_vec. It will be inferred the first time you push an object into it. And since you already know the size, you can save some processor cycles by doing let words_count_vec = Vec::with_capacity(m.len());.

  7. m.keys() gives you a borrow into the keys of your HashMap, but you are 1. accessing the key and value and 2. never using the HashMap again, so if you replace the whole thing with iterating directly over the HashMap, you can avoid re-allocating the strings (item.to_string()).

    for (word, count) in m {
        words.count_vect.push((word, count));
    }
    

    As others have said, you can also use FromIterator to create the Vec directly, though in this case you will need to declare at least part of the type, because FromIterator can create lots of different types. If you are using a tuple instead of the WordCount type, it becomes even cleaner:

    // The following all do the same thing.  Pick one.
    let word_count_vec: Vec<(String, i32)> = m.into_iter().collect(); // Annotate the type stored in word_count_vec
    let word_count_vec: Vec<_> = m.into_iter().collect();  // Annotate the container type stored in word_count_vec.  Let the compiler infer the item type.
    let word_count_vec = m.into_iter().collect::<Vec<(String, i32)>>>();  // Declare the type whose FromIterator implementation we are using to call .collect().  
    let word_count_vec = m.into_iter().collect::<Vec<_>>>();  // Declare the container type whose FromIterator implementation we are using to call .collect().  Let the compiler infer the item type.
    

    I prefer the second of the four in most cases.

These are generally minor issues, though a couple of them (not cloning each word to String a second time, declaring the size of the Vec up front) will have some performance benefits. Overall, your code looks great. Keep it up!

5 Likes

Thanks a lot for the precious feedback.
I'm really starting to like the Rust language.

1 Like

Ok Rust experts, here's how it looks now:
Cool!

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // Opening file: ("100-0.txt" is a large text file:
    // http://www.gutenberg.org/files/100/100-0.txt)

    let path = "100-0.txt";
    let msg = format!("- - Pb reading file: {}", path);
    let input = File::open(path).expect(&msg);
    let input = BufReader::new(input); // shadowing on purpose.

    // Storing words counts in Map:
    let mut word_map: HashMap<String, u32> = HashMap::new();

    for line in input.lines() {
        let msg = format!("- - Pb parsing line (from {})", path);
        let ln = line.expect(&msg);
        for word in ln.split_whitespace() {
            *word_map.entry(word.to_string()).or_default() += 1;
        }
    }

    // map to vect:
    let mut word_count_vect: Vec<_> = word_map.into_iter().collect();

    // Sorting vect:
    word_count_vect.sort_by(|a, b| b.1.cmp(&a.1));

    for x in word_count_vect.iter().take(10) {
        println!("{} \t{}", x.0, x.1);
    }   
}
1 Like

I appreciate keeping line length down, but I'd probably insert the format call directly into the expect. You might want to break the line probably, but the code becomes slightly easier to read, since the reader doesn't have to wonder whether msg is reused later on, and doesn't have to look to see where it is used at all.

    let msg = ;
    let input = File::open(path).expect(&format!("- - Pb reading file: {}", path));

There is also a minimal memory reduction benefit, since the String is freed earlier, right after it is used.

1 Like

Furthermore, in this case it probably would be idiomatic to use unwrap_or_else:

let input = File::open(path)
    .unwrap_or_else(|_| panic!("- - Pb reading file: {}", path));
4 Likes

I don't see the benefit of unwrap_or_else to then just panic. So what's the rationale here?

I expect unwrap_or and it's close cousin unwrap_or_else to be used to provide an alternative value, the suggestion being that normal operation could continue afterwards.

2 Likes

The rational would be to not allocate the error string on success case.

7 Likes

This generally looks like good code to me. This *m.entry(word.to_string()).or_default() += 1; line in particular looks beautiful, especially if you're just recently learning the language.

A couple things:

  • I'm not sure why you're prefixing your error messages with "- - ". I would just remove it.

  • I'm not sure why you're assigning m.keys() to a variable, you could simply remove that and make your loop for item in m.keys().

  • It's usually abbreviated "vec", I've never heard someone say "vect".

  • Instead of this:

    words_count_vect.sort_by(|a, b| b.count.cmp(&a.count));
    

    You can do this:

    words_count_vect.sort_by_key(|item| item.count);
    
  • Instead of this:

    for items in &words_count_vect[0..10]
    

    You can do this:

    for items in words_count_vect.iter().take(10)
    

    Which is not only more idiomatic, it avoids panicking if the vec has fewer than 10 elems.

    • Extra note: if you do use indexing, you can do [..10] instead of [0..10].
    • Extra note 2: you would usually name the iteration variable item, not items. Singular.
  • Since the wordcount count cannot sensibly be negative, you might want to store it as a u32 instead of an i32.

  • Although lifetimes are difficult and beginners get tripped up on lifetimes all the time, this is actually a case where it could work well.

    If you rewrote WordCount as:

    struct WordCount<'a> {
        word: &'a str,
        count: i32,
    }
    

    Then when you populate your vec, you could replace word: item.to_string() with word: item (I think. You might have to do word: item.as_str() if that gives you trouble.) and it would avoid making a new allocation and copying the memory.

  • Instead of populating words_count_vec with push, you can create it from an iterator, as such:

    let mut words_count_vect: Vec<WordCount> = m.keys()
        .map(|item| WordCount {
            word: item.to_string(),
            count: m[item],
        })
        .collect();
    
  • Instead of iterating over your map keys, then performing a lookup by key (m[item]), you can simply iterate over the key/value tuples, as such:

    for (word, &count) in &m {
        words_count_vec.push(WordCount {
            word: word.to_string(),
            count,
        });
    }
    

    (Note: previous suggestions can be combined together with each other).

6 Likes

I've used that pattern to death in almost every task these past few months, so much so that I've made a short type and function I often copy paste between projects.

1 Like

Thank you very much!

1 Like

Iterators are generally considered idiomatic, but I hasten to add that this doesn't mean you always need to use them. Readability is important too! Some complex iterator chains can be expressed as simple for loops.

That's true. I stumbled over a problem where I wanted to short-circuit an iterator from inside a closure. This is not possible because you don't return from the outer function if you return from a closure. I then rewrote iterator-heavy code to for loops and use break.

1 Like

Sounds like you wanted try_fold/try_for_each, however you'll need to short-circuit with Result::Err, which is not very clear about what you're actually doing. There's a ControlFlow enum which is more clear, however it's unstable.

3 Likes