[Beginner] Improvements to anagram searching function (exercism)

This is an easy problem from exercism; it's about finding out which candidate words are anagrams of a "reference word".

It passes all tests. The idea is to convert the words into HashMaps, to take advantage of the comparison allowed between them.

One current issue is that &cand.to_lowercase() is repeated.

One possible issue is that filter is used several times for readability, which may be wrong.

use std::collections::{HashMap, HashSet};

/// Puts all true anagrams of `word` into a HashSet.
pub fn anagrams_for<'a>(word: &'a str, possible_anagrams: &[&'a str]) -> HashSet<&'a str> {
    let lw_word = word.to_lowercase();
    let word_hm = letter_count(&lw_word);
    let mut out = HashSet::new();

    possible_anagrams
        .iter()
        .filter(|cand| cand.len() == word.len())
        .filter(|cand| lw_word.ne(&cand.to_lowercase()))
        .for_each(|cand| {
            let candidate_hm = letter_count(&cand.to_lowercase());
            if word_hm.eq(&candidate_hm) {
                out.insert(*cand);
            }
        });

    out
}

/// Convert word into a HashMap of character-count.
/// We don't need i32 because HashMaps can be compared!
fn letter_count(word: &str) -> HashMap<char, u32> {
    let mut count = HashMap::new();
    word.chars().for_each(|c| {
        let v = count.entry(c).or_insert(0);
        *v += 1
    });
    count
}

Yeah, I'd get rid of that too. Perhaps:

    possible_anagrams
        .iter()
        .map(|cand| (cand, cand.to_lowercase()))
        // ...

It's generally fine unless you end up doing too many unnecessary things that could be avoided within a single closure. You could filter a third time instead of for_each, even. It's a matter of style to some extent.

    // `out` is no longer a variable and the function ends with
    possible_anagrams
        .iter()
        .copied() // changes `Item` from `&&str` to `&str`
        // ...
        .filter(|cand| {
            let candidate_hm = letter_count(&cand.to_lowercase());
            word_hm.eq(&candidate_hm)
        })
        .collect()

word isn't put in the returned HashSet, so you can make this change.

-pub fn anagrams_for<'a>(word: &'a str, possible_anagrams: &[&'a str]) -> HashSet<&'a str> {
+pub fn anagrams_for<'a>(word: &str, possible_anagrams: &[&'a str]) -> HashSet<&'a str> {
+//          this part changed ^^^^

Which allows this to compile.

fn example() -> HashSet<&'static str> {
    let word = String::new();
    anagrams_for(&word, &[])
}

fn example_2<'a>(list: &[&'a str]) -> HashSet<&'a str> {
    let word = String::new();
    anagrams_for(&word, list)
}

Everything else seems fine to me, except some Unicode nits, which don't really matter in the context of this learning exercise. The bottom of my reply is about those.

If we ignore those and keep your approach the same, there are various potential optimizations. You could use the same HashMap for all candidates, and you could even avoid allocating new Strings for the lowercase words if you jump through enough hoops (create iterators of lowercased chars and compare those). But I don't think they're worth it unless you think you'd learn something significant from trying them out.

So, this next part that I'm about to point out is probably fine given that this is a learning exercise. But there are some parts that wouldn't work quite as intended with certain inputs. For example, you have

        .filter(|cand| cand.len() == word.len())
        .filter(|cand| lw_word.ne(&cand.to_lowercase()))

But without normalizing inputs, a word may have multiple UTF8 representations (a different sequence of code points), including different lengths. The lowercase version might be a different length than the non-lowercase version. And your HashSet of chars approach doesn't properly account for graphemes that consist of multiple chars (Unicode scalar values), either. Two words with an accent char which applied to different base letters could compare the same, for example.

Normalization and splitting on Unicode graphemes aren't supported by std; you need a third party library to address these problems. That is a reason why learning exercises tend not to require doing the right thing. The weird part about this exercise is that they specifically said they were expanding the problem to Unicode, but then don't actually require an approach that handles all the Unicode issues.[1]

It's not a problem for you in the context of learning exercise, but is something to be aware of should you end up working with Unicode comparisons in a more production context.


  1. If any of them? Maybe they test some 1-to-1 non-ASCII lowercase letters or something. (I don't have an account to play around and see.) â†Šī¸Ž