Avoiding duplicating strings

I'm practicing Exercism's Rust Track. One of the questions is as follows:

An anagram is a rearrangement of letters to form a new word.
Given a word and a list of candidates, select the sublist of anagrams of the given word.

Given "listen" and a list of candidates like "enlists" "google" "inlets" "banana" the program should return a list containing
"inlets".

The solution is case insensitive, which means "WOrd" is the same as "word" or "woRd". It may help to take a peek at the std library for functions that can convert between them.

The solution cannot contain the input word. A word is always an anagram of itself, which means it is not an interesting result. Given "hello" and the list ["hello", "olleh"] the answer is ["olleh"].

You are going to have to adjust the function signature provided in the stub in order for the lifetimes to work out properly. This is intentional: what's there demonstrates the basics of lifetime syntax, and what's missing teaches how to interpret lifetime-related compiler errors.

Try to limit case changes. Case changes are expensive in terms of time, so it's faster to minimize them.

If sorting, consider sort_unstable which is typically faster than stable sorting. When applicable, unstable sorting is preferred because it is generally faster than stable sorting and it doesn't allocate auxiliary memory.

My code, that works and passes all test cases, is as follows:

use std::collections::{hash_map::HashMap, HashSet};

pub fn anagrams_for<'a>(word: &str, possible_anagrams: &[&'a str]) -> HashSet<&'a str> {
    let mut freq_map: HashMap<String, u32> = HashMap::new();
    let mut indices: Vec<String> = Vec::new();

    word.chars().for_each(|ch| {
        let c = to_lowercase_if_not_already(ch);
        *freq_map.entry(c.clone()).or_default() += 1;
        indices.push(c);
    });

    possible_anagrams
        .iter()
        .filter(|candidate| {
            if word.len() != candidate.len() {
                false
            } else {
                let (ana, same) = is_anagram(&freq_map, &indices, candidate);
                !same && ana
            }
        })
        .cloned()
        .collect()
}

fn to_lowercase_if_not_already(ch: char) -> String {
    if ch.is_uppercase() {
        ch.to_lowercase().to_string()
    } else {
        ch.to_string()
    }
}

fn is_anagram(
    freq_map: &HashMap<String, u32>,
    indices: &[String],
    candidate: &str,
) -> (bool, bool) {
    let mut f_map: HashMap<String, u32> = HashMap::new();
    let mut score = freq_map.len();
    let mut same = true;

    for (i, ch) in candidate.chars().enumerate() {
        let c = to_lowercase_if_not_already(ch);
        if !freq_map.contains_key(&c) || f_map.get(&c) == freq_map.get(&c) {
            return (false, false);
        }
        *f_map.entry(c.clone()).or_default() += 1;
        if f_map[&c] == freq_map[&c] {
            score -= 1;
        }
        same &= c == indices[i];
    }
    (score == 0, same)
}

However, I'm little bothered by the use of clone and cloned. I've a feeling that this is a case for Rc, although I'm not sure how. Any advice?

Here is a simpler version that doesn't copy any strings.
Its asymptotic complexity is O(N * M * log M) where N is the number of possible anagrams given (i.e. possible_anagrams.len()) and M is the length of the longest possible anagram.
Note that it does allocate Strings as part of checking whether 2 string slices are anagrams.

Thank you for your response. There a few things I'd like to point out with your code:

  1. The question specifically asks to minimize the use of to_lowercase, but you convert every word to lowercase.
  2. You also convert every word to a Grapheme vector. If a candidate begins with a Grapheme tcluster he word doesn't, it can't be an anagram, so, there's no need to look at the rest of the candidate.
  3. If a candidate starts with two counts of a Grapheme cluster but the word contains only one, it can't be an anagram, so, there's no need to look at the rest of the candidate.
  4. You use sort, while the question suggests using sort_unstable.
  5. Sorting is takes linearithmic time. We can avoid sorting by comparing frequency maps, which doesn't have the short circuit behavior I mentioned in 2 and 3, but better than sorting and takes linear time.

It appears to me you didn't read the question fully because your code doesn't follow the suggestions.

I answered your question "can it be done without cloning?" affirmatively. No more, no less.
In particular I didn't dive into the particulars of the assignment since that would have been a pretty large time suck. You could use the code as inspiration and adapt your solution to get rid of cloning.

Sorting may take time, but that's accounted for in the runtime complexity. It seems no worse asymptotically than your version, even accounting for the sorting my version does.

I answered your question "can it be done without cloning?"

I don't think it makes sense without the constraints imposed by the exercise, since, the question isn't about alternatives to cloning in general, but only in the context of the given question.

Sorting may take time, but that's accounted for in the runtime complexity. It seems no worse asymptotically than your version

I don't see how, because not even accounting for the short circuit behavior I previously explained, for a list of m candidates, the word is sorted as many times, and each candidate is sorted once. If all the candidates are anagrams and the word length is n, it takes m * n log (n) time, where as without sorting, we spend m * n time.

Anyway, thanks for taking the time.

Your cloned is going from &&str to &str, it's nothing to worry about. You can emphasize the cheapness by using copied instead (but it does the same thing in this case).

You can also almost surely just use char instead of String. If a single char maps to multiple char when lowercased, it's going to correspond to some sort of combining situation for an accent or the like. Without parsing actual graphemes and doing some sort of unicode canonicalizing, there's no good way to detect anagrams containing such characters. For example, the instructions imply sorting is a valid approach, but that would separate such an accent from its character. You're also passing all the test cases even though you're pre-filtering on str length, so they're most likely not giving you that use case anyway.

Thus you can probably just discard any inputs that have a such a lowercase result, e.g. using something like

/// Find the lowercase version of a character.
///
/// Returns `Err(())` if the lowercase of the given character consists of
/// multiple Unicode scalar values.
fn try_lowercase_char(ch: char) -> Result<char, ()> {
    match ch.is_uppercase() {
        false => Ok(ch),
        true => {
            let mut iter = ch.to_lowercase();
            let ch = iter.next().ok_or(())?;
            match iter.next() {
                None => Ok(ch),
                _ => Err(())
            }
        }
    }
}

Setting aside the problem itself, you can use Rc<str> for something that is cheap to clone and does so without additional allocation, sure. Just use Rc<str> instead of String (or char), e.g.

fn to_lowercase_if_not_already(ch: char) -> Rc<str> {
    if ch.is_uppercase() {
        ch.to_lowercase().to_string().into()
    } else {
        ch.to_string().into()
    }
}

You're right, I was able to simplify the code as follows. I verified that the same works with Rc as well.

use std::collections::{hash_map::HashMap, HashSet};

pub fn anagrams_for<'a>(word: &str, possible_anagrams: &[&'a str]) -> HashSet<&'a str> {
    let mut freq_map: HashMap<char, u32> = HashMap::new();
    let mut indices: Vec<char> = Vec::new();

    word.chars().for_each(|ch| {
        let c = to_lowercase_if_not_already(ch);
        *freq_map.entry(c).or_default() += 1;
        indices.push(c);
    });

    possible_anagrams
        .iter()
        .filter(|candidate| {
            word.len() == candidate.len() && is_anagram(&freq_map, &indices, candidate)
        })
        .copied()
        .collect()
}

fn to_lowercase_if_not_already(ch: char) -> char {
    if ch.is_uppercase() {
        ch.to_lowercase().next().unwrap()
    } else {
        ch
    }
}

fn is_anagram(freq_map: &HashMap<char, u32>, indices: &[char], candidate: &str) -> bool {
    let mut f_map: HashMap<char, u32> = HashMap::new();
    let mut score = freq_map.len();
    let mut same = true;

    for (i, ch) in candidate.chars().enumerate() {
        let c = to_lowercase_if_not_already(ch);
        if !freq_map.contains_key(&c) || f_map.get(&c) == freq_map.get(&c) {
            return false;
        }
        *f_map.entry(c).or_default() += 1;
        if f_map[&c] == freq_map[&c] {
            score -= 1;
        }
        same &= c == indices[i];
    }
    !same && (score == 0)
}

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.