Iterators over for..in

I have this function that counts frequencies of case-insensitive-letters of some text.

fn frequency(text: &[&str]) -> HashMap<char, usize> {
    let mut freq = HashMap::new();
    for line in text {
        for ch in line.to_lowercase().chars() {
            if ch.is_alphabetic() {
                freq.entry(ch).and_modify(|cnt| *cnt += 1).or_insert(1);
            }
        }
    }
    freq
}

I want to rewrite this in iterator style. So I (try to) do

    text.into_iter()
        .flat_map(|line| line.to_lowercase().chars()) // bug!
        .filter(|ch| ch.is_alphabetic())
        .fold(HashMap::new(), |mut freq, ch| {
            freq.entry(ch).and_modify(|cnt| *cnt += 1).or_insert(1);
            freq
        })

This does not compile because of a reference to temporary value in the closure to flat_map! Here's one that does compile

    text.into_iter()
        .fold(HashMap::new(), |freq, line| {
            line.to_lowercase()
                .chars()
                .filter(|ch| ch.is_alphabetic())
                .fold(freq, |mut freq, ch| {
                    freq.entry(ch).and_modify(|cnt| *cnt += 1).or_insert(1);
                    freq
                })
            })

but this performs significantly slower than the original loop version.

There were so many claims that iterators would outperform hand rolled loops. Am I missing an angle here?

The reason that your first attempt resulted in a compiler error is because to_lowercase() actually returns an owned String. So when you returned an iterator of chars to the string in the flat_map, it would mean that the String returned from to_lowercase() would be free, as the function has ended and the owner has not been transfer.
This will result in a dangling pointer, hence the compiler error.

But regarding your second problem, I did a test on the rust playground, and found that the iterator version is actually faster by around 10%. So the first question is, are you running it with the --release flag with cargo run --release ?

But really, if you want this to be faster, I would look into using char::to_lowercase() or char::to_ascii_lowercase(), as you can avoid that allocation, which is where most of the time is taking. The following example runs around 2x the allocated version in the example within the playground.

fn frequency_notalloc_iter(text: &[&str]) -> HashMap<char, usize> {
    text.into_iter()
        .flat_map(|line| line.chars())
        .map(|ch| ch.to_ascii_lowercase())
        .filter(|ch| ch.is_ascii_lowercase())
        .fold(HashMap::new(), |mut freq, ch| {
            freq.entry(ch).and_modify(|cnt| *cnt += 1).or_insert(1);
            freq
        })
}
3 Likes

My instinct, which may be overkill in this case, is to define a frequency-counting map type. I haven't tested the performance, though.

use std::collections::HashMap;
use std::hash::Hash;
use std::iter::FromIterator;

fn frequency(text: &[&str]) -> FreqMap<char> {
    text.iter().copied()
        .flat_map(str::chars)
        .flat_map(char::to_lowercase)
        .filter(|c| c.is_alphabetic())
        .collect()
}

pub struct FreqMap<T>(HashMap<T,usize>);

impl<T:Hash+Eq> FreqMap<T> {
    pub fn get(&self, t:&T)->usize {
        self.0.get(t).cloned().unwrap_or(0)
    }
}

impl<T> Default for FreqMap<T> {
    fn default()->Self { FreqMap(Default::default()) }
}

impl<T> FromIterator<T> for FreqMap<T> where Self: Extend<T> {
    fn from_iter<I>(iter: I) -> Self where
        I: IntoIterator<Item = T>,
    {
        let mut result:Self = Default::default();
        result.extend(iter);
        result
    }
}

impl<T:Hash+Eq> Extend<T> for FreqMap<T> {
    fn extend<I>(&mut self, iter: I) where
        I: IntoIterator<Item = T>,
    {
        for t in iter {
            *self.0.entry(t).or_default() += 1;
        }
    }
}

impl<'a, T:Hash+Eq+Clone+'a> Extend<&'a T> for FreqMap<T> {
    fn extend<I>(&mut self, iter: I) where
        I: IntoIterator<Item = &'a T>,
    {
        // TODO: Rewrite to avoid unnecessary clones
        self.extend(iter.into_iter().cloned());
    }
}

(Playground)

1 Like

Opps, indeed.
I've fixed my example.

1 Like

Yes I understand the bug. It was included to lead into the next solution. I'm not for performance per se, but rather astonished that I couldn't write an iterator solution that should have outperformed a same-logic-loop-solution.

Your playground time assessments are not accurate. Had you swapped the order you would get different results.

Thank you. This is nice and all. But does not address the issue. How could I write an iterator solution that captures the same logic and outperforms the loop solution? I Specifically called str::to_lowercase not char::to_lowercase.

It's not necessarily possible. There's nothing magic about iterators, and Rust's for loops use them under the hood anyway. As far as I understand it, the way most iterator-based solutions gain performance is by short-circuiting unnecessary operations; I don't see how that could apply to this problem.

If any significant work is happening in the loop body (Allocating on the heap, for instance), the performance difference between iterators and loops is likely to be lost in the noise.

3 Likes

Iterators, loops, and recursion are all mathematically equivalent, and can be transformed into each other as optimizations. The choice of representation has no inherent performance value, it's just notation. And on that front, once you start using non-trivial folds (or custom adapters), you're really stretching the limits of where iterators are considered a good notation.

1 Like