Group_by challenge

One of the small, but tricky pieces of code I occasionally have to write is a version of Python's groupby function.

This function takes a sequence and splits it into runs based on a particular key function. In the simplest case, it just splits the sequence into the runs of equal elements: [1, 1, 1, 2, 3, 3, 2] -> [[1, 1, 1], [2], [3, 3], [2]].

I always have a hard time implementing it manually, because it has this annoying problem when you forget to deal with the last group, and then have to duplicate the code for finishing the group (see, for example, this code from rust-analyzer where I've just macro-ed away the duplication).

I wonder if folks here could suggest nice solutions!

Here's the signature and some tests:

pub fn group_by<I, F, K, T>(xs: I, key_fn: F) -> Vec<(K, Vec<T>)>
where
    I: IntoIterator<Item = T>,
    F: Fn(&T) -> K,
    K: Eq,
{
    todo!("make the tests pass")
}

#[test]
fn tests() {
    assert_eq!(
        group_by(0..5, |&x| x % 3 == 0),
        vec![
            (true, vec![0]),
            (false, vec![1, 2]),
            (true, vec![3]),
            (false, vec![4])
        ],
    );
    assert_eq!(
        group_by(0..5, |&x| x % 3 == 1),
        vec![
            (false, vec![0]),
            (true, vec![1]),
            (false, vec![2, 3]),
            (true, vec![4])
        ],
    );
    assert_eq!(
        group_by(0..5, |&x| x % 3 == 0),
        vec![
            (true, vec![0]),
            (false, vec![1, 2]),
            (true, vec![3]),
            (false, vec![4])
        ],
    );

    assert_eq!(
        group_by(0..5, |&x| x),
        vec![
            (0, vec![0]),
            (1, vec![1]),
            (2, vec![2]),
            (3, vec![3]),
            (4, vec![4])
        ],
    );

    assert_eq!(group_by(0..5, |_| ()), vec![((), vec![0, 1, 2, 3, 4])]);

    assert_eq!(group_by(0..0, |_| ()), vec![]);
}

Note that this deliberately doesn't try to produce an iterator, as that requires some extra legwork.

I've also made a git-clonnable repo for convenience: https://github.com/matklad/group-by-challenge

1 Like

Here are two approaches:

pub fn group_by<I, F, K, T>(xs: I, key_fn: F) -> Vec<(K, Vec<T>)>
where
    I: IntoIterator<Item = T>,
    F: Fn(&T) -> K,
    K: Eq,
{
    let mut groups = Vec::new();
    let mut iter = xs.into_iter();
    let mut current: (K, Vec<T>) = match iter.next() {
        Some(item) => {
            let key = key_fn(&item);
            (key, vec![item])
        }
        None => return groups,
    };
    for item in iter {
        let key = key_fn(&item);
        if &current.0 == &key {
            current.1.push(item);
        } else {
            groups.push(replace(&mut current, (key, vec![item])));
        }
    }
    groups.push(current);
    groups
}

Or by using peek:

pub fn group_by<I, F, K, T>(xs: I, key_fn: F) -> Vec<(K, Vec<T>)>
where
    I: IntoIterator<Item = T>,
    F: Fn(&T) -> K,
    K: Eq,
{
    let mut iter = xs.into_iter().peekable();
    let mut groups = Vec::new();
    while let Some(group_first) = iter.next() {
        let key = key_fn(&group_first);
        let mut group = vec![group_first];
        while iter.peek().map(|peek| key_fn(peek) == key).unwrap_or(false) {
            group.push(iter.next().unwrap());
        }
        groups.push((key, group));
    }
    groups
}
1 Like

Itertools does have a lazy group_by.

Mine is a naive, procedural implementation, like a human would do it by hand (playground). I also changed the type of your key function from Fn to FnMut as it's usually advised to use FnMut for greater flexibility unless you need Fn.

pub fn group_by<I, F, K, T>(xs: I, mut key_fn: F) -> Vec<(K, Vec<T>)>
where
    I: IntoIterator<Item = T>,
    F: FnMut(&T) -> K,
    K: Eq,
{
    let mut xs = xs.into_iter();
    let x_prev = match xs.next() {
        None => return Vec::new(),
        Some(x) => x,
    };

    let mut outer = Vec::new();
    let mut k_prev = key_fn(&x_prev);
    let mut inner = vec![x_prev];

    for x_next in xs {
        let k_next = key_fn(&x_next);
        
        if k_next == k_prev {
            inner.push(x_next)
        } else {
            outer.push((k_prev, inner));
            inner = vec![x_next];
            k_prev = k_next;
        }
    }

    outer.push((k_prev, inner));
    
    outer
}

I refactored my first suggestion a bit:

pub fn group_by<I, F, K, T>(xs: I, mut key_fn: F) -> Vec<(K, Vec<T>)>
where
    I: IntoIterator<Item = T>,
    F: FnMut(&T) -> K,
    K: Eq,
{
    let mut groups = Vec::new();
    let mut iter = xs.into_iter().map(move |item| {
        let key = key_fn(&item);
        (key, item)
    });
    if let Some((key, item)) = iter.next() {
        let mut current = (key, vec![item]);
        for (key, item) in iter {
            if &current.0 == &key {
                current.1.push(item);
            } else {
                groups.push(current);
                current = (key, vec![item]);
            }
        }
        groups.push(current);
    }
    groups
}

playground

I found a rather cute way to do it by using Option::filter.

pub fn group_by<I, F, K, T>(xs: I, mut key_fn: F) -> Vec<(K, Vec<T>)>
where
    I: IntoIterator<Item = T>,
    F: FnMut(&T) -> K,
    K: Eq,
{
    let mut groups = Vec::<(K, Vec<T>)>::new();
    for item in xs {
        let key = key_fn(&item);
        let last = groups.last_mut();
        if let Some((_, group)) = last.filter(|(k,_)| k == &key) {
            group.push(item);
        } else {
            groups.push((key, vec![item]));
        }
    }
    groups
}

playground

This is probably my favorite solution so far.

8 Likes

Oh, I like this! This solution indeed does not duplicate calls to key_fn or groups.push.

Adding the group first was the core idea I was missing!

Refactored the real code in rust-analyzer: https://github.com/rust-analyzer/rust-analyzer/pull/2952