Itertools group_by confusion

Hi everyone, I've decided to stop banging my head against the wall on this one.

I'm trying to use the group_by method from itertools. I thought I understood how it works, but I must be doing something wrong...

I'm iterating over some data on the form of ((x: i32, y: i32), z: i32), and I'd like to use group_by on the z element. This is where it all falls apart. Sometimes, the output is correctly grouped, but sometimes elements with the same key value end up in different groups. If you take a look at the output below you'll understand. Try running the snippet a few times in the Playground. I'm guessing you'll get various results.

So my question is: What am I doing wrong?

Edit: Reading through the docs it seems like the magic words are consecutive elements.

use std::collections::HashSet;
use itertools::Itertools;

fn main() {
    let s = vec![((0, 1), 1), ((0, 2), 2), ((1, 2), 2), ((3, 1), 1)].into_iter().collect::<HashSet<_>>();
    
    println!("{:?}\n", s);
    
    s.iter()
        .group_by(|&x| x.1)
        .into_iter()
        .map(|(k, r)| (k, r.collect::<Vec<_>>()))
        .for_each(|x| println!("Key: {:?} | Group: {:?}", x.0, x.1));
}

(Playground)

Output:

{((3, 1), 1), ((0, 2), 2), ((0, 1), 1), ((1, 2), 2)}

Key: 1 | Group: [((3, 1), 1)]
Key: 2 | Group: [((0, 2), 2)]
Key: 1 | Group: [((0, 1), 1)]
Key: 2 | Group: [((1, 2), 2)]

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.31s
     Running `target/debug/playground`

I got bit by this once a while back, probably due to my familiarity with ruby's Enumerable method of the same name.

I suspect the reason for this is that haskell's groupBy groups consecutive elements as well.

Also, restructuring the input data as a HashMap of vectors is also a pretty heavyweight operation that can't be done lazily. (so it can't offer a simple iterator adapter)

itertools does have an option for you, however: into_group_map.

EDIT: playground example.

1 Like

Related: RFC for adding group_by to rust std. I sort of regret the bike-shedding in there -- I think it'd be a great addition to the standard library, despite its name colliding with ruby's!

I'd say there's a good reason for that in itself, other than "because Haskell". A general take-it-all group_by would need to allocate and exhaust the whole iterator eagerly, in order to put the elements in a map of some sort so as to group all of them. This:

  • wastes memory if you already have pre-sorted/pre-grouped data;
  • wastes time if you only want to consume part of the iterator;
  • imposes an additional Ord or Hash constraint on the key type.

In contrast, if it works on consecutive elements, it can still operate lazily, with O(1) memory and only a K: PartialEq bound, and it's trivial to opt-in to the globally-grouping and allocating behavior by collecting the iterator into a vector then sorting it.

6 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.