Is there an itertools fn for this?

zeroexcuses · October 15, 2022, 3:17am

Basically we want to take a list, without changing the order of the elements, chop it up into groups, where each group has size <= some specified value.

We can assume that all individual items <= that specified value.

Iterative code would look something like the following. I am wondering if there is a cleaner itertools approach.

pub struct Foo {
  ...
  weight: usize;
}

pub fn sorta_group_by(n: bucket_size, in: Vec<Foo>) -> Vec<Vec<Foo>> {
  let mut ans = vec![];
  let mut cur_bucket = vec![];
  let mut cur_size = 0;

  for x in in.into_iter() {
    if cur_size + x.weight >= n {
      ans.push(cur_bucket);
      cur_bucket = vec![];
      cur_size = 0;
    }
    cur_size += x.weight;
    cur_bucket.push(x);
  }

  if cur_bucket.len() > 0 {
    ans.push(cur_bucket);
  }

  ans
}

Aiden2207 · October 15, 2022, 4:04am

The best itertools can do is a stateful group_by:

pub fn sorta_group_by(n: usize, input: Vec<Foo>) -> Vec<Vec<Foo>> {
    let mut cur_size = 0;
    let mut counter = 0;
    input
        .into_iter()
        .group_by(|foo| {
            debug_assert!(foo.weight < n); //optional sanity check
            if foo.weight + cur_size >= n {
                counter += 1;
                cur_size = 0;
            } else {
                cur_size += foo.weight;
            }
            counter
        })
        .into_iter()
        .map(|(_, iter)| iter.collect())
        .collect()
}

Playground

8573 · October 15, 2022, 4:06am

That sounds like the chunks method on slices.

zeroexcuses · October 15, 2022, 4:16am

I think my use of the word 'size' confused you; each item has a weight; we want the sum of weights of each group to be <= some specified value.

quinedot · October 15, 2022, 4:33am

Seems simple enough to DIY?

zeroexcuses · October 15, 2022, 2:34pm

@quinedot : I know how to use traits + closures, but a bit confused when it comes to mixing it with custom written iterators. Could you take this one more step and demonstrate how to write this if the usage of group_by_weight was ?

group_by_weight(|x| x.weight)

I.e. instead of hard coding a struct Foo, we want to support arbitrary T but then have a closure Fn which maps the &T to a usize weight.

Thaniks!

zeroexcuses · October 15, 2022, 2:38pm

Also, is

impl<T: ?Sized + AsRef<[Foo]>> GroupBy for T {

a bit overly strict? Can we generalize to any iterator over T, rather than require AsRef<[Foo]>

quinedot · October 15, 2022, 6:17pm

Sure, here's this one (still restricted to slices).

This would require some careful documentation though... it calls the function on some items more than once. This could be avoided by storing an Option<usize> to indicate the value of the overflow item with enough care. OK, that wasn't too hard either. I'll leave the other version too in case it's helpful to see the progression from the less generic version onward.

Can we generalize to any iterator over T

It's easy to return subslices from a slice, but for an arbitrary iterator, we're going to have to store the dynamically sized (dynamic number of returned values per iteration) somewhere. Your OP used Vecs anyway, so probably this is acceptable to you.

Here's a version for that. In this one I did take care to only call the closure on each item once from the start; it was actually more natural here as we just call the closure as we consume the inner iterator.

Now that I've done it, just looking at what I called it -- collect_by_sum -- this could be generalized even further to return arbitrary collections ala collect, at the cost of inference. But I'll stop here.

Let me know if you want me to walk through anything in more detail.

zeroexcuses · October 16, 2022, 12:01am

Thanks for the code. I thought I was proficient at Rust, but there's no way I can just write new iterators like that off the top of my head.

Not to pedantically nitpick, but I have a few questions about design choices you made to see if I misunderstood something:

EDIT: I'm wrong on point 1. Did not see the storage.push.
In the last line of the following, Why do we use FnMut instead of Fn here:

impl<I: Iterator> IteratorExt for I {
    fn collect_by_sum<F>(self, limit: usize, func: F) -> CollectBySum<Self, <Self as Iterator>::Item, F>
    where
        F: FnMut(&<Self as Iterator>::Item) -> usize

My intuition here is that we should only ever pass closures that are "read-only"; I don't see why we are allowing closures that mutate the item.

Because of:

impl<Iter, Item, F> Iterator for CollectBySum<Iter, Item, F>
where
    Iter: Iterator<Item = Item>,
    F: FnMut(&Item) -> usize,
{
    type Item = Vec<Item>;
    fn next(&mut self) -> Option<Self::Item> {

you code is actually an improvement over mine, in that yours runs "lazily" right? In that it ever only builds the groups/chunks that are needed (whereas mine brute forces all groups/chunks, even if we only need the first one.)

Thanks!

zeroexcuses · October 16, 2022, 12:07am

I'm re-reverting on point 1. Overflow can be just Option<(Item, usize)> right? The following appears to work for me:

struct Collect_By_Sum<Iter, Item, F> {
    limit: usize,
    inner: Iter,
    func: F,
    overflow: Option<(Item, usize)>,
}

Then we also make this change here:

    fn next(&mut self) -> Option<Self::Item> {
        let (mut storage, mut sum) = match self.overflow.take() {
            None => (Vec::new(), 0),
            Some((i, n)) => (vec![i], n),};

It seems that the overflow at most ever stores 1 Item.

zeroexcuses · October 16, 2022, 12:16am

@quinedot This is now a bit of a soft question. I am trying to reverse engineer why you could so easily translate my original code to an iterator, whereas I had a mental block. For you, was the process something like this:

We define the output iterator, so we have:

pub struct Collect_By_Sum {}

We have it implement Iterator, now we have:

pub struct Collect_By_Sum {}
impl <Iter, Item> Iterator for Collect_By_Sum where
  Iter: Iterator<Item = Item>
{
  type Item = ...;
  pub fn next(&mut self) -> Option<Self::Item> {}
}

We also need to pass in a closure, so

pub struct Collect_By_Sum {}
impl <Iter, Item, F> Iterator for Collect_By_Sum where
  Iter: Iterator<Item = Item>
  F: Fn(&Item) -> usize
{
    type Item = Vec<Item>;
    pub fn next(&mut self) -> Option<Self::Item> {}
}

and now we take the original imperative code I had, and make it compute each output of the solution lazily ?

cuviper · October 16, 2022, 12:36am

zeroexcuses:

In the last line of the following, Why do we use FnMut instead of Fn here:
impl<I: Iterator> IteratorExt for I {
    fn collect_by_sum<F>(self, limit: usize, func: F) -> CollectBySum<Self, <Self as Iterator>::Item, F>
    where
        F: FnMut(&<Self as Iterator>::Item) -> usize
My intuition here is that we should only ever pass closures that are "read-only"; I don't see why we are allowing closures that mutate the item.

FnMut allows that closure to modify its own captured state, in case it has a counter or something. It still only receives an immutable reference to the item, &Item, so it can't modify that.

zeroexcuses · October 16, 2022, 12:49am

Thanks, that clarifies a lot. I was confusing FnMut (the closure type) with &Item (the ref type the closure received).

quinedot · October 16, 2022, 7:43pm

Just a quick note that I haven't forgotten this thread; I'll write something up when I have enough contiguous free time at a computer.

zeroexcuses · October 16, 2022, 8:47pm

No worries. Thank you for all your hard work in writing the original 3-4 iterator variations. I learned quite a bit in studying your final implementation.

quinedot · October 20, 2022, 8:39am

So, it sounds like you haven't had much experience writing custom iterators. It seems to intimidate people for some reason, but I also think Rust makes it pretty easy.

Mechanical start for any iterator

Start with a skeleton:

struct MyIter;
impl Iterator for MyIter {}

Copy, paste, and adust from the error output:

struct MyIter;
impl Iterator for MyIter {
    type Item = String; // or whatever; presumably you know what you want
    fn next(&mut self) -> Option<<Self as Iterator>::Item> {
        todo!()
    }
}

And now "all" you have to do is supply the state fields and logic. (And once you've written a few, you'll have memorized the pattern and can just slam out the framework without consulting the compiler errors.)

Probably the most challenging part is getting in the habit of maintaining state so you can return one item at a time.

Using a slice for your state is usually pretty straight-forward as a slice is sort of a super-iterator (easy to split off elements or subslices, random access, et cetera), and that's what I started with. The state is just remaining elements, and the return was some prefix of the slice. I half-ignored your OP here really, by returning borrowed values versus an owned Vec<Vec<_>>. The core of the idea was really "it's straightforward to split a slice based on the sum exceeding some bound".

        let mut sum = 0;
        match self.remaining.iter().position(|foo| {
            sum += foo.weight;
            sum > self.limit
        }) {

Then in this post, I addressed your additional constraints. I'll recreate the process as best as I remember.

First I iterated on my original playground to handle the arbitrary type with a "getter" closure. Clearly, Foo would have to be replaced by a type parameter, and we'd need another for the closure too (if I was going to keep using the slice as my state, which would be the most direct change).

struct GroupBySum<'a, T, F> {
    limit: usize,
    remaining: &'a [T],
    func: F,
}

Adjusting the Iterator implementation was pretty much again a mechanical exercise that resulted in the implementation that calls the closure multiple times on overflow elements. It didn't even occur to me on the first pass. The state is still "slice of unreturned elements".

I used FnMut as it's the most flexible for callers, like someone else covered above. You rarely need the flexibility of Fn in my experience, and this allows callers to use stateful closures without resorting to interior mutability.

Next I moved on to the "make this work for any iterator, not just slices" constraint. I found it clear from the start that a Vec would be needed; that's pretty much what comes out of "collect into groups of dynamically determined sizes (our return Item) from an element-by-element iterator". So I started with

struct CollectBySum<Iter, F> {
    limit: usize,
    inner: Iter, // replaces the slice
    func: F,
}

And because of the nature of the problem -- take items until overflow and then group everything except the overflow item -- my first thought was to use a peekable iterator. I think I started writing that even, but that's when I realized that I would be calling the closure on the overflow element more than once. I could have continued anyway by just storing the overflow size, but decided it would be more straight-forward to just forego peeking and store the element myself as well.

struct CollectBySum<Iter, Item, F> {
    // ...
    overflow: Option<(Vec<Item>, usize)>,
}

I think the reason I went with Vec<Item> is it fell out naturally from writing

        // Start with the overflow from the last call, or with nothing
        let (mut storage, mut sum) = self
            .overflow
            .take()
            .unwrap_or_else(|| (Vec::new(), 0));

Without the slice, position and slice splitting isn't an option, so I wrote the for loop and moved the "empty or not tail" check to the end.

Then I went back and modified the slice version to not call the closure more than once on each element by following the same concept of take-ing the overflow or not using slices and an offset of 1 or 0.

Aside from backtracking on Peekable and realizing I'd made an arguable goof by calling the closure on the same element more than once initially, I think the biggest stumbles were getting my generic parameters declared in the right spots. I remembering getting an "unconstrained type parameter" on some attempt but not which one or where (Iterator implementation or helper / IteratorExt trait).

Now to try and connect this back to your process question. The process I went through was iterative (heh), so it was more like "replace &[Foo] with &[T]", and then "replace &[T] with Iter: Iterator". But if I had started with the ending constraints, it probably would have gone mostly like you describe. Many custom iterators do actually wrap another iterator, so it's pretty common to have

struct MyIter<Iter> {
    inner: Iter,
}

impl<Iter: Iterator> Iterator for MyIter<Iter> {
    type Item = SomethingInvolving<Iter::Item>;
    fn next(&mut self) -> Option<Self::Item> { todo!() }
}

Or perhaps

impl<Iter: Iterator<Item = ConcreteType>> Iterator for MyIter<Iter> {
    type Item = SomethingInvolving<ConcreteType>;
    fn next(&mut self) -> Option<Self::Item> { todo!() }
}

(And this common pattern of wrapping an inner iterator is also why I likened slice-based versions to having a "super-iterator" for state.)

So my first stab probably would have looked like this.

What do I need for state?
- The limit
- An inner iterator
- A closure to get a usize out of an &Iter::Item
That gives the struct (leave off the bounds)
What bounds do I need on the implementation?
- Iter: Iterator
- F: FnMut(&Iter::Item) -> usize

When moving on to managing the state in next, the need for dealing with the overflow item without returning it would inevitably come up. With the direction I took, Item ended up as a parameter to CollectBySum and that changed the bounds a bit. If I had gone with Peekable and only stored a usize for overflow, I don't think the bounds would have changed. Or I could have added bounds to the struct definition and used Iter::Item instead of type parameter, but I guess I'm in the habit of avoiding bounds on structs strongly enough that I don't think it occurred to me at the time.

These adjustments from the initial skeleton are more general Rust adjustments than Iterator specific adjustments, IMO.

Once I understood the goal, I didn't consider or try to adapt your imperative code, I just thought of how to achieve calculation of each item (group within a given sum). Again, this might be the biggest challenge of getting the hang of iterators -- thinking in terms of maintaining state so you can generate one item at a time.

That said, in this case the end result wasn't really that far off from your imperative code, looking back at it.

system · January 18, 2023, 8:40am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Getting data back out of itertools group_by method help	3	660	January 12, 2023
Using an Iterator to group structured data	3	835	July 7, 2019
Iterator: combine skip & take	7	1638	December 10, 2019
Iterating over custom chunks help	3	216	November 13, 2023
How to breakup an iterator into chunks? help	9	1708	April 20, 2023

Is there an itertools fn for this?

Related Topics