How to count multiple filters of same iterator?

I have an iterator impl Iterator<Item=T> and two filters:

fn filter1<T, F>(iter: impl Iterator<Item=&T>) -> impl Iterator<Item=F> {}
fn filter2<T, G>(iter: impl Iterator<Item=&T>) -> impl Iterator<Item=G> {}

How can I count the two filters with iter iterated exactly once?

If you mean "count how many items filter1 and filter2 each produce": it is not possible without going to costly lengths, such as:

  • collecting iter into a temporary Vec<&T>
  • spawn two scoped threads, each of which runs one of the filters against a channel-receiver iterator, then iterate the original iter and send its items to both channels

It will be much better if you can write the filters as standalone functions (or closures if they need to be stateful) that do not capture the iterator.

Is there something theoretical or a counterexample to show this is impossible? The filters are provided by external crates and are not supposed to be modified.

Iterating the base iterator using the first filter will consume the base iterator. So you can't feed the base iterator into the second second filter, since the base iterator has been consumed.

You may be asking for something you haven't described fully. It may help to give a concrete example of the input and outputs.

1 Like

The interface is awkward for this use-case.
If we assumes that these filter functions actually behave as simple stateless filters, then we can drive the iteration ourselves and call them repeatedly with single-item iterators:

    let mut count1 = 0;
    let mut count2 = 0;
    
    for val in input {
        if filter1([&val].into_iter()).next().is_some() {
            count1 += 1;
        }
        
        if filter2([&val].into_iter()).next().is_some() {
            count2 += 1;
        }
    }

    println!("{} {}", count1, count2);

That's a big assumption though. What if filter1 actually returns more items than in its input? Or if filter2 is supposed to keep track of previously seen items in order to decide which items to return and which not? (The output of the above code would just be non-sense.)

2 Likes

@drivel @jumpnbrownweasel Thank you for your reply. :slight_smile:

In fact, I ask this problem as a generic question instead of a concrete situation. In my understanding, if filter1 and filter2 are white box, we can always rewrite the code to a for-loop to meet the requirement (more generally, they may not even be "filter"), no matter whether it outputs more elements or tracks previous items (I really cannot come up with a counterexample), as it really consumes an iterator of &T instead of T.

The "consumption" explanation cannot convince me enough since it is an intuitive explanation instead of a logic explanation for me (sorry for being stubborn).

Since the white-box re-write can fix the problem, I wonder if there is a more formal transform-based way to act on a black-box iter-to-iter func?

Actually, on second thought, there is something you can in principle do. You could make a “one into two iterators” adapter that presented itself as being two iterators that produce the same items. Internally, it would call next() as needed on the underlying iterator. But — what happens if filter1 calls next() three times while filter2 is doing nothing? Then this adapter has consumed three items from the original iter, and those three items need to be presented to filter2 the next time filter2 calls next(). There are two choices for how to handle this:

  • The adapter would have to buffer the unequally-consumed items. This is possible, but in the worst case it has cost equivalent to just collect()ing the original iterator and iterating over it twice.
  • The adapter would have to block inside next() until each item is consumed by both outputs. This will deadlock unless you spawned two threads for the two counting operations.

Note also that many iterators which produce references are cheap to create or even clone()able, so "iterating twice" is likely to be an efficient, allocation-free solution.

There is a code transformation that can work: coroutines/continuation-passing. Once you have applied that transform you can express the “spawn two scoped threads” option I mentioned up above without actually spawning any threads. However, Rust doesn't allow you to take a function that already exists and apply that transform — you'd have to modify the filter1 and filter2 code to make it async code that operates on a Stream instead of an Iterator.

So, yes, there is a transform that works at the abstract level of “what is computationally possible”. But it is not a transform you can apply in your specific context “the filters are provided by external crates and are not supposed to be modified”.

4 Likes

Zero idea as to what's not "logical" enough about that. If you consume an iterator, it won't yield more elements, so any counting that takes place after that would yield zero, which is always wrong (unless the iterator was empty in the first place).

Thank you. I will check the "continuation" concept. :slight_smile:

Another big-picture explanation is: Iterator is a pull-based interface. That is, the owner of an Iterator gets to choose when items are taken from it. Thus, the conflict is between filter1 and filter2 each having their own plans for when to take items from their underlying iter; at each point in the overall process, either filter1's code is running and filter2 gets no say, or filter2's code is running and filter1 gets no say.

Threads and coroutines break the "gets no say" by allowing filter1 and filter2 to be running concurrently. Defining the filters in a different way than fn(Iterator) -> Iterator breaks the “gets no say” by letting some outer code decide when each of the filters gets additional input, rather than delegating full control to the filters.

1 Like

That's actually trivial:

count1 += filter1([&val]).count();

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.