Polymorphic iterator-using closures

TL;DR

How can you inject functions and closures of different types into a fixed chain of iterator methods?

Context

The broad outline looks something like this:

  1. The user specifies an arbitrary number of files from which data should be read.

  2. Data should be extracted from these files, but the user can specify, at runtime, which kind of data are to be extracted.

  3. These data should be combined into one continuous stream, comprising the data read from all the files.

  4. The data should be grouped according to a strategy that is appropriate for the choice made in step 2.

  5. These groups should be processed according to a strategy that matches the choice in step 2.

In pseudo-Rust:

let (read_strategy, group_strategy, processing_strategy) =
    get_strategies_from(cli_args);

cli_args.infiles.iter()
    .flat_map(read_strategy)
    .group_by(group_strategy).into_iter()
    .map(processing_strategy)
    .collect::<Vec<_>>();

I'm omitting details (mostly intermediate steps which update progress bars and gather statistics) which complicate my real code, but I hope that these are not directly pertinent here.

I include some working, self-contained Python code which demonstrates what I'm trying to achieve, below.

Problems

The different read-strategies will, in general, extract different types. This pushes me towards dynamic dispatch. The strategies depend on other data supplied at run-time, and I'm finding myself needing to use closures in order to get the strategies to capture these data, but I'm struggling to

  • make polymorphic iterator-consuming/returning closures,
  • declare the polymorphic types of read_strategy, group_strategy and processing_strategy,
  • get the compiler to accept their combination.

I'm trying to use impl Iterator<Item = T> but get lots of

`impl Trait` not allowed outside of function and inherent method return types
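
For illustration, here's a minimal self-contained sketch of where that error bites (the names are made up, not from my real code). `impl Trait` is accepted in function return position, but not as the declared type of a variable holding a runtime-chosen strategy, which is exactly where I want it:

struct Foo(u32);

// Fine: `impl Trait` in function return position.
fn read_foos() -> impl Iterator<Item = Foo> {
    (0..10).map(Foo)
}

fn main() {
    // Produces the error quoted above: `impl Trait` as a variable's type.
    // let strategy: impl Fn() -> impl Iterator<Item = Foo> = read_foos;

    // Compiles, because the concrete type is inferred instead:
    let strategy = read_foos;
    let total: u32 = strategy().map(|Foo(f)| f).sum();
    println!("{}", total);
}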

Outline of idea in Python

from itertools import groupby, chain
from collections import namedtuple

# Python appears to have no standard flat_map
def flat_map(func, *iterable):
    return chain.from_iterable(map(func, *iterable))

# Imitate Rust's collect
collect = tuple

# ======================================================================

# Imagine we have an arbitrary number of files from which we want to read some
# data. We represent them in-memory as a dictionary of file-name => contents
input_files = dict(file1 = range(100),
                   file2 = range(100, 200),
                   file3 = range(200, 300))

# In real life, the user would supply these on the CLI
filenames = tuple(input_files)

# ======================================================================

# There are different types of information we might extract from the file.
# Let's represent these by two different types:
Foo = namedtuple('Foo', 'f')
Bar = namedtuple('Bar', 'b')

# Alternative strategies for extracting data from the file: either Foos or Bars
def extract_foos_from_file(filename): return map(Foo, input_files[filename])
def extract_bars_from_file(filename): return map(Bar, input_files[filename])

# Strategies for grouping the data together, depending on what we extracted
def group_foos(f): return f.f // 10
def group_bars(b): return b.b // 10

# Strategies for processing the information extracted from the files
def process_foo_group(k_g): return sum(foo.f for foo in k_g[1])
def process_bar_group(k_g): return sum(bar.b for bar in k_g[1])

# ======================================================================

# The broad outline of how the data are processed remains the same, but the
# details pertaining to what data should be extracted, how they should be
# grouped, and how they should be processed can vary.
for extract, group, process in ((extract_foos_from_file, group_foos, process_foo_group),
                                (extract_bars_from_file, group_bars, process_bar_group)):
    # This sequence of operations (which I'm trying to express in Rust by
    # chaining iterator functions) is invariant: the difference lies in which
    # versions of `extract`, `group` and `process` are to be used.
    data      = flat_map(extract, filenames)
    grouped   = groupby(data, group)
    processed = map(process, grouped)
    result    = collect(processed)
    print(result)

You may get more traction if you can post an example of one of your attempts in Rust that is producing these errors.

Yes, good point. Perhaps this contains enough to illustrate it: playground.

Lines 20-23 and 26-29 illustrate the monomorphic versions working.

Uncommenting line 34 produces a first compiler error. From that starting point I get an exponentially growing tree of failed attempts.
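
(In case the link dies: the working monomorphic part looks roughly like the sketch below. This is an illustrative reconstruction, not the playground code itself.)

struct Foo(u32);

// Monomorphic: a concrete function with a concrete (if unnameable) return type.
fn read_foos(contents: &[u32]) -> impl Iterator<Item = Foo> + '_ {
    contents.iter().copied().map(Foo)
}

fn main() {
    let files: Vec<Vec<u32>> = vec![(0..100).collect(), (100..200).collect()];
    // The fixed chain compiles while everything stays monomorphic:
    let foos: Vec<Foo> = files.iter().flat_map(|f| read_foos(f)).collect();
    println!("{}", foos.len());
    // The trouble starts when `read_foos` is replaced by a runtime-chosen
    // `read_strategy` binding.
}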

You're basically running into Rust's strict typing left and right, which isn't surprising given the polymorphic nature of this approach, I guess. You can get a certain distance with generics for that example; something like

use std::fmt::Debug;

// You would probably have `PathBuf` or such, not `&'static str`
fn do_job<I, J, R, T>(files: I, reader: R)
where
    // This could be non-generic if it's always the same
    I: IntoIterator<Item=&'static str>,
    // These two are the extractor basically
    R: FnMut(&'static str) -> J,
    J: Iterator<Item=T>,
    // Properties you need from your extracted stuff
    T: Debug,
{
    let combined: Vec<_> = files.into_iter().flat_map(reader).collect();
    println!("{:?}", combined);
}

// ...
    do_job(filenames.iter().copied(), read_foos);
    do_job(filenames.iter().copied(), read_bars);

But you can't put read_foos and read_bars in the same slice, because they're different types. And you can't really hide that behind an abstraction: their outputs have intentionally different types, and that type is something you need to convey to the job runner.
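
To make that concrete, here's roughly what full type erasure buys you (the `Extractor` alias is mine, purely for illustration). The item type still appears in the erased signature, so the two readers still don't unify:

struct Foo(u32);
struct Bar(u32);

// Even fully boxed, the item type T remains part of the type:
type Extractor<T> = Box<dyn FnMut(&'static str) -> Box<dyn Iterator<Item = T>>>;

fn read_foos(_name: &'static str) -> Box<dyn Iterator<Item = Foo>> {
    Box::new((0..10).map(Foo))
}

fn read_bars(_name: &'static str) -> Box<dyn Iterator<Item = Bar>> {
    Box::new((0..10).map(Bar))
}

fn main() {
    // An Extractor<Foo> and an Extractor<Bar> are distinct types, so a Vec
    // can hold readers of one kind or the other, never a mix.
    let _foo_readers: Vec<Extractor<Foo>> = vec![Box::new(read_foos)];
    let _bar_readers: Vec<Extractor<Bar>> = vec![Box::new(read_bars)];
}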

You could perhaps store high-level jobs, as that seems to be the level of commonality in types (turn files into result):

use std::path::PathBuf;

trait Job {
    // No generics on the method: a generic method would preclude `dyn Job`.
    fn do_job(&self, files: &[PathBuf]) -> f32;
}

// `impl`s of `Job` are built up from extractors, groupers, processors
// Then you store `dyn Job`s
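
For instance (`FooJob` is hypothetical, just to make the shape concrete):

// One concrete Job per valid combination; the item types never escape the impl.
struct FooJob;

impl Job for FooJob {
    fn do_job(&self, files: &[PathBuf]) -> f32 {
        // extract -> group -> process, all monomorphic inside this impl;
        // only the final f32 crosses the trait boundary.
        files.len() as f32 // stand-in for the real pipeline
    }
}

// Then: let jobs: Vec<Box<dyn Job>> = vec![Box::new(FooJob), /* ... */];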

But you're still going to have to deal with how to build these jobs. Because the output type of one task (extractor, grouper) is the input to the next, the tasks chosen have to agree with each other within a given job. You can do this with the type system more or less directly, but then you have to handle every possible combination of steps (either the types line up, so you build the job, or they don't, so you error out at runtime). Or you could make all the steps compose with each other by having each one handle every possible input type, with the incompatible combinations detected by producing an Err() by default.
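
A minimal sketch of that second option (the `Value` enum and step signature are assumptions, not anything fixed):

// Every intermediate result is one erased type, so every step shares the
// same signature and any two steps compose; nonsense combinations surface
// as an Err at runtime instead of a type error at compile time.
#[derive(Debug)]
enum Value {
    Foos(Vec<u32>),
    Bars(Vec<u32>),
}

fn process_foos(v: Value) -> Result<Value, String> {
    match v {
        Value::Foos(fs) => Ok(Value::Foos(fs.into_iter().map(|f| f * 2).collect())),
        other => Err(format!("process_foos can't handle {:?}", other)),
    }
}

fn main() {
    println!("{:?}", process_foos(Value::Foos(vec![1, 2, 3]))); // Ok(...)
    println!("{:?}", process_foos(Value::Bars(vec![1, 2, 3]))); // Err(...)
}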

A related approach to the latter option would be to eliminate the different types by having a single type represent all the possible outputs of extraction, say. Maybe an enum or a HashMap, depending on the problem space.
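
Here's a sketch of the enum route applied to your fixed chain. All names are illustrative, the files are faked in memory as in your Python, and itertools' group_by is replaced by a hand-rolled consecutive grouping so the snippet stays self-contained:

// One enum covers everything extraction can produce, so all three
// strategies fit behind fixed, boxed signatures chosen at runtime.
enum Item {
    Foo(u32),
    Bar(u32),
}

fn main() {
    let files: Vec<Vec<u32>> = vec![(0..30).collect(), (30..60).collect()];

    let want_foos = true; // imagine this came from the CLI args

    let read_strategy: Box<dyn Fn(&Vec<u32>) -> Vec<Item>> = if want_foos {
        Box::new(|f: &Vec<u32>| f.iter().map(|&n| Item::Foo(n)).collect())
    } else {
        Box::new(|f: &Vec<u32>| f.iter().map(|&n| Item::Bar(n)).collect())
    };
    let group_strategy: Box<dyn Fn(&Item) -> u32> = Box::new(|i: &Item| match i {
        Item::Foo(n) | Item::Bar(n) => n / 10,
    });
    let processing_strategy: Box<dyn Fn(&[Item]) -> u32> = Box::new(|g: &[Item]| {
        g.iter().map(|i| match i { Item::Foo(n) | Item::Bar(n) => *n }).sum()
    });

    // Hand-rolled stand-in for itertools' group_by (groups consecutive keys):
    let mut groups: Vec<Vec<Item>> = Vec::new();
    let mut last_key = None;
    for item in files.iter().flat_map(|f| read_strategy(f)) {
        let key = group_strategy(&item);
        if last_key != Some(key) {
            groups.push(Vec::new());
            last_key = Some(key);
        }
        groups.last_mut().unwrap().push(item);
    }

    let result: Vec<u32> = groups.iter().map(|g| processing_strategy(g)).collect();
    println!("{:?}", result); // one sum per group of ten
}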


If the way the extracted types are grouped is similar enough, perhaps you could have a trait for extracted items (Groupable), and your extraction steps could produce Box<dyn Groupable>. And, similarly for the output of the grouping step (Processable).
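
Roughly this shape (`Groupable` and `group_key` are hypothetical):

// Extraction erases the concrete item type, keeping only the operations
// that the later stages need.
trait Groupable {
    fn group_key(&self) -> u32;
}

struct Foo(u32);
struct Bar(u32);

impl Groupable for Foo { fn group_key(&self) -> u32 { self.0 / 10 } }
impl Groupable for Bar { fn group_key(&self) -> u32 { self.0 / 10 } }

// Extraction then returns boxed trait objects, and the grouping step only
// ever talks to the trait:
fn read_foos(contents: &[u32]) -> Box<dyn Iterator<Item = Box<dyn Groupable>>> {
    let owned: Vec<u32> = contents.to_vec();
    Box::new(owned.into_iter().map(|n| Box::new(Foo(n)) as Box<dyn Groupable>))
}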


This is all very vague, I know. It's just hard to tell what a good approach might be without knowing more about the problem space at hand, for me anyway.
