TLDR
How can you inject functions and closures of different types into a fixed chain of iterator methods?
Context
The broad outline looks something like this:
1. The user specifies an arbitrary number of files from which data should be read.
2. Data should be extracted from these files, but the user can specify, at runtime, which kind of data are to be extracted.
3. These data should be combined into one continuous stream, comprising the data read from all the files.
4. The data should be grouped according to a strategy that is appropriate for the choice made in step 2.
5. These groups should be processed according to a strategy that matches the choice in step 2.
In pseudo-Rust:
let (read_strategy, group_strategy, processing_strategy) =
    get_strategies_from(cli_args);

cli_args.infiles.iter()
    .flat_map(read_strategy)
    .group_by(group_strategy).into_iter()
    .map(processing_strategy)
    .collect::<Vec<_>>();
I'm omitting details (mostly intermediate steps which update progress bars and gather statistics) which complicate my real code, but I hope that these are not directly pertinent here.
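To make the shape of the chain concrete, here is a minimal sketch of it as a generic function, assuming the three strategies are fixed at compile time. It does not address the actual question of choosing them at run time. The names `run_pipeline`, `group_consecutive` and `Foo` are hypothetical, files are modelled as in-memory values, and the grouping helper is a hand-rolled, std-only stand-in for itertools' `group_by`.

```rust
// Hypothetical stand-in for one kind of extracted data.
#[derive(Debug)]
struct Foo {
    f: u32,
}

/// Group consecutive items that share a key
/// (std-only stand-in for itertools' `group_by`).
fn group_consecutive<T, K, It, G>(items: It, mut key_of: G) -> Vec<(K, Vec<T>)>
where
    It: Iterator<Item = T>,
    K: PartialEq,
    G: FnMut(&T) -> K,
{
    let mut groups: Vec<(K, Vec<T>)> = Vec::new();
    for item in items {
        let key = key_of(&item);
        match groups.last_mut() {
            // Same key as the previous item: extend the current group.
            Some((last_key, members)) if *last_key == key => members.push(item),
            // First item, or the key changed: start a new group.
            _ => groups.push((key, vec![item])),
        }
    }
    groups
}

/// The invariant chain: read every file, flatten, group, process, collect.
/// Each distinct combination of strategies gets its own monomorphized copy.
fn run_pipeline<F, T, K, R, I, Read, Group, Process>(
    infiles: &[F],
    read: Read,
    group: Group,
    process: Process,
) -> Vec<R>
where
    Read: FnMut(&F) -> I,
    I: Iterator<Item = T>,
    Group: FnMut(&T) -> K,
    K: PartialEq,
    Process: FnMut((K, Vec<T>)) -> R,
{
    let data = infiles.iter().flat_map(read);
    group_consecutive(data, group)
        .into_iter()
        .map(process)
        .collect()
}
```

Calling `run_pipeline` with closures that build `Foo`s, key them by `foo.f / 10`, and sum each group mirrors the `Foo` branch of the Python outline. The difficulty described in this post starts when the three closures have to come out of a single run-time decision, because each choice yields a different `T`.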
I include some working, self-contained Python code which demonstrates what I'm trying to achieve, below.
Problems
The different read-strategies will, in general, extract different types. This pushes me towards dynamic dispatch. The strategies depend on other data supplied at run-time, and I'm finding myself needing to use closures in order to get the strategies to capture these data, but I'm struggling to
- make polymorphic iterator-consuming/returning closures,
- declare the polymorphic types of read_strategy, group_strategy and processing_strategy,
- get the compiler to accept their combination.
I'm trying to use `impl Iterator<Item = T>`, but get lots of errors like
`impl Trait` not allowed outside of function and inherent method return types
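As an illustration of where this error tends to come from, the sketch below shows a declaration of the kind that is rejected (commented out so the snippet compiles) next to a boxed trait-object spelling that the compiler does accept. The alias `ReadStrategy`, the function `make_reader` and the `offset` parameter are hypothetical, and file contents are modelled as a `&[u32]` slice purely for the example.

```rust
// Rejected: `impl Trait` cannot be used as the type of a `let` binding
// (nor in a closure's signature), which produces the error quoted above.
// let data: impl Iterator<Item = u32> = vec![1, 2, 3].into_iter();

// Accepted: a nameable type built from boxed trait objects.
// (Hypothetical alias; `T` is the item type a given strategy extracts.)
type ReadStrategy<T> = Box<dyn Fn(&[u32]) -> Box<dyn Iterator<Item = T>>>;

// A strategy that captures run-time data (`offset`) in a closure.
fn make_reader(offset: u32) -> ReadStrategy<u32> {
    Box::new(move |contents: &[u32]| {
        // Copy the contents so the returned iterator owns its data.
        let owned: Vec<u32> = contents.to_vec();
        Box::new(owned.into_iter().map(move |x| x + offset))
            as Box<dyn Iterator<Item = u32>>
    })
}
```

Boxing gives the closure and its returned iterator a type that can be written down and stored, but `ReadStrategy<Foo>` and `ReadStrategy<Bar>` are still distinct types, so this alone does not let one variable hold either strategy.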
Outline of idea in Python
from itertools import groupby, chain
from collections import namedtuple
# Python appears to have no standard flat_map
def flat_map(func, *iterable):
    return chain.from_iterable(map(func, *iterable))
# Imitate Rust's collect
collect = tuple
# ======================================================================
# Imagine we have an arbitrary number of files from which we want to read some
# data. We represent them in-memory as a dictionary of file-name => contents
input_files = dict(file1 = range(100),
                   file2 = range(100, 200),
                   file3 = range(200, 300))
# In real life, the user would supply these on the CLI
filenames = tuple(input_files)
# ======================================================================
# There are different types of information we might extract from the file.
# Let's represent these by two different types:
Foo = namedtuple('Foo', 'f')
Bar = namedtuple('Bar', 'b')
# Alternative strategies for extracting data from the file: either Foos or Bars
def extract_foos_from_file(filename): return map(Foo, input_files[filename])
def extract_bars_from_file(filename): return map(Bar, input_files[filename])
# Strategies for grouping the data together, depending on what we extracted
def group_foos(f): return f.f // 10
def group_bars(b): return b.b // 10
# Strategies for processing the information extracted from the files
def process_foo_group(k_g): return sum(foo.f for foo in k_g[1])
def process_bar_group(k_g): return sum(bar.b for bar in k_g[1])
# ======================================================================
# The broad outline of how the data are processed remains the same, but the
# details pertaining to what data should be extracted, how they should be
# grouped, and how they should be processed can vary.
for extract, group, process in ((extract_foos_from_file, group_foos, process_foo_group),
                                (extract_bars_from_file, group_bars, process_bar_group)):
    # This sequence of operations (which I'm trying to express in Rust by
    # chaining iterator functions) is invariant: the difference lies in which
    # versions of `extract`, `group` and `process` are to be used.
    data = flat_map(extract, filenames)
    grouped = groupby(data, group)
    processed = map(process, grouped)
    result = collect(processed)
    print(result)