Hi all,
I'm playing around with Rust, enjoying some bits and having trouble with other bits. I show up on IRC pretty frequently asking for help, and people have been kind enough to help me get through the various compilation errors.
One of the toy examples I'm trying to create is a wordcount example. This reads text, splits it on whitespace, and then counts the number of times each token appears. I'm having trouble getting it to work using iterators, and I'm wondering if I'm just not understanding this stuff well enough, or if iterating over str/String (and tuples thereof) really is more painful than it needs to be.
Some background: the simple wordcount basically runs the following function to fill a BTreeMap mapping each token to a u32 count:
```
type CountMap = BTreeMap<String, u32>;

fn process_file(path: &Path) -> CountMap {
    let file = File::open(&path).unwrap();
    let rdr = BufReader::new(file);
    let mut mapped: CountMap = BTreeMap::new();
    for line in rdr.lines() {
        match line {
            Ok(line_) => {
                for token in line_.split(char::is_whitespace) {
                    // Filter out the empty strings produced by runs of whitespace.
                    if !token.is_empty() {
                        *mapped.entry(token.to_owned()).or_insert(0) += 1;
                    }
                }
            }
            Err(e) => {
                println!("Error reading file: {}", e);
                panic!("Error!");
            }
        }
    }
    mapped
}
```
(Aside: one thing to note here is that splitting on whitespace yields zero-length tokens between consecutive spaces. I can see this being OK for a CSV file where `1,2,,4` genuinely has an empty field, but with whitespace I don't think anyone would expect `1 2<2 spaces>4` to parse as 4 tokens.)
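If I'm reading the docs right, `str::split_whitespace` already collapses runs of whitespace, so the empty-token filter may be avoidable entirely (assuming it has landed on the channel you're using); a minimal sketch of the difference:

```rust
fn main() {
    // split(char::is_whitespace) keeps the empty token between adjacent spaces...
    let raw: Vec<&str> = "1 2  4".split(char::is_whitespace).collect();
    assert_eq!(raw, vec!["1", "2", "", "4"]);

    // ...while split_whitespace collapses runs of whitespace instead.
    let clean: Vec<&str> = "1 2  4".split_whitespace().collect();
    assert_eq!(clean, vec!["1", "2", "4"]);
}
```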
The simple case was easy to implement, and I was really happy with the performance. But I wanted to try reading the input on multiple threads and aggregating the values. So I wrote most of a version that runs over multiple threads to read a Hadoop-style split file (a directory with multiple file parts called `part-00000`, `part-00001`, etc.). This version runs a scoped thread for each file part and then uses a [merging iterator adapter](http://bluss.github.io/rust-itertools/doc/itertools/trait.Itertools.html#method.merge_by) to print the results without having to copy all the values into one large tree (in the hope of making a single, more efficient pass).
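The merge step in isolation looks something like this; a std-only stand-in for itertools' `merge_by` (written just for illustration, with made-up inputs) that merges two streams already sorted by key:

```rust
// Merge two iterators that are already sorted by key, yielding items in key order.
// A std-only stand-in for itertools' merge_by, for illustration only.
fn merge_sorted<I>(a: I, b: I) -> Vec<(String, u32)>
where
    I: Iterator<Item = (String, u32)>,
{
    let mut a = a.peekable();
    let mut b = b.peekable();
    let mut out = Vec::new();
    loop {
        // Decide which side's head comes first; ties go to the left stream.
        let take_a = match (a.peek(), b.peek()) {
            (Some(x), Some(y)) => x.0 <= y.0,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        if take_a {
            out.push(a.next().unwrap());
        } else {
            out.push(b.next().unwrap());
        }
    }
    out
}

fn main() {
    let left = vec![("apple".to_owned(), 1), ("cherry".to_owned(), 2)];
    let right = vec![("banana".to_owned(), 3), ("cherry".to_owned(), 4)];
    let merged = merge_sorted(left.into_iter(), right.into_iter());
    assert_eq!(
        merged,
        vec![
            ("apple".to_owned(), 1),
            ("banana".to_owned(), 3),
            ("cherry".to_owned(), 2),
            ("cherry".to_owned(), 4),
        ]
    );
}
```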
However, I've had a lot of difficulty iterating over the String keys in the BTreeMap's key/value tuples; here's my attempt:
```
#![feature(path_ext)]
#![feature(core)]

extern crate itertools;
extern crate core;

use std::borrow::ToOwned;
use std::collections::BTreeMap;
use std::fs::{File, read_dir, PathExt, DirEntry};
use std::io::{BufRead, BufReader};
use std::iter::Iterator;
use std::path::Path;
use std::thread;

use itertools::Itertools;

type CountMap = BTreeMap<String, u32>;

fn process_file(path: &Path) -> CountMap {
    println!("mapping: {}", path.to_str().unwrap_or(""));
    let file = File::open(&path).unwrap();
    let rdr = BufReader::new(file);
    let mut mapped: CountMap = BTreeMap::new();
    for line in rdr.lines() {
        match line {
            Ok(line_) => {
                for token in line_.split(char::is_whitespace) {
                    // Filter out the empty strings produced by runs of whitespace.
                    if !token.is_empty() {
                        *mapped.entry(token.to_owned()).or_insert(0) += 1;
                    }
                }
            }
            Err(e) => {
                println!("Error reading file: {}", e);
                panic!("Error!");
            }
        }
    }
    mapped
}

fn print_counts<I, K, V>(counts: I)
    where I: Iterator<Item = (K, V)>,
          K: ::std::fmt::Display,
          V: ::std::fmt::Display
{
    for (key, value) in counts {
        println!("{}\t{}", key, value);
    }
}

fn main() {
    let args: Vec<String> = ::std::env::args().collect();
    if args.len() < 2 {
        println!("Usage: wordcount <infile|indir> <outfile>");
        return;
    }
    let filename = &args[1];
    let path = ::std::path::Path::new(filename);
    if path.is_file() {
        println!("Just a file");
        let mapped = process_file(path);
        print_counts(mapped.iter());
    } else if path.is_dir() {
        println!("Partitions call for threads");
        let mut guards = vec![];
        for entry in read_dir(path).unwrap() {
            let p: DirEntry = entry.unwrap();
            let guard = thread::scoped(move || process_file(p.path().as_path()));
            guards.push(guard);
        }
        let counts = guards.into_iter().map(|x| x.join()).collect::<Vec<CountMap>>();

        println!("shuffle/reduce.");
        let mut counts_iter = counts.into_iter();
        let zero: Box<Iterator<Item = (String, u32)>> =
            Box::new(counts_iter.next().unwrap().into_iter());
        let merged_counts: Box<Iterator<Item = (String, u32)>> =
            counts_iter.fold(zero, |acc, item| {
                Box::new(acc.merge_by(item.into_iter(),
                                      |a: &(String, u32), b: &(String, u32)| a.0.cmp(&b.0)))
            });
        // `ref` avoids trying to move the String key out of the borrowed tuple.
        let grouped = merged_counts.group_by(|&(ref a, _)| a.clone());
        let aggregated = grouped.map(|(k, v)| (k, v.iter().map(|&(_, b)| b).sum::<u32>()));
        print_counts(aggregated);
    }
}
```
So my question is basically: am I doing something wrong to make the iterator expressions so baroque? It's an unfair comparison, but the equivalent in Scala using Spark is just:
```
val file = sc.textFile("my-input-file")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("my-output-file")
```
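For reference, the tightest single-threaded Rust version I can come up with (a std-only sketch; the `Cursor` over a literal is a stand-in for `BufReader::new(File::open(path))` so it runs without an input file) is:

```rust
use std::collections::BTreeMap;
use std::io::{BufRead, Cursor};

fn main() {
    // Stand-in input; swap in BufReader::new(File::open(path).unwrap()) for real use.
    let rdr = Cursor::new("the quick fox\nthe lazy the");
    let mut counts: BTreeMap<String, u32> = BTreeMap::new();
    for line in rdr.lines() {
        for word in line.unwrap().split_whitespace() {
            *counts.entry(word.to_owned()).or_insert(0) += 1;
        }
    }
    for (word, count) in &counts {
        println!("{}\t{}", word, count);
    }
}
```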
I've seen a lot of ink spilled over the ergonomics of APIs, but this doesn't feel terribly ergonomic. When I stick to `Vec<u32>`s, as in most of the tests and examples, things generally work very well.
Anyway, thanks for any help here, and if you have examples of how to golf the code, I think it would be instructive not only for me but also for others who come to the forum to read up on how experienced people would tune up beginners' code.