However, for optimal behaviour I'd like to extend what it does (see the link above) as follows:
Emit EitherOrBoth::LeftDuplicate(i) when i == i_previous, and remove i from its source iterator
Emit EitherOrBoth::RightDuplicate(j) when j == j_previous, and remove j from its source iterator
If neither of those holds, continue with its default behaviour:
Emit EitherOrBoth::Left(i) when i < j, and remove i from its source iterator
Emit EitherOrBoth::Right(j) when i > j, and remove j from its source iterator
Emit EitherOrBoth::Both(i, j) when i == j, and remove both i and j from their respective source iterators
How would I best go about achieving that, preferably without forking the whole library?
I also suspect that keeping references to i_previous and j_previous would be hard to get past the borrow checker, since emitting i / j transfers ownership to the consumer of the iterator.
Because of this ownership problem with the previous values, I think the better solution is to implement the logic in the consumer of the merged iterator. Here's my implementation of that idea:
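To make the five rules above concrete, here is a minimal standalone sketch of such an adapter using only the standard library. It sidesteps the ownership concern by storing clones of the last emitted values instead of references. All names here (`Merged`, `DedupMerge`, `dedup_merge`) are made up for illustration and are not part of itertools, and the sketch assumes items that are `Ord + Clone` rather than taking a comparison closure.

```rust
use std::cmp::Ordering;
use std::iter::Peekable;

/// Hypothetical five-way result mirroring the variants proposed above.
#[derive(Debug, PartialEq)]
enum Merged<T> {
    LeftDuplicate(T),
    RightDuplicate(T),
    Left(T),
    Right(T),
    Both(T, T),
}

/// Hypothetical adapter over two sorted iterators. It keeps *clones* of the
/// last non-duplicate values, so no reference outlives an emitted item.
struct DedupMerge<I: Iterator, J: Iterator<Item = I::Item>> {
    left: Peekable<I>,
    right: Peekable<J>,
    prev_left: Option<I::Item>,
    prev_right: Option<I::Item>,
}

fn dedup_merge<I, J>(left: I, right: J) -> DedupMerge<I::IntoIter, J::IntoIter>
where
    I: IntoIterator,
    J: IntoIterator<Item = I::Item>,
    I::Item: Ord + Clone,
{
    DedupMerge {
        left: left.into_iter().peekable(),
        right: right.into_iter().peekable(),
        prev_left: None,
        prev_right: None,
    }
}

impl<I, J> Iterator for DedupMerge<I, J>
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: Ord + Clone,
{
    type Item = Merged<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Duplicate rules first: the next item repeats the previous one
        // taken from the same side.
        if self.left.peek().is_some() && self.left.peek() == self.prev_left.as_ref() {
            return self.left.next().map(Merged::LeftDuplicate);
        }
        if self.right.peek().is_some() && self.right.peek() == self.prev_right.as_ref() {
            return self.right.next().map(Merged::RightDuplicate);
        }
        // Otherwise fall back to the usual merge-join comparison.
        let order = match (self.left.peek(), self.right.peek()) {
            (Some(i), Some(j)) => i.cmp(j),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => return None,
        };
        match order {
            Ordering::Less => {
                let i = self.left.next()?;
                self.prev_left = Some(i.clone());
                Some(Merged::Left(i))
            }
            Ordering::Greater => {
                let j = self.right.next()?;
                self.prev_right = Some(j.clone());
                Some(Merged::Right(j))
            }
            Ordering::Equal => {
                let i = self.left.next()?;
                let j = self.right.next()?;
                self.prev_left = Some(i.clone());
                self.prev_right = Some(j.clone());
                Some(Merged::Both(i, j))
            }
        }
    }
}
```

With these assumptions, `dedup_merge(vec![1, 2, 2, 4], vec![2, 3, 3])` yields `Left(1)`, `Both(2, 2)`, `LeftDuplicate(2)`, `Right(3)`, `RightDuplicate(3)`, `Left(4)`. The cost of this approach is one clone per non-duplicate item, which may or may not be acceptable for long lines.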
use std::cmp::Ordering;
use std::io::{BufReader, BufRead, Write};
use std::fs::File;
use itertools::{Itertools, EitherOrBoth::{Left, Right, Both}};

const PATH_A: &str = "md5sums1_sorted.csv";
const PATH_B: &str = "md5sums2_sorted.csv";

fn main() -> std::io::Result<()> {
    println!("targeted task: read two md5lists and find out which files are unique to each one, and which are shared\n");

    let mut only_a = File::create("only_in_list_A.csv")?;
    let mut only_b = File::create("only_in_list_B.csv")?;
    let mut shared = File::create("shared_files.csv")?;

    // Both inputs must already be sorted by their MD5 column.
    let reader_a = BufReader::new(File::open(PATH_A)?).lines().map(|l| l.unwrap());
    let reader_b = BufReader::new(File::open(PATH_B)?).lines().map(|l| l.unwrap());

    // Sentinel for "no previous line yet": 32 underscores can never equal a hex
    // MD5 digest and are long enough for the [0..32] slice in md5line_cmp.
    let mut a_prev = "_".repeat(32);
    let mut b_prev = a_prev.clone();

    for elem in reader_a.merge_join_by(reader_b, |a, b| md5line_cmp(a, b)) {
        // Write errors are deliberately ignored via `_ = ...`.
        match elem {
            Both(a, b) => {
                _ = writeln!(shared, "{a} = {b}");
                a_prev = a;
                b_prev = b;
            }
            Left(a) => {
                // A Left item equal to the previous A line is a duplicate within list A.
                if md5line_cmp(&a, &a_prev) != Ordering::Equal {
                    _ = writeln!(only_a, "{a}");
                } else {
                    println!("list_A contained duplicate: {a}")
                }
                a_prev = a;
            }
            Right(b) => {
                // A Right item equal to the previous B line is a duplicate within list B.
                if md5line_cmp(&b, &b_prev) != Ordering::Equal {
                    _ = writeln!(only_b, "{b}");
                } else {
                    println!("list_B contained duplicate: {b}")
                }
                b_prev = b;
            }
        }
    }
    Ok(())
}

// Compares two lines by their first 32 bytes (the hex MD5 digest).
// Panics if either line is shorter than 32 bytes.
fn md5line_cmp(a: &str, b: &str) -> Ordering {
    a[0..32].cmp(&b[0..32])
}
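One thing to note about `md5line_cmp`: the `[0..32]` slice panics on any line shorter than 32 bytes, for example an empty trailing line. A possible lenient variant, sketched here with the hypothetical names `md5_key` and `md5line_cmp_lenient`, uses `str::get` and falls back to the whole line instead of panicking. Whether falling back is the right behaviour for malformed lines is an assumption; skipping or reporting them might fit better.

```rust
use std::cmp::Ordering;

/// Returns the first 32 bytes of a line (the hex MD5 digest), or the whole
/// line if it is shorter than that or byte 32 is not a char boundary.
fn md5_key(line: &str) -> &str {
    line.get(..32).unwrap_or(line)
}

/// Hypothetical non-panicking counterpart to `md5line_cmp`.
fn md5line_cmp_lenient(a: &str, b: &str) -> Ordering {
    md5_key(a).cmp(md5_key(b))
}
```

For well-formed lines this compares exactly the same bytes as the original function, so it could be passed to `merge_join_by` in the same way.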
If anybody has a better idea or feedback on my code, I'd love to read it!