How to create a modified version of Itertools::merge_join_by?

After @scottmcm told me of the existence of Itertools::merge_join_by, I managed to use it to simplify my current tool from 79 to 25 lines of code.

However, for optimal behaviour I'd like to extend what it does (see the link above) as follows:

  • Emit EitherOrBoth::LeftDuplicate(i) when i == i_previous, and remove i from its source iterator
  • Emit EitherOrBoth::RightDuplicate(j) when j == j_previous, and remove j from its source iterator

And if neither of those holds, continue with its default behaviour:

  • Emit EitherOrBoth::Left(i) when i < j, and remove i from its source iterator
  • Emit EitherOrBoth::Right(j) when i > j, and remove j from its source iterator
  • Emit EitherOrBoth::Both(i, j) when i == j, and remove both i and j from their respective source iterators

How would I best go about achieving that, preferably without forking the whole library?
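For illustration, the five rules above could be sketched as a small standalone adapter built only on std, so nothing in itertools needs to be forked. The names `Merged`, `MergeJoinDedup` and `merge_join_dedup` are hypothetical and not part of itertools. The sketch assumes both inputs are sorted, yield the same item type, and that items are `Clone` so an owned copy of the previous item can be kept.

```rust
use std::cmp::Ordering;
use std::iter::Peekable;

/// Hypothetical result type with the two extra duplicate variants.
#[derive(Debug, PartialEq)]
enum Merged<T> {
    Left(T),
    Right(T),
    Both(T, T),
    LeftDuplicate(T),
    RightDuplicate(T),
}

/// Hypothetical adapter over two sorted iterators of the same item type.
struct MergeJoinDedup<I: Iterator, J: Iterator, F> {
    left: Peekable<I>,
    right: Peekable<J>,
    cmp: F,
    // Owned clones of the last non-duplicate item taken from each side.
    left_prev: Option<I::Item>,
    right_prev: Option<J::Item>,
}

fn merge_join_dedup<I, J, T, F>(left: I, right: J, cmp: F) -> MergeJoinDedup<I, J, F>
where
    I: Iterator<Item = T>,
    J: Iterator<Item = T>,
    T: Clone,
    F: FnMut(&T, &T) -> Ordering,
{
    MergeJoinDedup {
        left: left.peekable(),
        right: right.peekable(),
        cmp,
        left_prev: None,
        right_prev: None,
    }
}

impl<I, J, T, F> Iterator for MergeJoinDedup<I, J, F>
where
    I: Iterator<Item = T>,
    J: Iterator<Item = T>,
    T: Clone,
    F: FnMut(&T, &T) -> Ordering,
{
    type Item = Merged<T>;

    fn next(&mut self) -> Option<Merged<T>> {
        // Duplicate checks come first: compare each head with its predecessor.
        if let (Some(i), Some(p)) = (self.left.peek(), self.left_prev.as_ref()) {
            if (self.cmp)(i, p) == Ordering::Equal {
                return self.left.next().map(Merged::LeftDuplicate);
            }
        }
        if let (Some(j), Some(p)) = (self.right.peek(), self.right_prev.as_ref()) {
            if (self.cmp)(j, p) == Ordering::Equal {
                return self.right.next().map(Merged::RightDuplicate);
            }
        }
        // Otherwise fall back to the usual merge-join decision.
        let order = match (self.left.peek(), self.right.peek()) {
            (Some(i), Some(j)) => (self.cmp)(i, j),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => return None,
        };
        match order {
            Ordering::Less => {
                let i = self.left.next()?;
                self.left_prev = Some(i.clone());
                Some(Merged::Left(i))
            }
            Ordering::Greater => {
                let j = self.right.next()?;
                self.right_prev = Some(j.clone());
                Some(Merged::Right(j))
            }
            Ordering::Equal => {
                let (i, j) = (self.left.next()?, self.right.next()?);
                self.left_prev = Some(i.clone());
                self.right_prev = Some(j.clone());
                Some(Merged::Both(i, j))
            }
        }
    }
}

/// Small demonstration on two sorted lists that contain repeats.
fn demo() -> Vec<Merged<&'static str>> {
    let a = vec!["a", "a", "c"];
    let b = vec!["a", "b", "b"];
    merge_join_dedup(a.into_iter(), b.into_iter(), |x, y| x.cmp(y)).collect()
}
```

Cloning whole items is the simplest choice here but not the cheapest; for long lines it may be preferable to store only the part that is actually compared.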

I also suspect that keeping references to i_previous and j_previous would be hard to get past the borrow checker, since emitting i or j transfers ownership to the consumer of the iterator.
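One possible way around that ownership concern is to keep an owned copy of just the comparison key instead of a reference to the previous item. The sketch below uses a hypothetical helper, `tag_duplicates`, that marks each item of a sorted iterator as a duplicate when its key equals the previous key. It assumes the key is cheap to produce as an owned value (for example a 32-character prefix copied into a `String`).

```rust
/// Hypothetical helper: pairs every item with a flag that is `true` when its
/// key equals the key of the item directly before it.
fn tag_duplicates<I, K, F>(iter: I, mut key: F) -> impl Iterator<Item = (bool, I::Item)>
where
    I: Iterator,
    K: PartialEq,
    F: FnMut(&I::Item) -> K,
{
    // The previous key is owned, so no borrow of an already emitted item is held.
    let mut prev: Option<K> = None;
    iter.map(move |item| {
        let k = key(&item);
        let is_dup = prev.as_ref() == Some(&k);
        prev = Some(k);
        (is_dup, item)
    })
}

/// Small demonstration: the key is the first two characters of each line.
fn tagged_demo() -> Vec<(bool, &'static str)> {
    let lines = vec!["aa 1", "aa 2", "bb 3"];
    tag_duplicates(lines.into_iter(), |s| s[..2].to_string()).collect()
}
```

Because the flag travels with the item, each source could be tagged this way before merging, and the consumer would no longer need to track previous values itself.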

Because of that ownership problem with the previous values, I think the better solution is to implement this logic in the consumer of the merged iterator. Here's my implementation of that idea:

use std::cmp::Ordering;
use std::io::{BufReader, BufRead, Write};
use std::fs::File;
use itertools::{Itertools, EitherOrBoth::{Left, Right, Both}};

const PATH_A: &str = "md5sums1_sorted.csv";
const PATH_B: &str = "md5sums2_sorted.csv";

fn main() -> std::io::Result<()> {
    println!("targeted task: read two md5lists and find out which files are unique to each one, and which are shared\n");
    let mut only_a = File::create("only_in_list_A.csv")?;
    let mut only_b = File::create("only_in_list_B.csv")?;
    let mut shared = File::create("shared_files.csv")?;
    
    let reader_a = BufReader::new(File::open(PATH_A)?).lines().map(|l| l.unwrap());
    let reader_b = BufReader::new(File::open(PATH_B)?).lines().map(|l| l.unwrap());

    let mut a_prev = "_____________________________".to_string();
    let mut b_prev = a_prev.clone();
    
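    // Both inputs must already be sorted by their MD5 prefix.
    for elem in reader_a.merge_join_by(reader_b, |a, b| md5line_cmp(a, b)) {
        match elem {
            // Same checksum in both lists.
            Both(a, b) => {
                // Propagate write errors instead of silently discarding them.
                writeln!(shared, "{a} = {b}")?;
                a_prev = a;
                b_prev = b;
            }
            // Only in list A, unless it repeats the previous line from A.
            Left(a) => {
                if md5line_cmp(&a, &a_prev) != Ordering::Equal {
                    writeln!(only_a, "{a}")?;
                } else {
                    println!("list_A contained duplicate: {a}");
                }
                a_prev = a;
            }
            // Only in list B, unless it repeats the previous line from B.
            Right(b) => {
                if md5line_cmp(&b, &b_prev) != Ordering::Equal {
                    writeln!(only_b, "{b}")?;
                } else {
                    println!("list_B contained duplicate: {b}");
                }
                b_prev = b;
            }
        }
    }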
    Ok(())
} 

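fn md5line_cmp(a: &str, b: &str) -> Ordering {
    // Compare only the 32-character MD5 prefix. `get` falls back to the whole
    // line when it is shorter than that, instead of panicking on the slice.
    fn key(s: &str) -> &str { s.get(..32).unwrap_or(s) }
    key(a).cmp(key(b))
}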

If anybody has a better idea / feedback for my code, I'd love to read about it! :slight_smile:
