How to create a modified version of Itertools::merge_join_by?

babiro2026 · July 15, 2022, 7:57am

After @scottmcm told me of the existence of Itertools::merge_join_by, I managed to use it to simplify my current tool from 79 to 25 lines of code.

However, for optimal behaviour I'd like to extend what it does (see the link above) as follows:

Emit EitherOrBoth::LeftDuplicate(i) when i == i_previous, and remove i from its source iterator
Emit EitherOrBoth::RightDuplicate(j) when j == j_previous, and remove j from its source iterator

And if neither of those holds true, continue with its default behaviour:

Emit EitherOrBoth::Left(i) when i < j, and remove i from its source iterator
Emit EitherOrBoth::Right(j) when i > j, and remove j from its source iterator
Emit EitherOrBoth::Both(i, j) when i == j, and remove both i and j from their respective source iterators

How would I best go about achieving that, preferably without forking the whole library?

Also, I guess keeping references to i_previous and j_previous would be difficult to accomplish with the borrow checker, since emitting i / j transfers ownership to the consumer of the iterator.
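One way to get the behaviour listed above without forking itertools might be a small standalone adaptor built on std's Peekable. The following is only a sketch under assumptions: the names Merged, MergeJoinDedup and merge_join_dedup are hypothetical and not part of itertools, and the ownership problem is sidestepped by requiring Item: Clone and storing a clone of the last emitted value from each side.

```rust
use std::cmp::Ordering;
use std::iter::Peekable;

/// Hypothetical output type; itertools' `EitherOrBoth` has no duplicate variants.
#[derive(Debug, PartialEq)]
enum Merged<T> {
    Left(T),
    Right(T),
    Both(T, T),
    LeftDuplicate(T),
    RightDuplicate(T),
}

/// Hypothetical adaptor: merges two sorted iterators and flags consecutive duplicates.
struct MergeJoinDedup<I: Iterator, J: Iterator<Item = I::Item>, F> {
    left: Peekable<I>,
    right: Peekable<J>,
    // Clones of the last non-duplicate items taken from each side.
    prev_left: Option<I::Item>,
    prev_right: Option<I::Item>,
    cmp: F,
}

impl<I, J, F> Iterator for MergeJoinDedup<I, J, F>
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: Clone,
    F: FnMut(&I::Item, &I::Item) -> Ordering,
{
    type Item = Merged<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Duplicate checks take priority over the regular merge step.
        if let (Some(i), Some(p)) = (self.left.peek(), self.prev_left.as_ref()) {
            if (self.cmp)(i, p) == Ordering::Equal {
                return self.left.next().map(Merged::LeftDuplicate);
            }
        }
        if let (Some(j), Some(p)) = (self.right.peek(), self.prev_right.as_ref()) {
            if (self.cmp)(j, p) == Ordering::Equal {
                return self.right.next().map(Merged::RightDuplicate);
            }
        }
        // Regular merge-join step; an exhausted side always loses.
        let order = match (self.left.peek(), self.right.peek()) {
            (Some(i), Some(j)) => (self.cmp)(i, j),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => return None,
        };
        match order {
            Ordering::Less => {
                let i = self.left.next()?;
                self.prev_left = Some(i.clone());
                Some(Merged::Left(i))
            }
            Ordering::Greater => {
                let j = self.right.next()?;
                self.prev_right = Some(j.clone());
                Some(Merged::Right(j))
            }
            Ordering::Equal => {
                let (i, j) = (self.left.next()?, self.right.next()?);
                self.prev_left = Some(i.clone());
                self.prev_right = Some(j.clone());
                Some(Merged::Both(i, j))
            }
        }
    }
}

/// Hypothetical constructor; both inputs are assumed to be sorted according to `cmp`.
fn merge_join_dedup<I, J, F>(
    left: I,
    right: J,
    cmp: F,
) -> MergeJoinDedup<I::IntoIter, J::IntoIter, F>
where
    I: IntoIterator,
    J: IntoIterator<Item = I::Item>,
    F: FnMut(&I::Item, &I::Item) -> Ordering,
{
    MergeJoinDedup {
        left: left.into_iter().peekable(),
        right: right.into_iter().peekable(),
        prev_left: None,
        prev_right: None,
        cmp,
    }
}
```

Called as merge_join_dedup(vec![1, 1, 2, 4], vec![2, 2, 3], |a: &i32, b: &i32| a.cmp(b)), this should yield Left(1), LeftDuplicate(1), Both(2, 2), RightDuplicate(2), Right(3), Left(4). The clone per emitted item is the price for not holding references into values that have already been handed to the consumer; storing only a cloned key instead of the whole item would be a cheaper variation.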

babiro2026 · July 15, 2022, 8:38am

Especially because of the problem with the references to the previous values and their ownership, I guess the better solution is to implement this logic in the consumer of the merged iterator. Here's my implementation of that idea:

use std::cmp::Ordering;
use std::io::{BufReader, BufRead, Write};
use std::fs::File;
use itertools::{Itertools, EitherOrBoth::{Left, Right, Both}};

const PATH_A: &str = "md5sums1_sorted.csv";
const PATH_B: &str = "md5sums2_sorted.csv";

fn main() -> std::io::Result<()> {
    println!("targeted task: read two md5lists and find out which files are unique to each one, and which are shared\n");
    let mut only_a = File::create("only_in_list_A.csv")?;
    let mut only_b = File::create("only_in_list_B.csv")?;
    let mut shared = File::create("shared_files.csv")?;
    
    let reader_a = BufReader::new(File::open(PATH_A)?).lines().map(|l| l.unwrap());
    let reader_b = BufReader::new(File::open(PATH_B)?).lines().map(|l| l.unwrap());

    let mut a_prev = "_____________________________".to_string();
    let mut b_prev = a_prev.clone();
    
    for elem in reader_a.merge_join_by(reader_b, |a, b| md5line_cmp(a, b)) {
        match elem {
            // Hash present in both lists.
            Both(a, b) => {
                writeln!(shared, "{a} = {b}")?;
                a_prev = a;
                b_prev = b;
            }
            // Hash only in list A, unless it repeats the previous A entry.
            Left(a) => {
                if md5line_cmp(&a, &a_prev) != Ordering::Equal {
                    writeln!(only_a, "{a}")?;
                } else {
                    println!("list_A contained duplicate: {a}");
                }
                a_prev = a;
            }
            // Hash only in list B, unless it repeats the previous B entry.
            Right(b) => {
                if md5line_cmp(&b, &b_prev) != Ordering::Equal {
                    writeln!(only_b, "{b}")?;
                } else {
                    println!("list_B contained duplicate: {b}");
                }
                b_prev = b;
            }
        }
    }
    Ok(())
} 

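// Compares two lines by their leading 32-character MD5 hex digest.
// Lines shorter than 32 bytes are compared whole instead of panicking.
fn md5line_cmp(a: &str, b: &str) -> Ordering {
    fn key(s: &str) -> &str {
        s.get(..32).unwrap_or(s)
    }
    key(a).cmp(key(b))
}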

If anybody has a better idea / feedback for my code, I'd love to read about it!
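One possible alternative, sketched here under assumptions rather than as a recommendation: filter consecutive duplicates out of each sorted input before the merge, so that the match arms for Left, Right and Both no longer need any "previous" bookkeeping. The helper name drop_consecutive_duplicates and its callback parameters are hypothetical; the key closure returns an owned key (for the md5 lists, for example the first 32 characters as a String) so nothing borrows from an item that has already been passed on.

```rust
/// Hypothetical helper: drops items whose key equals the key of the
/// previously kept item, calling `report` for every dropped item.
/// The input is assumed to be sorted by that key.
fn drop_consecutive_duplicates<I, K, F, R>(
    iter: I,
    mut key: F,
    mut report: R,
) -> impl Iterator<Item = I::Item>
where
    I: Iterator,
    K: PartialEq,
    F: FnMut(&I::Item) -> K,
    R: FnMut(&I::Item),
{
    // Key of the last item that was let through.
    let mut prev: Option<K> = None;
    iter.filter(move |item| {
        let k = key(item);
        if prev.as_ref() == Some(&k) {
            report(item);
            false
        } else {
            prev = Some(k);
            true
        }
    })
}
```

With this, reader_a and reader_b could each be wrapped before being passed to merge_join_by, with the report callback printing the "contained duplicate" message. Note that the semantics differ slightly from the code above: a line that repeats in both lists is reported once per list and then appears only once in the shared output.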

system · October 13, 2022, 8:38am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
