Looking for feedback on my 80 lines of code

Hi there!

I've written a little tool (that I'd normally have written as a bash script, but since I wanted to use the opportunity to improve my rudimentary rust skills have written in rust) which reads two sorted text files line by line (where each line starts with 32 characters that represent an md5hash of a file) and outputs 3 lists:

  1. of files only contained in list1
  2. of files only contained in list2
  3. of files contained in both lists.

I managed to get to a working solution. What I'm looking for now in general is feedback on how I could have done better, and in particular how I could adapt my solution to adhere to the DRY principle (Don't Repeat Yourself) - since lines 25-36 equal 65-76 in all but indentation.

Here's my code so far:

use std::io::{BufReader, BufRead, Write, self};
use std::fs::{File};

const PATH1: &str = "md5sums1_sorted.csv";
const PATH2: &str = "md5sums2_sorted.csv";

fn main() -> io::Result<()> {
    println!("targeted task: read two md5lists and find out which files are unique to each one, and which are shared\n");
    
    let mut only1 = File::create("only_in_list1.csv")?;
    let mut only2 = File::create("only_in_list2.csv")?;
    let mut shared = File::create("shared_files.csv")?;
    
    let file1 = File::open(PATH1)?;
    let file2 = File::open(PATH2)?;
    let reader1 = BufReader::new(file1);
    let mut reader2 = BufReader::new(file2);
    let mut line2 = String::new();
    let mut line2_old: String;
    let mut line2_matched = false;
    reader2.read_line(&mut line2)?;
    for line1o in reader1.lines() {
        let line1 = line1o?;
        println!("loop over file1, comparing lines: {} : {}", line1, line2);
        if line2.len() > 32 && line1[0..32].eq(&line2[0..32]) {
            println!("contained in both files (1): {line1} = {line2}");
            _=write!(shared, "{line1} = {line2}");
            line2_matched = true;
        }
        else if line2.len() < 32 || line1[0..32].lt(&line2[0..32]) {
            if line2.len() > 32 {
                println!("{} < {}", &line1[0..32], &line2[0..32]);
            }
            println!("only in file1 (1): {line1}\n");
            _=writeln!(only1, "{line1}");
        }
        else {
            println!("{} > {}", &line1[0..32], &line2[0..32]);
            if !line2_matched {
                println!("only in file2 (1): {line2}");
                _=write!(only2, "{line2}");
            }
            else {
                line2_matched = false;
            }

            loop {
                line2_old = line2;
                line2 = String::new();
                let read = reader2.read_line(&mut line2)?;
                println!("read new line2 = {line2}");
                if read != 0 && line2.len() > 32 && line2[0..32].lt(&line1[0..32]) {
                    if ! line2_old[0..32].eq(&line2[0..32]) {
                        println!("only in file2 (2): {line2}");
                        _=write!(only2, "{line2}");
                    }
                    else {
                        println!("file2 contained a duplicate: {line2_old} = {line2}")
                    }
                }
                else {
                    break;
                }
            }
            if line2.len() > 32 && line1[0..32].eq(&line2[0..32]) {
                println!("contained in both files (2): {line1} = {line2}");
                _=write!(shared, "{line1} = {line2}");
                line2_matched = true;
            }
            else if line2.len() < 32 || line1[0..32].lt(&line2[0..32]) {
                if line2.len() > 32 {
                    println!("{} < {}", &line1[0..32], &line2[0..32]);
                }
                println!("only in file1 (2): {line1}\n");
                _=writeln!(only1, "{line1}");
            }
            println!("\n\nend of loop - is anything missing here?\n\n");
        }
    }
    Ok(())
}

p.s.: In the unlikely case that anybody finds above code useful for any purpose, you're free to copy / modify / republish it wherever you like, please just reference where you found it and my username.

p.p.s.: My question for general feedback is not meant as an invitation to scorn me for my abhorrently bad error handling, this was never meant to be more than a "quick and dirty" solution. I've only added checks for line2 being at least 32 characters long since the way I'm reading file2 gives me an empty string after I've finished reading it.

Okey, since it didn't stop bugging me and I had the time to work on it some more, I managed to come up with a solution that's much DRYer than before, even though it's just 2 lines shorter:

use std::io::{BufReader, BufRead, Write, self};
use std::fs::{File};

const PATH1: &str = "md5sums1_sorted.csv";
const PATH2: &str = "md5sums2_sorted.csv";

fn main() -> io::Result<()> {
    println!("targeted task: read two md5lists and find out which files are unique to each one, and which are shared\n");
    
    let mut only1 = File::create("only_in_list1.csv")?;
    let mut only2 = File::create("only_in_list2.csv")?;
    let mut shared = File::create("shared_files.csv")?;
    
    let file1 = File::open(PATH1)?;
    let file2 = File::open(PATH2)?;
    let reader1 = BufReader::new(file1);
    let mut reader2 = BufReader::new(file2);
    let mut line2 = String::new();
    let mut line2_old: String;
    let mut line2_matched = false;
    reader2.read_line(&mut line2)?;
    for line1o in reader1.lines() {
        let line1 = line1o?;
        println!("loop over file1, comparing lines: {} : {}", line1, line2);
        let handled = handle_equal_and_only_file1(&line1, &line2, &mut shared, &mut only1);
        line2_matched = line2_matched || handled.1;
        if !handled.0 {
            println!("{} > {}", &line1[0..32], &line2[0..32]);
            if !line2_matched {
                println!("only in file2 (1): {line2}");
                _=write!(only2, "{line2}");
            }
            else {
                line2_matched = false;
            }

            loop {
                line2_old = line2;
                line2 = String::new();
                let read = reader2.read_line(&mut line2)?;
                println!("read new line2 = {line2}");
                if read != 0 && line2.len() > 32 && line2[0..32].lt(&line1[0..32]) {
                    if ! line2_old[0..32].eq(&line2[0..32]) {
                        println!("only in file2 (2): {line2}");
                        _=write!(only2, "{line2}");
                    }
                    else {
                        println!("file2 contained a duplicate: {line2_old} = {line2}")
                    }
                }
                else {
                    break;
                }
            }
            let handled = handle_equal_and_only_file1(&line1, &line2, &mut shared, &mut only1);
            line2_matched = line2_matched || handled.1;
            
            println!("\n\nend of loop - is anything missing here?\n\n");
        }
    }
    Ok(())
}

fn handle_equal_and_only_file1(line1: &str, line2: &str, shared: &mut File, only1: &mut File) -> (bool /* line1 handled */, bool /* line2 handled */) {
    if line2.len() > 32 && line1[0..32].eq(&line2[0..32]) {
        println!("contained in both files (1): {line1} = {line2}");
        _=write!(shared, "{line1} = {line2}");
        return (true, true);
    }
    else if line2.len() < 32 || line1[0..32].lt(&line2[0..32]) {
        if line2.len() > 32 {
            println!("{} < {}", &line1[0..32], &line2[0..32]);
        }
        println!("only in file1 (1): {line1}\n");
        _=writeln!(only1, "{line1}");
        return (true, false);
    }
    return (false, false);
}

But I'd still be very interested in feedback on how I could have done better, like maybe performance wise, or readability wise, or anything else you can think of that I could learn from.

I single loop would be less confusing.
Splits into two steps write and advance.
enum makes a great state register in recording stage advance should read from next.

All the 32s are ugly and discourage reading so can hide bugs.

What I would say is that it still looks like a translation from a shell script.

I would suggest that you try completely starting from scratch and structuring it around the fact that you can have complex data structures and algorithms in rust as separate things.

A couple of ideas, from which you could pick your favourite parts to try:

  • You could use Itertools::merge_join_by to implement the comparison logic since your files are sorted already, since that already does the "merge sorted things and give out EitherOrBoth::{Left, Right, Both} accordingly".
  • Your filename is csv, so you could try using the CSV — Rust parser // Lib.rs crate.
  • You could write a function to read a file as a HashMap in std::collections - Rust, call that twice, then compare the keys from those maps.
1 Like

I would probably collect everything in one Map<Hash, EnumFile1orFile2orBoth> (or Map<Hash, Vec<Provenance>> to be able to have more than one file)

Thanks everyone for your many suggestions! I love how responsive and helpful the rust community is! :smiley:

@erelde - I've purposefully written the tool in a way, that it doesn't need as much memory as the two md5sum-lists are big, but to just keep as much of the data in memory as it needs to answer the current question.

@jonh - do you think it would look better if I'd extract the 32 (and the 0?) into constants? That would definitely not make the code shorter, quite the contrary.

I don't know how I could get rid of the inner loop to just have a single one? Could you elaborate?

@scottmcm - for the purpose of this tool parsing the csv seemed overkill to me. Only the first 32 chars of each line are needed to answer the question.

About it looking like a bash script - I noticed that too. I guess it was my mindset when writing it. I'm not too sure how to prevent this.
I understand that rust can have complex data structures, but I don't see how that could have been useful for the task at hand.

I will definitely have a look at Itertools::merge_join_by - thank you for that reference!

UPDATE, oh my good this is beautiful, thank you!!!

use std::io::{BufReader, BufRead, Write, self};
use std::fs::File;
use itertools::{Itertools, EitherOrBoth::{Left, Right, Both}};

const PATH_A: &str = "md5sums1_sorted.csv";
const PATH_B: &str = "md5sums2_sorted.csv";

fn main() -> io::Result<()> {
    println!("targeted task: read two md5lists and find out which files are unique to each one, and which are shared\n");
    let mut only_a = File::create("only_in_list_A.csv")?;
    let mut only_b = File::create("only_in_list_B.csv")?;
    let mut shared = File::create("shared_files.csv")?;
    
    let reader_a = BufReader::new(File::open(PATH_A)?).lines().map(|l| l.unwrap());
    let reader_b = BufReader::new(File::open(PATH_B)?).lines().map(|l| l.unwrap());

    for elem in reader_a.merge_join_by(reader_b, |a, b| a[0..32].cmp(&b[0..32])) {
        match elem {
            Both(a, b) => _=writeln!(shared, "{a} = {b}"),
            Left(a) => _=writeln!(only_a, "{a}"),
            Right(b) => _=writeln!(only_b, "{b}"),
        }
    }
    Ok(())
}
1 Like

TBH, I'd say to not bother, most of the time.

The md5sum lists are not that big. On a modern machine -- even phones, really -- you can easily load two 1 GB hashmaps, and I doubt the lists are that bad.

Of course, if you want to do it lazily as an exercise, or because you found a nice way to do it anyway, that's great. But my first instinct for how to do this would absolutely be to just load it into memory.

(One piece I'd point out here is that you had to sort the CSVs in the first place, which probably did that by loading it into memory.)

I'm ignoring the "quick and dirty" part of your original post, because if the point is to get something working ASAP and throw it away then do whatever.

But the general thing I'd say is that you should think about how you wish you had the data, and try to reflect that somehow. The fact that it's coming from a File in a particular format is irrelevant when it comes to the merging logic.

In a way, that's what the latest code you posted does:

  • Turn the files into iterators
  • Run the logic to merge the iterators
  • Write out the merge results to the correct place

That can also be a big help for testing it. If you write the function to work on impl BufRead/BufWrite, then you can test it with in-memory things like Strings instead of needing to write them out to files.

1 Like

Thanks for another well thought out answer! :slight_smile:

About the memory thing: You're right, one should start optimizing when bad performance necessitates it, not before. But I simply love that rust allows to write so well optimized code that I gravitate towards doing so, just for fun. At least as long as it doesn't seem like a big tradeoff in much more time required to write the code or much less readability - which it didn't look like for me in this case.

About the "quick and dirty" comment - I meant that it's just something I wrote for myself, because I needed it's output. I'm not getting paid for it or plan on publishing it somewhere (more public than here).

I still wanted to use it as an exercise to improve my rust coding skills and also do it in a way that hopefully allows me to still understand what it does if I stumble upon it again in a few years :wink:

Thanks for your hint about Iterators + flexibility & test-ability. I agree they're a beautiful coding pattern :slight_smile: And if I understand what it does correctly, this still allows my tool to not have to read the complete files into memory but take advantage of the BufferedReader to proces them chunk-wise.

I overlooked that one :wink: You're right of course, but that's an older component of this workflow and should remain a separate tool on purpose.

I have those md5lists lying around as metadata for most of my datasets. This allows me to separate the time intensive task of reading the complete datasets to generate those list - which can take many hours - from the manual step of deciding what I want to do with overlaps / differences.

Since it's implemented as a bash script, the actual sorting is done by the GNU coreutils sort utility.

And judging from this discussion sort actually would have options to balance the resource usage between disk and memory.

If that separation is not required or a nice GUI is more important I'd suggest unison if the focus is on differences and czkawka (which is a rust rewrite of fslint-gui) if the focus is on overlaps.

UPDATE: for the continuation of the story of the tool this topic is about, see Howto create a modified version of Itertools::merge_join_by?

1 Like