String comparison not working uniformly

I am working on a file metadata comparison tool and ran into an issue regarding string comparisons. I am new to Rust, so apologies in advance for beginner syntax.

I have the following struct:

struct Data {
    file_name: String, // includes directory and file
    hash: String,
    modification_time: u64,
    file_size: u64 // in bytes
}

My code reads from a CSV file and adds that data to a HashMap<u128, Data>. I then read all files from a directory and compare each one with what is already in the hashmap. If a file is already in the hashmap, the code should print "Same". The issue is that the comparison works for every struct field except the hash, where it works only sometimes. I have not been able to pinpoint the exact error.

Code snippet where I compare the strings:

    for entry in WalkDir::new(directory.clone().expect("Directory not accessible?")).into_iter().filter_map(|e| e.ok()) {
        if entry.metadata().unwrap().is_file() {
            //println!("PATH: {}", entry.path().display());
            // for each file, build a Data value and compare it with the read_from_file hashmap
            // if the file is already there and both the hash and modification time are new,
            // say updated; if the hash is different but the mod time is the same, say damaged; else new
            temp_data2 = Data {
                file_name: entry.path().display().to_string().clone(),
                hash: hashing(entry.path().display().to_string().clone()),
                modification_time: modified_time(entry.path().display().to_string().clone()),
                file_size: file_size(entry.path().display().to_string())
            };
            // checking what is added to the hashmap
            // read_from_file is the hashmap
            for data in &read_from_file {
                if data.1.file_name.clone() == temp_data2.file_name.clone() && data.1.modification_time.clone() == temp_data2.modification_time.clone() && data.1.file_size.clone() == temp_data2.file_size.clone() {
                    println!("SAME file --> {:?} VS {:?}", &data.clone(), &temp_data2);
                    // ERROR is here - works only for one of the cases.
                    if data.1.hash.as_str() == temp_data2.hash.as_str() {
                        println!("HASH is same {:?} == {:?}", data.1.hash, temp_data2.hash);
                    }
                } 
            }
            read_from_file.insert(counter, temp_data2);
            counter+= 1;
        }
    }

Output:

SAME file --> (4, Data { file_name: "../workshop/filename18.txt", hash: "5489360c396292f7c7cfab97077234bbdc485573d8cccec0effffd3b90175d48", modification_time: 1744569614, file_size: 11 }) VS Data { file_name: "../workshop/filename18.txt", hash: "548936c396292f7c7cfab9777234bbdc485573d8cccec0effffd3b90175d48", modification_time: 1744569614, file_size: 11 }
SAME file --> (2, Data { file_name: "../workshop/filename13.txt", hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", modification_time: 1744249986, file_size: 0 }) VS Data { file_name: "../workshop/filename13.txt", hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", modification_time: 1744249986, file_size: 0 }
HASH is same "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" == "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
SAME file --> (3, Data { file_name: "../workshop/filename2.txt", hash: "20da97c8e88eab44fa9fb64c3a17b22cb10599bba4f141320d00cd355edd11ea", modification_time: 1744565017, file_size: 11 }) VS Data { file_name: "../workshop/filename2.txt", hash: "20da97c8e88eab44fa9fb64c3a17b22cb1599bba4f14132d0cd355edd11ea", modification_time: 1744565017, file_size: 11 }

The issue is that "HASH is same" is printed for only one case, even though it should be printed for the other lines as well. I am just trying to figure that out. Am I missing something really basic here? I can provide the whole code as well if that would be helpful.

If I copy the two pairs of hashes that should be the same and compare them, the hashes are not equal:

fn main() {
    assert_eq!(
        "5489360c396292f7c7cfab97077234bbdc485573d8cccec0effffd3b90175d48",
        "548936c396292f7c7cfab9777234bbdc485573d8cccec0effffd3b90175d48"
    );
    
    assert_eq!(
        "20da97c8e88eab44fa9fb64c3a17b22cb10599bba4f141320d00cd355edd11ea",
        "20da97c8e88eab44fa9fb64c3a17b22cb1599bba4f14132d0cd355edd11ea",
    );
}

Playground.

I don't know where the hashing function you use to generate the hash comes from. Would you mind telling us where you got it? Or, if you wrote it yourself, could you please share the source?

Ah, great catch on that! I checked with an online diff tool and also did a visual check, but the difference somehow got past both. I wrote the hashing function myself, based on use sha2::{Sha256, Digest}; from sha2 = "0.10.8":

pub fn hashing(input_file: String) -> String {
    let mut output_string: String = "".to_string();
    let mut hasher = Sha256::new();
    hasher.update(std::fs::read(input_file).unwrap());
    // Output is a vector of decimal values - need to change to hex
    let result = hasher.finalize();
    // loop to convert to hex and add to output string
    for decimal in result {
        output_string += &format!("{:x}", decimal).to_string();
        // Remove after - for testing
        //println!("{}", output_string.clone());
    }
    // return String
    output_string
}

Based on your examples above, it seems my hash function was "skipping" some of the 0 digits in the output.

I'm currently tinkering around with the output_string += &format!("{:x}", decimal).to_string(); line to see how I can fix this error.

You need {:02x} instead of {:x} to left-pad with 0 when there are fewer than two digits to print.
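
To make the difference concrete, here is a minimal sketch using only the standard library. The function names hex_unpadded and hex_padded are made up for illustration, and the input is just the first four bytes of the filename18.txt hash from the output above, not a real digest computed with sha2.

```rust
/// Hex-encodes bytes without padding. This reproduces the bug from the
/// thread: any byte below 0x10 is printed as a single digit.
fn hex_unpadded(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:x}", b)).collect()
}

/// Hex-encodes bytes with every byte zero-padded to two digits.
fn hex_padded(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    // First four bytes of the filename18.txt hash shown in the question.
    let bytes = [0x54, 0x89, 0x36, 0x0c];
    println!("{}", hex_unpadded(&bytes)); // 548936c  (the 0 of 0c is lost)
    println!("{}", hex_padded(&bytes)); // 5489360c
}
```

The unpadded result matches the start of the shortened hash in the question, which is why the two strings never compared equal.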

Also, format! allocates a new String on every call, which will probably hurt your performance in that loop. Prefer using the write! macro, or even better, compute the two hex characters yourself and push them manually.
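
Both alternatives could look roughly like this. It is a sketch, not a benchmarked implementation: the function names hex_with_write and hex_manual are hypothetical, and the functions take a plain byte slice so that they run without the sha2 crate.

```rust
use std::fmt::Write; // needed so that write! can target a String

/// Appends each byte as two hex digits directly into one String,
/// avoiding the temporary String that format! would allocate per byte.
fn hex_with_write(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // Writing into a String does not fail, so unwrap is fine here.
        write!(out, "{:02x}", b).unwrap();
    }
    out
}

/// Computes the two hex characters by hand and pushes them.
fn hex_manual(bytes: &[u8]) -> String {
    const DIGITS: &[u8; 16] = b"0123456789abcdef";
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(DIGITS[(b >> 4) as usize] as char); // high nibble
        out.push(DIGITS[(b & 0x0f) as usize] as char); // low nibble
    }
    out
}

fn main() {
    let bytes = [0x00, 0x0c, 0xab, 0xff];
    assert_eq!(hex_with_write(&bytes), hex_manual(&bytes));
    println!("{}", hex_manual(&bytes)); // 000cabff
}
```

In the hashing function above, the slice would be the result of hasher.finalize(). Whether the difference matters in practice is something to measure, since reading and hashing the file will likely dominate.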

That worked! Big thanks to you and @jofas for this! I'll try to leverage the write macro here as well.

Side note: You should really be running cargo clippy on your code. Just at first glance, I see a lot of unnecessary clone()s and other allocations that are completely unjustified.
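
As one example of what could go, the comparisons in the inner loop work on references, so none of the clone() calls there are needed. A minimal sketch, assuming the Data struct from the question; the helper name same_metadata and the sample values are made up for illustration:

```rust
#[derive(Debug)]
struct Data {
    file_name: String, // includes directory and file
    hash: String,
    modification_time: u64,
    file_size: u64, // in bytes
}

/// Hypothetical helper: true when name, modification time and size match.
/// It only borrows both values, so nothing is cloned or allocated.
fn same_metadata(a: &Data, b: &Data) -> bool {
    a.file_name == b.file_name
        && a.modification_time == b.modification_time
        && a.file_size == b.file_size
}

fn main() {
    let stored = Data {
        file_name: "../workshop/filename13.txt".to_string(),
        hash: "e3b0c442".to_string(), // shortened placeholder, not a full digest
        modification_time: 1744249986,
        file_size: 0,
    };
    let scanned = Data {
        file_name: "../workshop/filename13.txt".to_string(),
        hash: "e3b0c442".to_string(),
        modification_time: 1744249986,
        file_size: 0,
    };

    if same_metadata(&stored, &scanned) {
        println!("SAME file --> {:?} VS {:?}", stored, scanned);
        // String == String compares contents; no as_str() or clone() needed.
        if stored.hash == scanned.hash {
            println!("HASH is same {:?} == {:?}", stored.hash, scanned.hash);
        }
    }
}
```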

Thanks for pointing that out. I'll go back and review those towards the end of this project.