Efficient way of checking if two files have the same content

Hi everyone,

I am building a static site generator, and during the build phase I need to check whether two files have the same content, meaning the same bytes.

I went with this first approach:

use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn is_same_file(file1: &Path, file2: &Path) -> Result<bool, std::io::Error> {
    let f1 = File::open(file1)?;
    let f2 = File::open(file2)?;

    // Cheap early exit: files of different sizes can't have the same content
    if f1.metadata()?.len() != f2.metadata()?.len() {
        return Ok(false);
    }

    // Use buffered readers to avoid one syscall per byte
    let f1 = BufReader::new(f1);
    let f2 = BufReader::new(f2);

    // Byte-by-byte comparison of the two files
    for (b1, b2) in f1.bytes().zip(f2.bytes()) {
        if b1? != b2? {
            return Ok(false);
        }
    }

    Ok(true)
}

I have a few questions:

  • Is this implementation an efficient way of checking if two files are the same? Would you suggest another algorithm or technique?
  • I suppose zipping two iterators doesn't consume them?
  • Since checking whether large files are the same is expensive, maybe computing a hash and comparing it would be faster than comparing the files each time? Or, even better, I could store the modification date of each file copied and check that the next time I copy :wink:

Thanks

No, it doesn't.

That would depend on the number of times a file is modified vs the number of times you check it. If a file is modified more often than it is checked, then a checksum won't help, since computing the checksum requires you to read the entire file. If it's the other way round, then checksums save a lot of time.

Modification times tell you when the file potentially could have changed - it might still be the same. Of course, you may ignore this distinction and consider it changed anyway (make systems do that, git doesn't).

That's very unreliable, since all sorts of events beyond your control change modification dates around almost randomly. Not to mention that there is no such thing as "the" last modified date, since platforms vary greatly in how they represent and interpret this information. On some platforms, for example, there are many related dates such as "last modified", "last opened", "created", and even when they are called the same on two different OSes, they won't mean the same thing.

This is almost definitely the solution you want, and incidentally, it is also the canonical solution that most "cache and recreate" systems end up using eventually. It might not be as fast as checking modification dates, but it can actually be made reliable, because at the very least other processes won't overwrite your metadata files. Also, you could probably compute the hash while you are churning out the binary contents of the file, so that the hashing time would be amortized over the actual content generation.
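The "hash while writing" idea can be sketched with a wrapper around any `Write` sink. This is a minimal illustration, not the poster's actual design: the `HashingWriter` name is made up, and `DefaultHasher` is used only because it is in the standard library (it is not a stable or cryptographic hash; a real implementation would pick a dedicated content hash).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Write};

// Hypothetical wrapper: hashes every byte written through it, so the
// digest is computed while the output file is being generated.
struct HashingWriter<W: Write> {
    inner: W,
    hasher: DefaultHasher, // illustrative choice; swap in a real content hash
}

impl<W: Write> HashingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, hasher: DefaultHasher::new() }
    }

    // Returns the inner writer and the digest of everything written
    fn finish(self) -> (W, u64) {
        (self.inner, self.hasher.finish())
    }
}

impl<W: Write> Write for HashingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Only hash the bytes that were actually written
        self.hasher.write(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

The digest from `finish` could then be stored next to the generated file and compared on the next build, instead of re-reading the old output.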

Another factor to weigh in this analysis is that detecting differences via a checksum must always read the entire file, whereas a byte-for-byte comparison can stop as soon as it sees a difference. This favors direct comparison if you're usually comparing files that are quite different, but leans towards checksumming as you expect to compare more identical or nearly-identical files.

Hi,

Thank you all for the answers, it's a great input!

Yes, I didn't think about that: the modification date can be changed, but that can be ignored in this case, since we do not need 100% reliability.

Yes, this is exactly the situation: most of the time the files that need to be built are the same, since I mostly edit HTML and CSS files, for example. All the assets like images stay the same.

So hashing one file and comparing it with a previous hash would be at most twice as fast as the current byte-by-byte check, since we only read one entire file, is that right?

For example, I have a 2.3 MB GIF in my assets, so computing a hash of that file would take slightly longer than reading the whole file content, depending on the hashing algorithm.

OK, I just invented something: an "incremental checksum".
Like an md5 file, but after every N bytes it appends a new checksum to the end.
Then, when comparing huge files, it only reads the entire file if every smaller checksum matches.
I am sure someone has already made this.

The checksum file would look something like this:

BLOCK = 1G
53.........hexhash.....d2  bigfile.png BLOCK1
4d.........hexhash.....ae  bigfile.png BLOCK2
a3.........hexhash.....33  bigfile.png BLOCK3
17.........hexhash.....2f  bigfile.png BLOCK4

Then you only read a file all the way through if none of its blocks have been modified. If there is a difference near the beginning, you find it sooner.
Also, you can create this checksum file partially: if you find a difference in an early block, just save the checksums up to that point. You don't need to check the whole file unless you need to.

PS. I am sure there is already an implementation of this somewhere.
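The block-checksum idea above can be sketched as follows. This is a rough illustration under assumed names (`block_checksums`, `first_mismatch` are made up), and `DefaultHasher` stands in for whatever real hash an implementation would use:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

// Hash a stream in fixed-size blocks; returns one checksum per block.
// DefaultHasher is illustrative only; any fast hash would do.
fn block_checksums<R: Read>(mut reader: R, block_size: usize) -> io::Result<Vec<u64>> {
    let mut sums = Vec::new();
    let mut block = vec![0u8; block_size];
    loop {
        // Fill the block as far as possible (plain `read` may return short)
        let mut filled = 0;
        while filled < block_size {
            let n = reader.read(&mut block[filled..])?;
            if n == 0 { break; }
            filled += n;
        }
        if filled == 0 { break; }
        let mut h = DefaultHasher::new();
        h.write(&block[..filled]);
        sums.push(h.finish());
        if filled < block_size { break; } // EOF reached mid-block
    }
    Ok(sums)
}

// Two checksum lists match only if they are identical; comparison can
// stop at the first differing block.
fn first_mismatch(a: &[u64], b: &[u64]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter().zip(b).position(|(x, y)| x != y)
}
```

With per-block checksums stored from a previous run, a later check only needs to re-hash blocks up to the first mismatch.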

Remember that if you're actually getting the file off disk (even SSD) then the CPU part of this is basically irrelevant, because it can consume it way faster than the disk can provide it.

I don't have high confidence that the byte zipping will optimize particularly well. I suspect it would be faster to read and compare bigger chunks -- maybe hundreds of bytes for the first chunk, since that'll be most likely different already, then thousands of bytes for later chunks.

And as far as I know, the actual disk subsystems in your OS can't read bytes anyway, so reading more at a time is probably "free".

It depends on whether you're comparing two files or deduping lots of files. For just comparing, there's no need to bother with a hash.

For deduping, you might as well start with the hash of the first, say, 512 bytes, then only bother computing the hash of the rest of the file if you get collisions on those initial bytes. (That way you can even use a low-quality-but-fast hash for the initial check.)
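The two-stage dedup idea can be sketched like this. Function names (`prefix_hash`, `group_by_prefix`) are invented for illustration, and `DefaultHasher` is only a stand-in for a fast hash; the point is grouping by a cheap prefix digest so that full comparison is needed only within a bucket:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

// Hash only the first `prefix_len` bytes of a file (cheap first pass)
fn prefix_hash(path: &Path, prefix_len: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; prefix_len];
    let mut file = File::open(path)?;
    let mut filled = 0;
    while filled < prefix_len {
        let n = file.read(&mut buf[filled..])?;
        if n == 0 { break; }
        filled += n;
    }
    let mut h = DefaultHasher::new();
    h.write(&buf[..filled]);
    Ok(h.finish())
}

// Group candidate files by prefix hash; only files sharing a bucket
// need a full-content comparison afterwards.
fn group_by_prefix(paths: &[PathBuf], prefix_len: usize) -> io::Result<HashMap<u64, Vec<PathBuf>>> {
    let mut groups: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for p in paths {
        groups.entry(prefix_hash(p, prefix_len)?).or_default().push(p.clone());
    }
    Ok(groups)
}
```

Buckets with a single file can be skipped entirely; only buckets with two or more members need the expensive full-file hash or byte comparison.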

Isn't it the job of the BufReader to read the file in chunks? Then I would need to increase the chunk size to a bigger value than the default one (8 KB)?

The thing is that most of the time the files aren't changing, so this method will also require reading the whole file before we know they are the same. But it's definitely faster, thanks! :wink:

Both @pkolaczk and I wrote file dupe finder programs. Hashing a file can actually be very fast on an SSD. But if you're going to hash only the first few bytes, you should hash the minimum read size (page size) of your platform (generally 4096 bytes). Hashing less is a waste of time; if the file is larger, hashing more can mean superfluous syscalls.

Hashing can be so fast, the only bottleneck really is how fast you can walk and read from the filesystem.

Yes, the BufReader will read it in larger chunks from the OS.

The question is whether the "go back to the BufReader to get the next byte" overhead will optimize out. If the code actually has to "go ask for one byte (from the BufReader), compare that one byte, repeat", then it's wasting a bunch of CPU capability -- at the very least you want to compare a usize worth of bytes at once: https://rust.godbolt.org/z/KPx74dG3j

If you're comparing only two files, then any hashing is unnecessary, and a byte-by-byte comparison will be more efficient. However, I'm not sure the .bytes() iterator is the best tool for it, since handling the io::Error per byte adds overhead that can't be optimized out. Reading in larger chunks should improve it.
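A chunked comparison along these lines might look like the sketch below. It is an illustration, not a benchmarked implementation; the 8 KiB chunk size and the helper names are assumptions. The key point is that slice equality (`==` on `&[u8]`) compiles to an optimized bulk comparison, unlike a per-byte loop:

```rust
use std::io::{self, Read};

// Compare two readers in fixed-size chunks. Slice equality on &[u8]
// lowers to an efficient memcmp-style comparison.
fn readers_equal<A: Read, B: Read>(mut a: A, mut b: B) -> io::Result<bool> {
    const CHUNK: usize = 8 * 1024; // assumed size; tune for your workload
    let mut buf_a = [0u8; CHUNK];
    let mut buf_b = [0u8; CHUNK];
    loop {
        let n_a = read_full(&mut a, &mut buf_a)?;
        let n_b = read_full(&mut b, &mut buf_b)?;
        if n_a != n_b || buf_a[..n_a] != buf_b[..n_b] {
            return Ok(false);
        }
        if n_a == 0 {
            return Ok(true); // both streams ended together
        }
    }
}

// Read until the buffer is full or EOF (plain `read` may return short)
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 { break; }
        filled += n;
    }
    Ok(filled)
}
```

For files, you would open both with `File::open` and pass the handles in directly, keeping the file-size early exit from the original function in front of this.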

If you're comparing many files against each other, then see the trick with btreemap: