Efficient way of checking if two files have the same content

Hi everyone,

I am building a static site generator, and during the build phase I need to check whether two files have the same content, meaning the same bytes.

I went with this first approach:

use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn is_same_file(file1: &Path, file2: &Path) -> Result<bool, std::io::Error> {
    let f1 = File::open(file1)?;
    let f2 = File::open(file2)?;

    // Cheap early exit: files of different sizes can't have the same content
    if f1.metadata()?.len() != f2.metadata()?.len() {
        return Ok(false);
    }

    // Use buffered readers to avoid one syscall per byte
    let f1 = BufReader::new(f1);
    let f2 = BufReader::new(f2);

    // Byte-by-byte comparison of the two files
    for (b1, b2) in f1.bytes().zip(f2.bytes()) {
        if b1? != b2? {
            return Ok(false);
        }
    }

    Ok(true)
}

I have a few questions:

  • Is this implementation an efficient way of checking if two files are the same? Would you suggest another algorithm or technique?
  • I suppose zipping two iterators doesn't consume them?
  • Since checking whether large files are the same is expensive, maybe computing a hash and comparing it would be faster than comparing the files each time? Or, even better, I could store the modification date of each file copied and check that the next time I copy :wink:

Thanks

No, it doesn't.

That would depend on the number of times a file is modified vs the number of times you check it. If a file is modified more often than it is checked, then a checksum won't help, since computing the checksum requires you to read the entire file. If it's the other way round, then checksums save a lot of time.

Modification times tell you when the file potentially could have changed - it might still be the same. Of course, you may ignore this distinction and consider it changed anyway (make systems do that, git doesn't).

That's very unreliable, since all sorts of events beyond your control change modification dates around almost randomly. Not to mention that there is no such thing as "the" last modified date, since platforms vary greatly in how they represent and interpret this information. On some platforms, for example, there are many related dates such as "last modified", "last opened", "created", and even when they are called the same on two different OSes, they won't mean the same thing.

This is almost definitely the solution you want, and incidentally, it is also the canonical solution that most "cache and recreate" systems end up using eventually. It might not be as fast as checking modification dates, but it can actually be made reliable, because at the very least other processes won't overwrite your metadata files. Also, you could probably compute the hash while you are churning out the binary contents of the file, so that the hashing time would be amortized over the actual content generation.
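The "hash while writing" idea can be sketched with a wrapper around any `Write` sink. This is a minimal illustration, not the poster's actual design: the `HashingWriter` name is made up, and `DefaultHasher` is used only because it is in the standard library (it is not a stable or cryptographic hash; a real implementation would pick a dedicated content hash).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Write};

// Hypothetical wrapper: hashes every byte written through it, so the
// digest is computed while the output file is being generated.
struct HashingWriter<W: Write> {
    inner: W,
    hasher: DefaultHasher, // illustrative choice; swap in a real content hash
}

impl<W: Write> HashingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, hasher: DefaultHasher::new() }
    }

    // Returns the inner writer and the digest of everything written
    fn finish(self) -> (W, u64) {
        (self.inner, self.hasher.finish())
    }
}

impl<W: Write> Write for HashingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Only hash the bytes that were actually written
        self.hasher.write(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

The digest from `finish` could then be stored next to the generated file and compared on the next build, instead of re-reading the old output.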

Another factor to weigh in this analysis is that detecting differences via a checksum must always read the entire file, whereas a byte-for-byte comparison can stop as soon as it sees a difference. This favors direct comparison if you're usually comparing files that are quite different, but leans towards checksumming as you expect to compare more identical or nearly-identical files.

Hi,

Thank you all for the answers, it's a great input!

Yes, I didn't think about that: the modification date can be changed, but that can be ignored in this case, since we do not need 100% reliability.

Yes, this is exactly the situation: most of the time the files that need to be built are the same, since I mostly edit HTML and CSS files, for example. All the assets like images stay the same.

So hashing one file and comparing it with a previous hash would be at most twice as fast as the current byte-by-byte check, since we only read one entire file, is that right?

For example, I have a 2.3 MB GIF in my assets, so computing a hash of that file would take slightly longer than reading the whole file content, depending on the hashing algorithm.

OK, I just invented something: an "incremental checksum".
Like an md5 file, but after every N bytes it appends a new checksum to the end.
Then, when comparing huge files, it only reads the entire file if every smaller checksum matches.
I am sure someone has already made this.

The checksum file would look something like this:

BLOCK = 1G
53.........hexhash.....d2  bigfile.png BLOCK1
4d.........hexhash.....ae  bigfile.png BLOCK2
a3.........hexhash.....33  bigfile.png BLOCK3
17.........hexhash.....2f  bigfile.png BLOCK4

Then you only read a file all the way through if none of its blocks have been modified. If there is a difference near the beginning, you find it sooner.
Also, you can create this checksum file partially: if you find a difference in an early block, just save the checksums up to that point. You don't need to check the whole file unless you need to.

PS. I am sure there is already an implementation of this somewhere.
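The block-checksum idea above can be sketched as follows. This is a rough illustration under assumed names (`block_checksums`, `first_mismatch` are made up), and `DefaultHasher` stands in for whatever real hash an implementation would use:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

// Hash a stream in fixed-size blocks; returns one checksum per block.
// DefaultHasher is illustrative only; any fast hash would do.
fn block_checksums<R: Read>(mut reader: R, block_size: usize) -> io::Result<Vec<u64>> {
    let mut sums = Vec::new();
    let mut block = vec![0u8; block_size];
    loop {
        // Fill the block as far as possible (plain `read` may return short)
        let mut filled = 0;
        while filled < block_size {
            let n = reader.read(&mut block[filled..])?;
            if n == 0 { break; }
            filled += n;
        }
        if filled == 0 { break; }
        let mut h = DefaultHasher::new();
        h.write(&block[..filled]);
        sums.push(h.finish());
        if filled < block_size { break; } // EOF reached mid-block
    }
    Ok(sums)
}

// Two checksum lists match only if they are identical; comparison can
// stop at the first differing block.
fn first_mismatch(a: &[u64], b: &[u64]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter().zip(b).position(|(x, y)| x != y)
}
```

With per-block checksums stored from a previous run, a later check only needs to re-hash blocks up to the first mismatch.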

Remember that if you're actually getting the file off disk (even SSD) then the CPU part of this is basically irrelevant, because it can consume it way faster than the disk can provide it.

I don't have high confidence that the byte zipping will optimize particularly well. I suspect it would be faster to read and compare bigger chunks -- maybe hundreds of bytes for the first chunk, since that'll be most likely different already, then thousands of bytes for later chunks.

And as far as I know, the actual disk subsystems in your OS can't read bytes anyway, so reading more at a time is probably "free".

It depends on whether you're comparing two files or deduping lots of files. For just comparing, there's no need to bother with a hash.

For deduping, you might as well start with the hash of the first, say, 512 bytes, then only bother computing the hash of the rest of the file if you get collisions on those initial bytes. (That way you can even use a low-quality-but-fast hash for the initial check.)
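The two-stage dedup idea can be sketched like this. Function names (`prefix_hash`, `group_by_prefix`) are invented for illustration, and `DefaultHasher` is only a stand-in for a fast hash; the point is grouping by a cheap prefix digest so that full comparison is needed only within a bucket:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

// Hash only the first `prefix_len` bytes of a file (cheap first pass)
fn prefix_hash(path: &Path, prefix_len: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; prefix_len];
    let mut file = File::open(path)?;
    let mut filled = 0;
    while filled < prefix_len {
        let n = file.read(&mut buf[filled..])?;
        if n == 0 { break; }
        filled += n;
    }
    let mut h = DefaultHasher::new();
    h.write(&buf[..filled]);
    Ok(h.finish())
}

// Group candidate files by prefix hash; only files sharing a bucket
// need a full-content comparison afterwards.
fn group_by_prefix(paths: &[PathBuf], prefix_len: usize) -> io::Result<HashMap<u64, Vec<PathBuf>>> {
    let mut groups: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for p in paths {
        groups.entry(prefix_hash(p, prefix_len)?).or_default().push(p.clone());
    }
    Ok(groups)
}
```

Buckets with a single file can be skipped entirely; only buckets with two or more members need the expensive full-file hash or byte comparison.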

Isn't it the job of the BufReader to read the file in chunks? Then I would need to increase the chunk size to a bigger value than the default one (8 KB)?

The thing is that most of the time the files aren't changing, so this method will also require reading the whole file before we know they are the same. But it's definitely faster, thanks! :wink:

Both @pkolaczk and I wrote file dupe finder programs. Hashing a file can actually be very fast on an SSD. But if you're going to hash only the first few bytes, you should hash the minimum read size (page size) of your platform (generally 4096 bytes). Hashing less is a waste of time; if the file is larger, hashing more can mean superfluous syscalls.

Hashing can be so fast, the only bottleneck really is how fast you can walk and read from the filesystem.

Yes, the BufReader will read it in larger chunks from the OS.

The question is whether the "go back to the BufReader to get the next byte" overhead will optimize out. If the code actually has to "go ask for one byte (from the BufReader), compare that one byte, repeat", then it's wasting a bunch of CPU capability -- at the very least you want to compare a usize worth of bytes at once: https://rust.godbolt.org/z/KPx74dG3j

If you're comparing only two files, then any hashing is unnecessary, and a byte-by-byte comparison will be more efficient. However, I'm not sure the .bytes() iterator is the best tool for it, since handling the io::Error per byte adds overhead that can't be optimized out. Reading in larger chunks should improve it.
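A chunked comparison along these lines might look like the sketch below. It is an illustration, not a benchmarked implementation; the 8 KiB chunk size and the helper names are assumptions. The key point is that slice equality (`==` on `&[u8]`) compiles to an optimized bulk comparison, unlike a per-byte loop:

```rust
use std::io::{self, Read};

// Compare two readers in fixed-size chunks. Slice equality on &[u8]
// lowers to an efficient memcmp-style comparison.
fn readers_equal<A: Read, B: Read>(mut a: A, mut b: B) -> io::Result<bool> {
    const CHUNK: usize = 8 * 1024; // assumed size; tune for your workload
    let mut buf_a = [0u8; CHUNK];
    let mut buf_b = [0u8; CHUNK];
    loop {
        let n_a = read_full(&mut a, &mut buf_a)?;
        let n_b = read_full(&mut b, &mut buf_b)?;
        if n_a != n_b || buf_a[..n_a] != buf_b[..n_b] {
            return Ok(false);
        }
        if n_a == 0 {
            return Ok(true); // both streams ended together
        }
    }
}

// Read until the buffer is full or EOF (plain `read` may return short)
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 { break; }
        filled += n;
    }
    Ok(filled)
}
```

For files, you would open both with `File::open` and pass the handles in directly, keeping the file-size early exit from the original function in front of this.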

If you're comparing many files against each other, then see the trick with btreemap: