Fastest algorithm to detect duplicate files

Question

In the process of finding duplicates in my 2 terabytes of HDD stored images I was astonished about the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup which is implemented as very well written and documented shell script using an ultra-long pipe. Essentially its based on find and hashing (md5 and SHA1). The author states that it was faster than any other alternative which I couldn't believe. So I found Detecting duplicate files where the topic quite fast slided towards hashing and comparing hashes which is not the best and fastest way in my opinion.

So the usual algorithm seems to work like this:

generate a sorted list of all files (path, Size, id)
group files with the exact same size
calculate the hash of all the files with a same size and compare the hashes
same has means identical files - a duplicate is found

Sometimes the speed gets increased by first using a faster hash algorithm (like md5) with more collision probability and second if the hash is the same use a second slower but less collision-a-like algorithm to prove the duplicates. Another improvement is to first only hash a small chunk to sort out totally different files.

So I've got the opinion that this scheme is broken in two different dimensions:

duplicate candidates get read from the slow HDD again (first chunk) and again (full md5) and again (sha1)
by using a hash instead just comparing the files byte by byte we introduce a (low) probability of a false negative
a hash calculation is a lot slower than just byte-by-byte compare

I found one (Windows) app which states to be fast by not using this common hashing scheme.

Am I totally wrong with my ideas and opinion?

[Update]

There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception out of the general use of "hash tables speed up things". But to generate a hash of a file the first time the files needs to be read fully byte by byte. So there a byte-by-byte-compare on the one hand, which only compares so many bytes of every duplicate-candidate function till the first differing position. And there is the hash function which generates an ID out of so and so many bytes - lets say the first 10k bytes of a terabyte or the full terabyte if the first 10k are the same. So under the assumption that I don't usually have a ready calculated and automatically updated table of all files hashes I need to calculate the hash and read every byte of duplicates candidates. A byte-by-byte compare doesn't need to do this.

[Update 2]

I've got a first answer which again goes into the direction: "Hashes are generally a good idea" and out of that (not so wrong) thinking trying to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question. "Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and wrong (IMHO) too. Why can't I just read a block from each same-size file and compare it in memory? If I have to compare 100 files I open 100 file handles and read a block from each in parallel and then do the comparison in memory. This seams to be a lot faster then to update one or more complicated slow hash algorithms with these 100 files.

[Update 3]

Given the very big bias in favor of "one should always use hash functions because they are very good!" I read through some SO questions on hash quality e.g. this: Which hashing algorithm is best for uniqueness and speed? It seams that common hash functions more often produce collisions then we think thanks to bad design and the birthday paradoxon. The test set contained: "A list of 216,553 English words (in lowercase), the numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com) and 216,553 "random" (i.e. type 4 uuid) GUIDs". These tiny data sets produced from arround 100 to nearly 20k collisions. So testing millions of files on (in)equality only based on hashes might be not a good idea at all.

I guess I need to modify 1 and replace the md5/sha1 part of the pipe with "cmp" and just measure times. I keep you updated.

[Update 4] Thanks for alle the feedback. Slowly we are converting. Background is what I observed when fslints findup had running on my machine md5suming hundreds of images. That took quite a while and HDD was spinning like hell. So I was wondering "what the heck is this crazy tool thinking in destroying my HDD and taking huge amounts of time when just comparing byte-by-byte" is 1) less expensive per byte then any hash or checksum algorithm and 2) with a byte-by-byte compare I can return early on the first difference so I save tons of time not wasting HDD bandwidth and time by reading full files and calculating hashs over full files. I still think thats true - but: I guess I didn't catch the point that a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;) might be cheaper than is hashing per byte. But complexity wise hashing with O(n) may win when more and files need to be compared against each other. I have set this problem on my list and plan to either replace the md5 part of findup's fslint with cmp or enhance pythons filecmp.py compare lib which only compares 2 files at once with a multiple files option and maybe a md5hash version. So thank you all for the moment. And generally the situation is like you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelyhood of same length files, duplicate files, typical files size, performance of CPU vs. Memory vs. Disk, Single vs. Multicore and so on. And I learned that I should considder more often using hashes - but I'm an embedded developer with most of the time very very limited resources ;-)

Thanks for all your effort! Marcel

tucuxi · Accepted Answer

The fastest de-duplication algorithm will depend on several factors:

how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an on-going basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata); and add those hashes to the index too.
what do you do with file permissions, modification date, ACLs and other file metadata that excludes the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.

Therefore, there is no single way to answer the original question. Fastest when?

Assuming that two files have the same size, there is, in general, no fastest way to detect whether they are duplicates or not than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file-system is more efficient when reading blocks than individual bytes).

Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if it takes k times as much to hash than to compare byte-by-byte, hashing is better when k > (n-1)/2. Hashes may yield false-positives (although strong hashes will only do so with astronomically low probabilities), but testing those byte-by-byte will only increment k by at most 1. With k=3, you will be ahead as soon as n>=7; with a more conservative k=2, you reach break-even with n=3. In practice, I would expect k to be very near to 1: it will probably be more expensive to read from disk than to hash whatever you have read.

The probability that several files will have the same sizes increases with the square of the number of files (look up birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!

Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash with, say, the 1st, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you will expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when found. This is a trade-off: you are betting that there will be differences that will save you the time of a full hash or full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.

[Update]

The OP seems to be under the impression that

Hashes are slow to calculate
Fast hashes produce collisions
Use of hashing always requires reading the full file contents, and therefore is overkill for files that differ in their 1st bytes.

I have added this segment to counter these arguments:

A strong hash (sha1) takes about 5 cycles per byte to compute, or around 15ns per byte on a modern CPU. Disk latencies for a spinning hdd or an ssd are on the order of 75k ns and 5M ns, respectively. You can hash 1k of data in the time that it takes you to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash at 1 byte per cycle. Main memory latencies are at around 120 ns - there's easily 400 cycles to be had in the time it takes to fulfill a single access-noncached-memory request.
In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and stronger (SHA-3).
You can always hash parts of a file instead of all of it; and store partial hashes until you run into collisions, which is when you would calculate increasingly larger hashes until, in the case of a true duplicate, you would have hashed the whole thing. This gives you much faster index-building.

My points are not that hashing is the end-all, be-all. It is that, for this application, it is very useful, and not a real bottleneck: the true bottleneck is in actually traversing and reading parts of the file-system, which is much, much slower than any hashing or comparing going on with its contents.

Matt Timmermans · Answer

The most important thing you're missing is that comparing two or more large files byte-for-byte while reading them from a real spinning disk can cause a lot of seeking, making it vastly slower than hashing each individually and comparing the hashes.

This is, of course, only true if the files actually are equal or close to it, because otherwise a comparison could terminate early. What you call the "usual algorithm" assumes that files of equal size are likely to match. That is often true for large files generally.

But...

When all the files of the same size are small enough to fit in memory, then it can indeed be a lot faster to read them all and compare them without a cryptographic hash. (an efficient comparison will involve a much simpler hash, though).

Similarly when the number of files of a particular length is small enough, and you have enough memory to compare them in chunks that are big enough, then again it can be faster to compare them directly, because the seek penalty will be small compared to the cost of hashing.

When your disk does not actually contain a lot of duplicates (because you regularly clean them up, say), but it does have a lot of files of the same size (which is a lot more likely for certain media types), then again it can indeed be a lot faster to read them in big chunks and compare the chunks without hashing, because the comparisons will mostly terminate early.

Also when you are using an SSD instead of spinning platters, then again it is generally faster to read + compare all the files of the same size together (as long as you read appropriately-sized blocks), because there is no penalty for seeking.

So there are actually a fair number of situations in which you are correct that the "usual" algorithm is not as fast as it could be. A modern de-duping tool should probably detect these situations and switch strategies.

Fastest algorithm to detect duplicate files

Tags:

performance

algorithm

duplicates

hash

Marcel

2 Answers

tucuxi

Matt Timmermans

Recent Activity

Donate For Us

Fastest algorithm to detect duplicate files

Tags:

performance

algorithm

duplicates

hash

Marcel

2 Answers

tucuxi

Matt Timmermans

Related questions

Recent Activity

Donate For Us