Finding duplicate files

There are many ways to find duplicate files. One can use fdupes, or, if one wants to get a little fancy, maybe this one will do. There are a dozen other similar utilities, written in various languages of choice.

But when you don't have these tools and you need to get the job done right away using only standard Unix tools on your system, is there an easy way to do it?

As it turns out, there is. It is rather simplistic; it is neither the most efficient nor the fastest way to do the job, but if you have time to burn and don't mind exercising your disk a little, it will work.

Note: this article is an expanded version of my previous blog post here.

My method

Actually, there is more than one way of doing this using common tools.

My method is simple: compute the (cryptographic) hashes of all the files. Files with identical hashes then have a high probability of being identical.

But files having the same hash aren't always identical, due to hash collisions (this is the nature of hashes). To minimise false positives, we can run the output through a second, different (cryptographic) hash; if the files also have the same hashes there, then it is almost certain that they are identical.

If you are still worried about false positives even after the second hashing, you can always do a byte-by-byte comparison on them, but I will not do that here.
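
(If you do want that extra certainty, a minimal sketch is to run cmp on a suspected pair; cmp -s is silent and exits with status 0 only when the two files are byte-for-byte identical. The two filenames below are just placeholders.)

cmp -s file1 file2 && echo "identical" || echo "different"
Optional byte-by-byte check (with cmp)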

My recipe

Without further ado, here is the recipe:

find /root -type f -print0 | xargs -0 md5sum | sort -k1 |
awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""}
else { prevmd5=$1; prevfile=$0 }}'
Single hashing (with md5sum)

Note: replace /root with the root directory that contains the files you want to check; e.g. if you want to check a whole disk, then perhaps use /mnt/sda1 if your sda1 disk is mounted at /mnt/sda1.

If you want to check everything, simply mount all of your disks and then use / to search for everything (you may want to exclude /dev, /proc and /sys if you wish).
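
For example, here is a sketch of the find invocation with those pseudo-filesystems pruned; feed its output through the same awk stage as in the recipe above (the directory list is just the example from the previous paragraph, so adjust it to taste):

# prune the pseudo-filesystems, then hash everything else as before
find / \( -path /dev -o -path /proc -o -path /sys \) -prune -o -type f -print0 |
xargs -0 md5sum | sort -k1
Pruning /dev, /proc and /sys (a sketch)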

What the above command does is list all the files (starting at /root), pass all those filenames to md5sum, sort the output by hash, and from there display all the files which have the same hash.
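
If the awk one-liner looks dense, here is the same logic laid out with comments (a sketch; the behaviour is identical to the one-liner in the recipe):

awk '{
  if ($1 == prevmd5) {              # same hash as the previous line: a duplicate
    if (prevfile) print prevfile    # print the first file of the group (only once)
    print $0                        # print the current duplicate
    prevfile = ""                   # mark the group leader as already printed
  } else {                          # new hash: remember it and its file
    prevmd5 = $1
    prevfile = $0
  }
}'
The duplicate-grouping awk script, annotated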

Once you have the output you can then continue with the second hashing like this:

sed 's/.*[ \t]//' | xargs sha1sum | sort -k1 |
awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""}
else { prevmd5=$1; prevfile=$0 }}'
The second hashing (using sha1sum) (continuation from previous)

It's almost identical to the first one: the sed strips the hashes so we are only left with the filenames, which then get passed to sha1sum. As before, we sort them again based on the hashes and display all files that have the same hashes (the sort and awk commands used are identical to the first one).

You don't have to run the second hash over all the files again; we only need to pass the ones that the first hash has flagged as potentially identical.

The complete recipe

Combining both of them, we get:

find /root -type f -print0 | xargs -0 md5sum | sort -k1 |
awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""}
else { prevmd5=$1; prevfile=$0 }}' |
sed 's/.*[ \t]//' | xargs sha1sum | sort -k1 |
awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""}
else { prevmd5=$1; prevfile=$0 }}' |
sed 's/.*[ \t]//'
Find duplicate files using double-hashing.

The final sed removes the hashes from the displayed output. What you want to do with these duplicates is then up to you: you can delete them, archive them, or whatever.

The script works fine as long as you don't have funny filenames. If you do have funny filenames (filenames that contain spaces, newlines, asterisks, question marks and/or start with a dash), you will need to take precautions on the second hashing (the first one is fine). There are many fine ways of doing that; just search for them, or see the sketch below.
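
For what it's worth, here is one possible precaution (a sketch, not the only way): strip the md5 hash with a fixed-width match instead of a greedy one, turn newlines into NULs so xargs keeps spaces intact, and end option parsing with -- so names starting with a dash aren't taken as options. It assumes GNU sed, tr and xargs, and filenames that themselves contain newlines will still break it.

# replaces the second-hashing stage; not safe for filenames containing newlines
sed 's/^[0-9a-f]\{32\}  //' | tr '\n' '\0' | xargs -0 sha1sum -- | sort -k1 |
awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""}
else { prevmd5=$1; prevfile=$0 }}'
A more careful second hashing stage (a sketch)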

Final words

This method isn't particularly efficient. It reads the entire content of all files under whichever root directory you specify (your whole hard disk, for example) and hashes all of them with the first hash (md5sum in this example), so it will exercise your hard disk heavily; it will probably saturate your disk I/O bandwidth and eat a lot of your CPU resources.

So do this only if it is necessary. The other programs mentioned at the beginning are smarter; some of them check the file sizes first (if the file sizes are different, the files cannot be identical), some check file timestamps, some check for hardlinks, etc. They usually work better if you have a lot of small files.
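
If you want to borrow the size trick without leaving the shell, here is a sketch that hashes only files whose size occurs more than once, reusing the same grouping pattern as the recipe; pipe its output into the awk stage from the recipe above. It assumes GNU find (for -printf) and GNU xargs (for -d), and it is not safe against filenames containing newlines.

# group by size first; only size collisions get hashed
find /root -type f -printf '%s %p\n' | sort -n |
awk '{ if ($1==prevsize) { if (prevfile) print prevfile; print $0; prevfile="" }
else { prevsize=$1; prevfile=$0 }}' |
sed 's/^[0-9]* //' | xargs -d '\n' md5sum -- | sort -k1
Hashing only size-collision candidates (a sketch)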

Nevertheless, if you find that you regularly need to find duplicate files, you may be better served by a filesystem that can store duplicate files efficiently, like zfs or btrfs.

Note: original blog post here.