Finding duplicate files
But when you don't have these tools and you need to get the job done right away using only the standard Unix tools on your system, is there an easy way to do it?
As it turns out, there is. It is rather simplistic, and it is neither the most efficient nor the fastest way to do the job, but if you have time to burn and don't mind exercising your disk a little, it will work.
Note: this article is an expanded version of my previous blog post here.
My method
There is actually more than one way to do this using common tools.
My method is simple: compute the (cryptographic) hashes of all the files. Files with identical hashes then have a high probability of being identical.
But files with the same hash aren't always identical, because hash collisions can occur (this is the nature of hashes). To minimise false positives, we can run the output through a second, different (cryptographic) hash; if the files also share the same hash there, it is almost certain that they are identical.
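As a tiny illustration (the filenames here are purely hypothetical), checking a pair of suspect files by hand would look something like this:

md5sum report.pdf backup/report-copy.pdf
sha1sum report.pdf backup/report-copy.pdf

If both commands print the same digest for both files, the two files are, for all practical purposes, identical.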
My recipe
Without further ado, here is the recipe:
find /root -type f -print0 | xargs -0 md5sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}'
Note: replace /root with the root directory that contains the files you want to check; e.g. if you want to check a whole disk then perhaps use /mnt/sda1, if your sda1 disk is mounted at /mnt/sda1. If you want to check everything, simply mount all of your disks and then use / to search everything (you may want to exclude /dev, /proc and /sys).
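For example, here is a sketch of such an exclusion using find's -path and -prune (assuming you are searching from /):

find / \( -path /dev -o -path /proc -o -path /sys \) -prune -o -type f -print0 | xargs -0 md5sum

The pruned directories are skipped entirely; everything else is hashed as before.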
What the above command does is list all the files (starting at /root), pass all these filenames to md5sum, sort the output by hash, and then display all the files which have the same hash.
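If the awk part of the one-liner looks cryptic, here is the same duplicate-grouping script laid out with comments (a reformatted sketch of the script above, not a different algorithm):

awk '{
    if ($1 == prevmd5) {
        # same hash as the previous line: this is a group of duplicates
        if (prevfile) print prevfile   # print the first member of the group once
        print $0                       # print the current member
        prevfile = ""                  # remember that the first member was printed
    } else {
        # a new hash: remember this line in case the next one matches it
        prevmd5 = $1
        prevfile = $0
    }
}'

Because the input is sorted by hash, any files sharing a hash arrive on consecutive lines, which is all this script relies on.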
Once you have the output you can then continue with the second hashing like this:
sed 's/.*[ \t]//' | xargs sha1sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}'
It's almost identical to the first one: the sed strips the hashes so we are left with only the filenames, which then get passed to sha1sum. As before, we sort them by hash and display all files that have the same hash (the sort and awk commands used are identical to those in the first command).
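One caveat: both the sed and the plain xargs in this stage assume filenames without whitespace. If your filenames may contain spaces, a slightly more careful sketch (still assuming no newlines in the names, and the usual two-space separator in md5sum's output) is:

sed 's/^[0-9a-f]*  //' | tr '\n' '\0' | xargs -0 sha1sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}'

Here the sed strips only the leading digest and its separating spaces, and the filenames are passed to xargs NUL-delimited so embedded spaces survive.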
You don't have to run the second hash over all the files again; we only need to pass in those that the first hash flagged as possible duplicates.
The complete recipe
Combining both of them, we get:
find /root -type f -print0 | xargs -0 md5sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}' | sed 's/.*[ \t]//' | xargs sha1sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}' | sed 's/.*[ \t]//'
The final sed removes the hashes from the output, leaving just the filenames. What you do with these duplicates is then up to you: you can delete them, archive them, or whatever.
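If you expect to run this more than once, you could wrap the pipeline in a small script; here is a sketch (the script name find-dups.sh and the default directory /root are just examples):

#!/bin/sh
# find-dups.sh - list probable duplicate files under the given directory
# usage: ./find-dups.sh /some/directory
ROOT="${1:-/root}"
# the duplicate-grouping awk script used after each hashing pass
GROUP='{ if ($1==prev) { if (prevfile) print prevfile; print $0; prevfile="" } else { prev=$1; prevfile=$0 } }'
find "$ROOT" -type f -print0 | xargs -0 md5sum |
    sort -k1 | awk "$GROUP" |
    sed 's/.*[ \t]//' | xargs sha1sum |
    sort -k1 | awk "$GROUP" |
    sed 's/.*[ \t]//'

Running ./find-dups.sh /mnt/sda1 > duplicates.txt would then save the list for later review (duplicates.txt being whatever name you like).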
Final words
This method isn't particularly efficient. It reads the entire content of every file under whichever root directory you specify (your whole hard disk, for example) and hashes all of them with the first hash (md5sum in this example). It will therefore exercise your hard disk heavily; it will probably saturate your disk I/O bandwidth and eat a lot of your CPU resources.
So do this only if it is necessary. The other programs listed in the summary are smarter; some of them check file sizes first (if the sizes differ, the files cannot be identical), some check file timestamps, some check for hardlinks, etc. They usually work better if you have a lot of small files.
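For instance, here is a sketch of that size-check idea using GNU find's -printf (only files that share their size with at least one other file would need to be hashed at all):

find /root -type f -printf '%s\t%p\n' | sort -n | awk -F'\t' '{ if ($1==prevsize) { if (prevfile) print prevfile; print $2; prevfile="" } else { prevsize=$1; prevfile=$2 } }'

This prints the paths of all files whose size is shared with another file; feeding only those paths into the md5sum pipeline above would save a lot of reading on a disk where most file sizes are unique.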
Nevertheless, if you find that you regularly need to find duplicate files, you may be better served by using a filesystem that can store duplicate files efficiently instead, like zfs or btrfs.
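For example, on ZFS deduplication can be switched on per pool or dataset (tank here is just a hypothetical pool name, and ZFS dedup needs a lot of RAM, so read the documentation before enabling it):

zfs set dedup=on tank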
Note: original blog post here.