Find duplicate files

I happened to see a request in the Puppy forum asking how to find duplicate files. Sure, there are specialised tools to do that (fdupes, doubles, etc.), but what if you don't have access to them?

Turns out it's not difficult at all. This will do it:
find / -type f -print0 | xargs -0 md5sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}'


What the code above does is basically: find all files under / (which you can change to something else, e.g. /mnt/sda1, the mountpoint of your disk), compute the md5sum of each of these files (you can use sha1sum if you prefer, or any other hash program), sort the results by hash, and display those that have identical hashes.
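If you prefer the same thing spread over several lines with comments (functionally identical to the one-liner above):

find / -type f -print0 |   # list every regular file under /
xargs -0 md5sum |          # hash each file (sha1sum works just as well)
sort -k1 |                 # bring identical hashes next to each other
awk '{
  if ($1 == prevmd5) {              # same hash as the previous line
    if (prevfile) print prevfile    # first duplicate in a group: also show the original
    print $0
    prevfile = ""
  } else {                          # a new hash: remember this line
    prevmd5 = $1
    prevfile = $0
  }
}'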

Of course, running through all the files in your filesystem and computing the md5sum of *all* of them is going to take quite some time, grind your hard disks, saturate your I/O, and tax your CPU.
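If the scan is too heavy, one way to trim it down (a sketch, assuming GNU find; /mnt/sda1 and the 1 MB threshold are just examples) is to stay on a single filesystem with -xdev, so /proc, /sys and other mounts are skipped, and to only hash files above a certain size:

find /mnt/sda1 -xdev -type f -size +1M -print0 | xargs -0 md5sum | sort -k1 | awk '{ if ($1==prevmd5) { if (prevfile) print prevfile; print $0; prevfile=""} else { prevmd5=$1; prevfile=$0 }}'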

And having identical hashes doesn't always mean that the files are identical (although the chance that they aren't is very small); so if you do this with the intent of deleting duplicate files, you may want to extend the code a little to do a full file comparison when the hashes match.
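Here is a minimal sketch of that extension, assuming GNU coreutils and a POSIX shell (/mnt/sda1 is just an example, and filenames containing newlines or leading spaces are not handled): whenever two consecutive lines carry the same hash, cmp re-checks the two files byte by byte before declaring them duplicates.

find /mnt/sda1 -type f -print0 | xargs -0 md5sum | sort -k1 |
while read -r hash file; do
  if [ "$hash" = "$prevhash" ] && cmp -s "$prevfile" "$file"; then
    # same hash and byte-for-byte identical
    echo "duplicate: $prevfile <-> $file"
  fi
  prevhash=$hash
  prevfile=$file
done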



Posted on 19 Feb 2014, 19:34 - Categories: Linux General