File De-duplication: Techniques Commonly Used

If you have ever tried to find a program to search for duplicate files, you may have concluded that it’s not as simple a task as it first seems. It turns out there are many different ideas of what constitutes a duplicate, just as there are many different ways of finding them. I won’t be focusing on any particular application in this tip; instead, I’ll focus on the techniques commonly used for file de-duplication.

File name and other metadata

Metadata has what I would consider one of the best definitions of any word: data that describes data. The file is the data; the metadata is the file’s size, name, extension, date created, date last modified, and so on. In the most basic sense, you could consider files with the same name to be duplicates. Having the same name in no way guarantees they are duplicates, but if you know your files it can be useful, especially for things like document drafts or different versions of the same file. File name and other metadata are not typically used on their own, but in conjunction with other methods or as supplemental information.
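
As a rough illustration, here is a minimal Python sketch that groups files by name and size. The root directory and the choice of grouping key are just assumptions for the example; real tools combine metadata with the other methods described below.

    import os
    from collections import defaultdict

    def group_by_name_and_size(root):
        """Group files under root by (file name, size in bytes)."""
        groups = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue  # unreadable or vanished file; skip it
                groups[(name, size)].append(path)
        # Keep only groups with more than one path -- the duplicate candidates
        return {key: paths for key, paths in groups.items() if len(paths) > 1}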

File content byte-by-byte

Another basic method is a simple file comparison, reading both files byte-by-byte. This finds files that are binary identical, which is what is typically meant when referring to a duplicate. It is the most accurate method; technically, it is the only method that is 100% accurate. It is also the slowest, because it compares every byte of the files. If you are only comparing two files, that is not a big deal, but scanning a whole file system this way can take significantly longer: days, or even weeks.
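
A minimal sketch of the idea in Python, comparing two files in fixed-size chunks (the chunk size is an arbitrary choice for the example):

    import os

    def files_identical(path_a, path_b, chunk_size=1 << 20):
        """Return True if the two files are byte-for-byte identical."""
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False  # different sizes can never be identical
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a = fa.read(chunk_size)
                b = fb.read(chunk_size)
                if a != b:
                    return False
                if not a:  # both files exhausted at the same point
                    return True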

File content hash

The most common method is comparing file hashes, and it has several advantages. While a byte-by-byte comparison is 100% accurate, it is extremely slow and expensive in terms of system resources. Hashing is “accurate enough”: while there is a chance that two files that are not binary identical could have the same hash, it is so astronomically unlikely that it usually isn’t worth considering. So unlikely, in fact, that the likelihood is sometimes measured in terms of the age of the universe: if you hashed a maxed-out 64-bit-inode filesystem every millisecond, you would need about 780 quadrillion times the age of the universe to have a 50% chance of finding a single collision in a single filesystem with a 256-bit hash. Of course, the likelihood varies depending on the hashing algorithm used, but still. The speed also varies by algorithm, so it’s a good idea to read up on which algorithms are supported by whichever tool you are considering.
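
A minimal sketch using Python’s hashlib; SHA-256 here is just one reasonable choice of algorithm for the example:

    import hashlib

    def file_hash(path, chunk_size=1 << 20):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    # Files with the same digest are treated as duplicates; files with
    # different digests are definitely different.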

Heuristics

A more advanced technique is to use some kind of heuristics. That may not be the right word, but it’s the one I’m sticking with. This method is especially useful for multimedia files. You may have an image, sound, or video file in different formats; for example, the same photo in PNG and JPG. They wouldn’t be binary identical, but in a practical sense you could consider them duplicates. Fewer programs use this technique because it is more specialized and a lot more difficult to implement than the standard methods.
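
For images, this kind of matching is often done with perceptual hashing. As a rough sketch, and assuming the third-party Pillow and imagehash packages (not something any particular de-duplication tool necessarily uses), it might look like this:

    from PIL import Image   # Pillow, a third-party package
    import imagehash        # imagehash, a third-party package

    def looks_like_same_image(path_a, path_b, max_distance=5):
        """Compare two images by perceptual hash rather than raw bytes."""
        hash_a = imagehash.average_hash(Image.open(path_a))
        hash_b = imagehash.average_hash(Image.open(path_b))
        # Subtracting two hashes gives the Hamming distance in bits;
        # a small distance means the images look alike, even across formats.
        return (hash_a - hash_b) <= max_distance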

More advanced techniques

Some programs use even more advanced techniques, usually a combination of several. Basic programs will hash all the files and then compare the hashes. This seems straightforward but isn’t very efficient. For one, files of different sizes can’t possibly be binary identical, so if a file is the only file of that exact size it can’t have a duplicate, and hashing it is a waste. Additionally, many hashing algorithms produce intermediate hashes: as the file is read and hashed, a temporary hash is produced that keeps changing until a finalized hash is produced. This is typical of cryptographic hashes. So instead of hashing the entirety of a file, it is more efficient to hash only until the intermediate hashes differ. There is a trade-off, though. While both comparing hashes and comparing byte-by-byte read the entirety of a file, it is the comparison itself that makes the byte-by-byte method slow, and doing too many comparisons of intermediate hashes would also slow things down significantly. On the other hand, comparing intermediate hashes means the program doesn’t have to hash the files completely to determine that they are different, only to confirm that they are the same. This makes it more efficient for larger files.
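
The intermediate-hash trick is awkward to show in a few lines, but a simplified version of the same idea, filtering candidates from cheap checks to expensive ones (unique size first, then a hash of just the first few kilobytes, then a full hash), is sketched below. The 4 KiB cutoff and the use of SHA-256 are arbitrary choices for the example.

    import hashlib
    import os
    from collections import defaultdict

    def _digest(path, limit=None, chunk_size=1 << 20):
        """SHA-256 of the first `limit` bytes of a file (all of it if limit is None)."""
        h = hashlib.sha256()
        remaining = limit
        with open(path, "rb") as f:
            while True:
                size = chunk_size if remaining is None else min(chunk_size, remaining)
                chunk = f.read(size)
                if not chunk:
                    break
                h.update(chunk)
                if remaining is not None:
                    remaining -= len(chunk)
                    if remaining <= 0:
                        break
        return h.hexdigest()

    def find_duplicates(paths):
        """Group duplicate files by size, then partial hash, then full hash."""
        by_size = defaultdict(list)
        for path in paths:
            by_size[os.path.getsize(path)].append(path)

        duplicate_groups = []
        for same_size in by_size.values():
            if len(same_size) < 2:
                continue  # a unique size can't have a duplicate, so never hash it
            by_head = defaultdict(list)
            for path in same_size:
                by_head[_digest(path, limit=4096)].append(path)  # first 4 KiB only
            for same_head in by_head.values():
                if len(same_head) < 2:
                    continue
                by_full = defaultdict(list)
                for path in same_head:
                    by_full[_digest(path)].append(path)  # full hash for the few survivors
                duplicate_groups += [g for g in by_full.values() if len(g) > 1]
        return duplicate_groups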

A note on links

Another thing worth considering is how a program handles links. If you don’t use links, it won’t really matter for you, and if you don’t know whether you use links, then you don’t. System files use them frequently, but the average user doesn’t. There are two types: hard links and soft links (sometimes called symlinks). A soft link is a file that acts as a pointer to another file. This sounds like a shortcut file in Windows, but it is completely different. Some programs do not follow links, but most will. Hard links are different because they are the exact same file: not copies, but the same file located at different paths in the file system. As such, hard links don’t take up additional drive space. If your program doesn’t handle hard links, it will consider the files to be duplicates. That isn’t necessarily a bad thing, but if your ultimate goal in removing duplicates is to reclaim drive space, then deleting a hard link doesn’t really help with that.
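
On POSIX-like systems, two paths are hard links to the same file when they share a device number and an inode number, which a de-duplicator can check cheaply before doing anything else. A minimal sketch:

    import os

    def are_hard_linked(path_a, path_b):
        """True if both paths point at the exact same underlying file."""
        stat_a, stat_b = os.stat(path_a), os.stat(path_b)
        return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)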

Conclusion

As you can see, there’s a bit more to finding duplicate files than it initially seems. This can make it difficult to find a good program that does what you want. In my experience, it’s difficult to even find details on how a program operates beyond the basics. It may compare file hashes, but many programs just dumbly hash everything and don’t employ any advanced techniques to reduce unnecessary work. I will say, one of the best, or at least fastest and most efficient, programs out there is rmlint. Unfortunately, it’s only available on Linux and only finds the files, leaving you to remove them. The best thing to do is really just to try several and see what works for you.
