Save your bandwidth
In my quest to understand how scalable architectures are made, I have started reading all the tech talks presented by companies that work at worldwide scale. Today, I bumped into Dropbox.
I won't go into their architecture details; I'd rather look at how efficiently they sync their data. What I am going to tell you might be outdated, but it is nevertheless a starting point.
Ever heard of the rsync utility? I hadn't either until today. It is a great tool for keeping your folders and files in sync, not by merely diffing two files but by employing some logical optimizations.
Let me ask you: how would you sync a 1 GB file on computer A with an almost identical copy already present on computer B? Please do me the favor of thinking about it, and don't come back saying "Hey, use the rsync tool". I thought about it, then I gave up. Giving up is okay, as long as you at least thought about it.
Anyway, from what I've grasped, here it is:
- The file is split into smaller chunks
- Two checksums of each chunk are produced: one cheap in terms of CPU (more collisions, e.g. a rolling hash) and one CPU intensive (fewer collisions, e.g. MD5)
- Rather than sending the file chunks themselves for diffing, only these checksums are sent across
- First the rolling checksum is checked for equality
- If different, send the chunk
- If the same, check the MD5 hash and send the chunk only if that hash differs
The chunks are transferred along with the necessary metadata, such as the position in the file each chunk belongs to. It turns out this creates an identical copy of your file and saves a hell of a lot of bandwidth.
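The steps above can be sketched in a few lines of Python. This is only an illustration of the idea as I described it, not rsync's actual implementation: the function names and the tiny `CHUNK_SIZE` are made up for the example, `zlib.adler32` stands in for the cheap rolling-style checksum, and chunks are compared at fixed offsets only (real rsync also slides the weak checksum across every byte offset, which this sketch does not attempt).

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # hypothetical, tiny on purpose so the example is easy to follow


def chunk_signatures(data: bytes, size: int = CHUNK_SIZE):
    """Return one (weak, strong) checksum pair per fixed-size chunk."""
    sigs = []
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        weak = zlib.adler32(chunk)               # cheap, collides more often
        strong = hashlib.md5(chunk).hexdigest()  # costlier, collides far less often
        sigs.append((weak, strong))
    return sigs


def chunks_to_send(source: bytes, dest_sigs, size: int = CHUNK_SIZE):
    """Return the (offset, chunk) pairs whose checksums differ on the destination."""
    updates = []
    for index, offset in enumerate(range(0, len(source), size)):
        chunk = source[offset:offset + size]
        theirs = dest_sigs[index] if index < len(dest_sigs) else None
        if theirs is None or zlib.adler32(chunk) != theirs[0]:
            # Weak checksum differs (or the chunk is missing): send it.
            updates.append((offset, chunk))
        elif hashlib.md5(chunk).hexdigest() != theirs[1]:
            # Weak checksum collided but the strong one differs: send it.
            updates.append((offset, chunk))
    return updates


def apply_chunks(dest: bytes, updates, new_len: int) -> bytes:
    """Patch the destination copy in place using the received chunks."""
    buf = bytearray(dest[:new_len].ljust(new_len, b"\0"))
    for offset, chunk in updates:
        buf[offset:offset + len(chunk)] = chunk
    return bytes(buf)


source = b"aaaabbbbccccdddd"       # file on computer A
dest = b"aaaabbXbccccdddd"         # slightly different copy on computer B
updates = chunks_to_send(source, chunk_signatures(dest))
patched = apply_chunks(dest, updates, len(source))
```

Here only the second chunk travels over the wire; the other three are recognized as unchanged from their checksums alone.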
Directly from the man page (http://linux.die.net/man/1/rsync): "Rsync finds files that need to be transferred using a 'quick check' algorithm (by default) that looks for files that have changed in size or in last-modified time. Any changes in the other preserved attributes (as requested by options) are made on the destination file directly when the quick check indicates that the file's data does not need to be updated." So is rsync just looking for changed file attributes, or is it internally carrying out these small split-checksum calculations?
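For comparison, the size and last-modified test that the quoted passage describes can be sketched as below. The helper name `needs_transfer` is hypothetical, and comparing modification times at whole-second granularity is an assumption made for the example, not something taken from rsync itself.

```python
import os


def needs_transfer(src_path: str, dst_path: str) -> bool:
    """Size-and-mtime comparison in the spirit of the quoted "quick check"."""
    if not os.path.exists(dst_path):
        return True  # nothing on the destination yet
    src = os.stat(src_path)
    dst = os.stat(dst_path)
    # Assumption: compare modification times at whole-second granularity.
    return src.st_size != dst.st_size or int(src.st_mtime) != int(dst.st_mtime)
```

A check like this only decides whether a file is a candidate for transfer at all; it says nothing about which parts of the file have changed.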