Consolidating Logs

Consolidating logs (like logs of what we just traded) is not sexy, but it is important for audit, compliance, review, and more.  The challenge is collecting and consolidating these logs into one or two places, and doing it as often as efficiency allows.  You also absolutely need to know when it is not working, because of the rain of trouble that not being compliant brings.

In a nutshell, Remote Differential Compression (RDC), or "file synchronization", is the problem of synchronizing two files over a computer network while sending the fewest bytes possible.  It has been around since Andrew Tridgell (and others) did the groundbreaking work that led to rsync.  Since then, Microsoft has pioneered advancements as well.

Current products like rsync are great, but since they scan the whole file on every sync, they carry more overhead than is really needed.  (And, yes, to be fair, rsync does have an append mode, but it is meant to recover from dropped transfers and does not work well when the source file has changed significantly.)  Many also complain that Microsoft's RDC (aside from being a Windows-centric solution) adds more overhead than necessary over a LAN and is too slow, though a Microsoft TechNet article argues that is not the case.

Trade Logs 

Since most of our customers have a direct need for this, we embarked on building a flavor of RDC designed for the case where a file is appended to over time.  Because we know the behavior of the underlying file, we were able to build an algorithm specifically designed to handle this case very well.  We call it "append-optimized" RDC.

Append-optimized RDC

Append-optimized RDC works by first deciding whether there is a "major change" or not.  We start by looking at the two files externally: Is the destination larger than the source? Is the timestamp on the destination newer than the source?  Either of these indicates a major change.  We also look inside the files, taking a cryptographic checksum of the first several blocks and comparing it between source and destination.  Since these are largely textual logs, the chance of two different files having a matching MD5 sum on their first 4K is pretty remote.  There are some other tests, too.  I guess a magician can't share all his tricks.
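
To make the idea concrete, here is a minimal Python sketch of the kind of heuristics described above.  The block size, block count, and function names are illustrative assumptions, not our actual implementation.

import hashlib
import os

BLOCK_SIZE = 4096          # size of the leading blocks we checksum (illustrative)
PREFIX_BLOCKS = 4          # how many leading blocks to compare (illustrative)

def prefix_digest(path, length):
    """MD5 of the first `length` bytes of a file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(length))
    return md5.hexdigest()

def is_major_change(source, destination):
    """Return True when the destination cannot simply be appended to."""
    src_stat, dst_stat = os.stat(source), os.stat(destination)

    # A destination that is larger or newer than the source suggests the
    # source was rotated, truncated, or rewritten: a major change.
    if dst_stat.st_size > src_stat.st_size:
        return True
    if dst_stat.st_mtime > src_stat.st_mtime:
        return True

    # Checksum the first few blocks of each file; if they differ, the files
    # do not share a common prefix and a full copy is needed.
    length = min(dst_stat.st_size, BLOCK_SIZE * PREFIX_BLOCKS)
    return prefix_digest(source, length) != prefix_digest(destination, length)

If is_major_change returns True, the sync falls back to a full copy; otherwise it proceeds to the append path described next.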

Once we can assume the files are identical except for the bytes appended to the source, we can simply copy the bytes not yet in the destination, applying proper compression, bandwidth limits, and so on.
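
A rough sketch of that append path, again with assumed names: the destination's current size tells us where to start reading the source, and each chunk is compressed before it goes over the wire.  The send callable stands in for whatever transport actually moves the bytes to the collector.

import zlib

CHUNK = 64 * 1024  # read the source tail in 64 KB chunks (illustrative)

def send_tail(source, dest_size, send):
    """Stream the bytes the destination is missing, compressed as we go."""
    compressor = zlib.compressobj()
    with open(source, "rb") as f:
        f.seek(dest_size)              # skip everything the destination already has
        while chunk := f.read(CHUNK):
            send(compressor.compress(chunk))
    send(compressor.flush())           # emit any compressed bytes still buffered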

Performance

Like most things in life, it's all more complicated than it should be, but in the end, the performance does not lie.  

Our internal testing shows we can collect logs being written at a megabyte per second from several hundred servers with acceptable impact.

We contrived a "worst case" by firing a sync every 30 seconds on a dozen or so servers with writers running at 1 megabyte per second.  Individual sync runs take about 1800 milliseconds, and compression is generally greater than 67% for text files but only about 16% for random binary data.  Network impact is not zero, but it is well in line with expectations given the write rate to the logs.  Bandwidth limiting on the transfer keeps the spikiness to a minimum but does cause longer individual runs as the transfer rate is choked down.
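
To illustrate why bandwidth limiting stretches out individual runs, here is a minimal pacing loop.  It is not our product's limiter, just the general idea: sending is slowed with short sleeps so the average rate never exceeds the cap, which smooths network spikes at the cost of a longer sync.

import time

def throttled_send(chunks, send, limit_bytes_per_sec):
    """Send chunks no faster than limit_bytes_per_sec."""
    start = time.monotonic()
    sent = 0
    for chunk in chunks:
        send(chunk)
        sent += len(chunk)
        # Sleep whenever we are ahead of the allowed pace.
        expected_elapsed = sent / limit_bytes_per_sec
        actual_elapsed = time.monotonic() - start
        if expected_elapsed > actual_elapsed:
            time.sleep(expected_elapsed - actual_elapsed)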

In the end, we hope this empowers our customers to go build what they need to support their businesses.
