When Approximation is Good Enough

We engineers are generally paranoid about the accuracy of results. However, there are many real-life situations where such paranoia is unnecessary: the extra accuracy is simply not worth the latency and processing cost required to achieve it.

Real Life Examples

It's best to consider some real-life examples to put things in perspective. Here are some.

  1. Number of unique web pages visited in 1 hour
  2. Top 5 videos watched in last 12 hours
  3. Top 10 news stories browsed in last 30 minutes
  4. Whether a particular online coupon has already been redeemed
  5. Number of web sessions with length above the 75th percentile
  6. List of cars within a radius of some location

 

Approximate Operations

The examples above each require one of the following operations.

  1. Count of unique items
  2. Count of frequent items
  3. Whether an item belongs to a set
  4. Histogram and percentile
  5. Aggregate query
  6. Range query

These operations can be performed in traditional ways, with the data stored in a database. Many databases even natively support the operations listed above.

However, we need fast responses to these queries, and we are willing to give up accuracy up to a point. This is where approximate algorithms come into the picture.
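To make one of these operations concrete, here is a minimal sketch of approximate set membership (operation 3) using a Bloom filter. The bit-array size, hash count, and the way the second hash is derived are all illustrative choices, not tuned values.

```java
import java.util.BitSet;

// Minimal Bloom filter: answers "definitely not in set" or "probably in set".
class BloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits
    private final int numHashes;  // number of hash probes per item

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th probe index from two base hashes (double hashing);
    // the second hash here is a crude rotation, chosen only for brevity.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    // False positives are possible; false negatives are not.
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}
```

The redeemed-coupon check (example 4) maps directly onto this: add each redeemed coupon code, then probe before accepting a new redemption. A "no" answer is always correct; a "yes" answer is correct with high probability.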

 

Approximate Algorithms

These algorithms are characterized as follows. For an in-depth overview of these techniques, please refer to my earlier post. For a specific application, counting unique mobile app users, here is another post.

  1. They operate on in-memory data structures, with a memory footprint that stays within reasonable bounds.
  2. They ingest new data into the data structure through some algorithm.
  3. They provide an upper bound on the error in query results, along with a lower bound on the probability that the bound holds. For example, the error is less than 5% with probability 0.98 or more.
  4. Collectively, these algorithms support the operations listed above.
  5. They are typically deployed in a fast, real-time stream processing context.
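Point 3 can be made concrete with a Count-Min Sketch for frequency counting: choosing width w = ceil(e/ε) and depth d = ceil(ln(1/δ)) guarantees the estimate exceeds the true count by at most εN (N being the total items seen) with probability at least 1 − δ. A minimal sketch follows; the seed-XOR hashing is a simplification of the pairwise-independent hashing the analysis assumes.

```java
import java.util.Random;

// Count-Min Sketch: approximate frequency counts with a bounded overestimate.
// With width w = ceil(e / epsilon) and depth d = ceil(ln(1 / delta)),
// estimate <= trueCount + epsilon * N with probability >= 1 - delta.
class CountMinSketch {
    private final long[][] table;
    private final int width;
    private final int depth;
    private final int[] seeds;

    CountMinSketch(double epsilon, double delta) {
        this.width = (int) Math.ceil(Math.E / epsilon);
        this.depth = (int) Math.ceil(Math.log(1.0 / delta));
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42);  // fixed seed for reproducibility
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextInt();
    }

    private int bucket(String item, int row) {
        // Crude per-row hash: XOR with a row seed; a real implementation
        // would use a stronger hash family.
        return Math.floorMod(item.hashCode() ^ seeds[row], width);
    }

    void add(String item) {
        for (int row = 0; row < depth; row++) table[row][bucket(item, row)]++;
    }

    // Never underestimates; may overestimate due to hash collisions.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

This is the natural building block for the top-videos and top-stories examples: keep a sketch of per-item counts and a small candidate list of current heavy hitters.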

These algorithms employ various techniques, including hashing and statistical estimation.
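As one example of combining hashing with a statistical estimate, a linear (probabilistic) counter estimates the number of unique items (operation 1) from the fraction of bits left unset in a hashed bitmap, via n ≈ −m·ln(unset/m). The bitmap size below is illustrative; it should exceed the expected cardinality.

```java
import java.util.BitSet;

// Linear counting: estimate distinct items from the fraction of unset bits.
class LinearCounter {
    private final BitSet bits;
    private final int m;  // bitmap size; should exceed the expected cardinality

    LinearCounter(int m) {
        this.bits = new BitSet(m);
        this.m = m;
    }

    void add(String item) {
        // Duplicates hash to the same bit, so they do not move the estimate.
        bits.set(Math.floorMod(item.hashCode(), m));
    }

    // Estimate: n ~ -m * ln(unsetBits / m)
    long estimate() {
        int unset = m - bits.cardinality();
        if (unset == 0) return m;  // bitmap saturated; estimate unusable
        return Math.round(-m * Math.log((double) unset / m));
    }
}
```

This fits the unique-pages-per-hour example directly; for very large cardinalities in bounded memory, a HyperLogLog-style counter is the usual next step.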

Reliable Real Time Stream Processing

As mentioned, these algorithms are typically deployed in a real-time stream processing environment.

As data is processed, the algorithm updates an in-memory probabilistic data structure. Queries are answered by probing the same in-memory structure.
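This update/probe pattern can be illustrated with a fixed-bucket histogram for the session-length example (operation 4). The bucket count and width are illustrative; the percentile answer is accurate only to within one bucket width.

```java
// Fixed-bucket histogram: stream events increment a bucket (update path);
// percentile queries probe the same in-memory structure (query path).
class StreamHistogram {
    private final long[] counts;
    private final double bucketWidth;
    private long total = 0;

    StreamHistogram(int numBuckets, double bucketWidth) {
        this.counts = new long[numBuckets];
        this.bucketWidth = bucketWidth;
    }

    // Update path: called once per stream event.
    void add(double value) {
        int b = (int) (value / bucketWidth);
        if (b >= counts.length) b = counts.length - 1;  // clamp overflow
        counts[b]++;
        total++;
    }

    // Query path: approximate value at quantile q (0 < q <= 1),
    // returned as the upper edge of the bucket that crosses the target rank.
    double quantile(double q) {
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int b = 0; b < counts.length; b++) {
            seen += counts[b];
            if (seen >= target) return (b + 1) * bucketWidth;
        }
        return counts.length * bucketWidth;
    }
}
```

Counting sessions above the 75th percentile then reduces to one probe for the threshold plus a running counter; structures like t-digest give the same answer without fixing bucket ranges up front.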

Depending on the stream processing system and the state management functions it provides, the implementation of these algorithms may or may not be reliable.

With Storm, there is no native state management or state persistence. If a node crashes, the in-memory data structure is gone, along with the information it gleaned from the incoming data stream.

With Spark Streaming, the checkpointing and state management features can be leveraged to provide a reliable implementation.

Implementation

Many implementations of approximate algorithms are available as a plain Java API in my GitHub OSS project hoidla.

