When Approximation is Good Enough

We engineers are generally paranoid about the accuracy of results. However, there are many real-life situations where such paranoia is unnecessary: the extra accuracy is simply not worth the latency and processing cost required to achieve it.

Real Life Examples

It's best to consider some real-life examples to put things in perspective. Here are some.

  1. Number of unique web pages visited in 1 hour
  2. Top 5 videos watched in last 12 hours
  3. Top 10 news stories browsed in last 30 minutes
  4. Whether a particular online coupon has already been redeemed
  5. Number of web sessions with length above the 75th percentile
  6. List of cars within a radius of some location

 

Approximate Operations

The examples above each require one of the following operations.

  1. Count of unique items
  2. Count of frequent items
  3. Whether an item belongs to a set
  4. Histogram and percentile
  5. Aggregate query
  6. Range query

These operations can be performed in traditional ways, with the data stored in a database. Many databases even natively support the operations listed above.

However, we need fast responses to these queries, and we are willing to give up accuracy up to a point. This is where approximate algorithms come into the picture.
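To make one of these operations concrete, here is a minimal sketch of approximate set membership (operation 3) using a Bloom filter. The bit-array size, hash count, and the way the second hash is derived are all illustrative choices, not tuned values.

```java
import java.util.BitSet;

// Minimal Bloom filter: answers "definitely not in set" or "probably in set".
class BloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits
    private final int numHashes;  // number of hash probes per item

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th probe index from two base hashes (double hashing);
    // the second hash here is a crude rotation, chosen only for brevity.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    // False positives are possible; false negatives are not.
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}
```

The redeemed-coupon check (example 4) maps directly onto this: add each redeemed coupon code, then probe before accepting a new redemption. A "no" answer is always correct; a "yes" answer is correct with high probability.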

 

Approximate Algorithms

These algorithms are characterized as follows. For an in-depth overview of these techniques, please refer to my earlier post. For a specific application, counting unique mobile app users, here is another post.

  1. They operate on in-memory data structures, with a memory footprint that stays within reasonable bounds.
  2. They ingest new data into the data structure through some algorithm.
  3. They provide an upper bound on the error in query results, along with a lower bound on the probability that the bound holds. For example, the error is less than 5% with probability 0.98 or more.
  4. Collectively, these algorithms support the operations listed above.
  5. They are typically deployed in a fast, real-time stream processing context.
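Point 3 can be made concrete with a Count-Min Sketch for frequency counting: choosing width w = ceil(e/ε) and depth d = ceil(ln(1/δ)) guarantees the estimate exceeds the true count by at most εN (N being the total items seen) with probability at least 1 − δ. A minimal sketch follows; the seed-XOR hashing is a simplification of the pairwise-independent hashing the analysis assumes.

```java
import java.util.Random;

// Count-Min Sketch: approximate frequency counts with a bounded overestimate.
// With width w = ceil(e / epsilon) and depth d = ceil(ln(1 / delta)),
// estimate <= trueCount + epsilon * N with probability >= 1 - delta.
class CountMinSketch {
    private final long[][] table;
    private final int width;
    private final int depth;
    private final int[] seeds;

    CountMinSketch(double epsilon, double delta) {
        this.width = (int) Math.ceil(Math.E / epsilon);
        this.depth = (int) Math.ceil(Math.log(1.0 / delta));
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42);  // fixed seed for reproducibility
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextInt();
    }

    private int bucket(String item, int row) {
        // Crude per-row hash: XOR with a row seed; a real implementation
        // would use a stronger hash family.
        return Math.floorMod(item.hashCode() ^ seeds[row], width);
    }

    void add(String item) {
        for (int row = 0; row < depth; row++) table[row][bucket(item, row)]++;
    }

    // Never underestimates; may overestimate due to hash collisions.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

This is the natural building block for the top-videos and top-stories examples: keep a sketch of per-item counts and a small candidate list of current heavy hitters.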

These algorithms employ various techniques, including hashing and statistical estimation.
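As one example of combining hashing with a statistical estimate, a linear (probabilistic) counter estimates the number of unique items (operation 1) from the fraction of bits left unset in a hashed bitmap, via n ≈ −m·ln(unset/m). The bitmap size below is illustrative; it should exceed the expected cardinality.

```java
import java.util.BitSet;

// Linear counting: estimate distinct items from the fraction of unset bits.
class LinearCounter {
    private final BitSet bits;
    private final int m;  // bitmap size; should exceed the expected cardinality

    LinearCounter(int m) {
        this.bits = new BitSet(m);
        this.m = m;
    }

    void add(String item) {
        // Duplicates hash to the same bit, so they do not move the estimate.
        bits.set(Math.floorMod(item.hashCode(), m));
    }

    // Estimate: n ~ -m * ln(unsetBits / m)
    long estimate() {
        int unset = m - bits.cardinality();
        if (unset == 0) return m;  // bitmap saturated; estimate unusable
        return Math.round(-m * Math.log((double) unset / m));
    }
}
```

This fits the unique-pages-per-hour example directly; for very large cardinalities in bounded memory, a HyperLogLog-style counter is the usual next step.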

Reliable Real Time Stream Processing

As mentioned, these algorithms are typically deployed in a real-time stream processing environment.

As data is processed, the algorithm updates an in-memory probabilistic data structure. Queries are answered by probing the same in-memory structure.
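This update/probe pattern can be illustrated with a fixed-bucket histogram for the session-length example (operation 4). The bucket count and width are illustrative; the percentile answer is accurate only to within one bucket width.

```java
// Fixed-bucket histogram: stream events increment a bucket (update path);
// percentile queries probe the same in-memory structure (query path).
class StreamHistogram {
    private final long[] counts;
    private final double bucketWidth;
    private long total = 0;

    StreamHistogram(int numBuckets, double bucketWidth) {
        this.counts = new long[numBuckets];
        this.bucketWidth = bucketWidth;
    }

    // Update path: called once per stream event.
    void add(double value) {
        int b = (int) (value / bucketWidth);
        if (b >= counts.length) b = counts.length - 1;  // clamp overflow
        counts[b]++;
        total++;
    }

    // Query path: approximate value at quantile q (0 < q <= 1),
    // returned as the upper edge of the bucket that crosses the target rank.
    double quantile(double q) {
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int b = 0; b < counts.length; b++) {
            seen += counts[b];
            if (seen >= target) return (b + 1) * bucketWidth;
        }
        return counts.length * bucketWidth;
    }
}
```

Counting sessions above the 75th percentile then reduces to one probe for the threshold plus a running counter; structures like t-digest give the same answer without fixing bucket ranges up front.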

Depending on the stream processing system and the state management functions it provides, the implementation of these algorithms may or may not be reliable.

With Storm, there is no native state management or state persistence. If a node crashes, the in-memory data structure is gone, along with the information it gleaned from the incoming data stream.

With Spark Streaming, the checkpointing and state management features can be leveraged to provide a reliable implementation.

Implementation

Many implementations of approximate algorithms are available as a plain Java API in my GitHub OSS project hoidla.

