Approximating Big Data


I have always held the belief that a "good enough" algorithm is sufficient to solve a problem efficiently. By "good enough", the first thing that comes to mind is an algorithm with polynomial space and time. This is certainly true, but there is a catch. We make two intrinsic assumptions here:

1. The amount of data or information required to solve the problem fits easily into memory.

2. The algorithm needs to provide an accurate solution to the problem.

Okay, now imagine a world where instead of events happening in some deterministic order, they happen randomly, by chance. It sounds scary and entertaining at the same time, doesn't it? One moment you win a million-dollar lottery, and the next moment an asteroid strikes the earth! You may say that this is an improbable world, but to scare you a little more, it is not much different from the world we live in. At the sub-atomic level, events are governed by quantum mechanics, which is probabilistic in nature (remember Schrödinger's Cat). The time scale of events at the subatomic level is very small, so they appear completely chaotic (the Heisenberg Uncertainty Principle), whereas in our macroscopic world the time scale of events is much greater, and thus although events are random, they appear to be governed by our own destiny.

Thus we see that, based on our assumptions, we can alter reality as we perceive it. Think of the human brain as a computer that runs algorithms to process information in order to solve day-to-day problems. In our real world, let's say the time scale at which the brain can distinguish one event from another is 1 second, i.e. we can process 60 events in a minute. Now suppose you shrink yourself to the quantum realm and your brain can process events with a timescale of 1 nanosecond. That means you need to process 60 billion events per minute. If your processing algorithms remain the same, your brain would go on accumulating information until it gets full and bursts!

Assumption 1 above seems to have been completely invalidated by the advent of "Big Data". The amount of data being generated every second easily outpaces the ongoing growth in RAM sizes. A new set of requirements was bound to arise:

a. The amount of memory you can use is only O((log N)^k) for some constant k, where N is the size of the actual data.

b. The running time of algorithms should also be sub-polynomial, or preferably constant, i.e. O(1).

But you know that your algorithm to find the unique elements in a data set satisfies neither of the above two criteria. This is where even "good" algorithms break down. Fortunately, the data is so huge that a small percentage error in the number of unique elements is acceptable within certain bounds. Some very common big data problems dealing with streaming data:

1. Given a billion search queries on Google, unsorted, find whether the query "approximation algorithms" was made at any time in the last one month.

2. Compute the number of unique visitors to Facebook in the last 90 days. Number of total visits per day exceeds a million in this case.

3. Again given a billion search queries, find the frequency of the top 100 most frequently queried phrases.

4. Find the median number of video views on YouTube per day for the last one year.

5. You need to re-target users with ads for laptops that they just browsed on Amazon. How will you design your Demand-Side Platform system to identify returning users?

6. Given a social graph of people and their preferences in music and movies, determine whether there is a path from a given person to a musician.

The frequent problems arising with large streaming data can be broadly categorized as follows:

1. Searching in the stream. Bloom filters are probabilistic data structures for testing whether an element has appeared in a stream.

2. Counting the number of distinct elements in the stream. HyperLogLog is a space-efficient algorithm for approximately counting distinct elements.

3. Frequency of elements in a stream. Count-Min sketch is a probabilistic data structure for estimating the frequency of elements in streaming data.

4. Heavy hitters, or the most frequent elements in a stream. A Count-Min sketch along with a min-heap can be used to find the heavy hitters efficiently.

5. Order statistics (median, quantiles, etc.). Sample uniformly from the data stream using Reservoir Sampling, then use a selection technique such as the QuickSelect algorithm.

6. Queries in dynamic and random graphs for social network analysis.
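To make category 1 above concrete, here is a minimal Bloom filter sketch in Python. The class name, the bit-array size, and the salted SHA-256 hashing scheme are all my own illustrative choices, not a reference implementation; real systems typically use faster non-cryptographic hashes such as MurmurHash.

```python
import hashlib

class BloomFilter:
    """Bit array probed by k hash functions: false positives are
    possible, but false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))
```

Because a Bloom filter can report a false positive but never a false negative, a `False` answer from `might_contain` is definitive, which is exactly what makes it safe for the "was this query ever made?" problem.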
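For order statistics (category 5 above), reservoir sampling keeps a uniform random sample of fixed size k from a stream of unknown length, and the median of the sample then approximates the median of the whole stream. This is a minimal sketch; the function name and the fixed random seed are my own choices, the latter only for reproducibility.

```python
import random
import statistics

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from a stream of
    unknown length, using O(k) memory."""
    rng = rng or random.Random(42)  # fixed seed only for reproducibility
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1) by overwriting
            # a random slot.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Approximate the median of a million-element stream from a 1000-item sample.
sample = reservoir_sample(range(1_000_000), 1000)
approx_median = statistics.median(sample)
```

The reservoir never exceeds k items, so memory stays constant no matter how long the stream runs.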

Probabilistic data structures and algorithms for streaming big data are characterized by the fact that they take up very little memory, O(log N), compared to the O(N) of traditional data structures. Their running time complexities are also sub-polynomial. But since they are approximation algorithms meant to be used only with large streaming data, they are measured in terms of (ϵ, δ): the result has an error rate of ϵ with a confidence probability of 1 − δ. For a fixed ϵ, a lower value of δ implies that the approximation algorithm is more stable and should be preferred.
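The (ϵ, δ) guarantee maps directly onto the dimensions of a Count-Min sketch: a table of width ⌈e/ϵ⌉ and depth ⌈ln(1/δ)⌉ yields estimates that overshoot the true frequency by at most ϵN with probability at least 1 − δ. A minimal sketch in Python follows; the class name and the salted SHA-256 hashing are my own illustrative choices.

```python
import hashlib
import math

class CountMinSketch:
    """Estimates element frequencies with error at most eps * N,
    with probability at least 1 - delta."""

    def __init__(self, eps=0.01, delta=0.01):
        self.width = math.ceil(math.e / eps)       # columns per row
        self.depth = math.ceil(math.log(1 / delta))  # number of rows
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cols(self, item):
        # One independent column per row, via a salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Each row overestimates due to collisions; take the minimum.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))
```

Because counters are only ever incremented, the estimate is always an upper bound on the true frequency; collisions can inflate it, never deflate it.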

Here are some resources on the internet that I found useful:

1. https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

2. https://en.wikipedia.org/wiki/Streaming_algorithm

3. http://stackoverflow.com/questions/14403383/bloom-filter-usage

4. http://www.quora.com/What-are-the-best-applications-of-Bloom-filters

5. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

6. http://stefanheule.com/papers/edbt13-hyperloglog.pdf

7. http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf

8. http://info.prelert.com/blog/q-digest-an-algorithm-for-computing-approximate-quantiles-on-a-collection-of-integers

9. http://people.cs.umass.edu/~mcgregor/papers/13-graphsurvey.pdf

10. http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf

11. http://www.cs.mcgill.ca/~denis/notes09.pdf
