Approximating Big Data
I have always held the belief that a good enough algorithm is sufficient to solve a problem efficiently. By good enough, the first thing that comes to mind is an algorithm that runs in polynomial time and space. This is certainly true, but there is a catch: we make two implicit assumptions here:
1. The amount of data or information required to solve the problem fits easily into memory.
2. The algorithm needs to provide an exact solution to the problem.
Okay, now imagine a world where, instead of events happening in some deterministic order, they happen randomly, by chance. It looks scary and entertaining at the same time, doesn't it? One moment you win a million-dollar lottery, and the next an asteroid strikes the earth! You may say this is an improbable world, but to scare you a little more, it is not much different from the world we live in. At the sub-atomic level, events are governed by quantum mechanics, which is probabilistic in nature (remember Schrödinger's cat). The time scale of events at the subatomic level is very small, so they appear completely chaotic (the Heisenberg uncertainty principle), whereas in our macroscopic world the time scales are much greater, and so although events are random, they appear to be governed by our own destiny.
Thus we see that, based on our assumptions, we can alter reality as we perceive it. Think of the human brain as a computer that runs algorithms to process information in order to solve day-to-day problems. In our real world, say the time scale at which the brain can distinguish one event from another is 1 second, i.e. we can process 60 events in a minute. Now suppose you shrink yourself to the quantum realm, where your brain can process events on a timescale of 1 nanosecond. That means you would need to process a billion events every second, 60 billion a minute. If your algorithms for processing remain the same, your brain would go on accumulating information until it gets full and bursts!
Assumption 1 above seems to have been completely invalidated by the advent of "Big Data". The amount of data being generated every second easily outpaces the ongoing growth in RAM sizes. A new set of requirements was bound to arise:
a. The amount of memory you can use is only O((log N)^k) for some constant k, where N is the size of the actual data.
b. The running time of the algorithm should also be sub-polynomial, preferably constant, i.e. O(1) per element.
But you know that your algorithm for finding the unique elements in a data set satisfies neither of these criteria. This is where even "good" algorithms break down. Fortunately, when the data is this huge, a small percentage error in the number of unique elements is acceptable within certain bounds. Some very common big data problems dealing with streaming data:
1. Given a billion unsorted search queries on Google, find whether the query "approximation algorithms" was made at any time in the last one month.
2. Compute the number of unique visitors to Facebook in the last 90 days, where the total number of visits per day exceeds a million.
3. Again, given a billion search queries, find the frequencies of the 100 most frequently queried phrases.
4. Find the median number of video views on YouTube per day for the last one year.
5. You need to re-target users with ads for laptops they just browsed on Amazon. How will you design your Demand Side Platform system to identify returning users?
6. Given a social graph of people and their tastes in music and movies, determine whether there is a path from a given person to a musician.
The frequent problems arising with large streaming data can be broadly categorized as follows:
1. Searching in the stream. Bloom filters are probabilistic data structures for testing whether an element has appeared in a stream (see the first sketch after this list).
2. Counting the number of distinct elements in a stream. HyperLogLog is a space-efficient algorithm for estimating the count of distinct elements in constant time per element (second sketch below).
3. Frequency of elements in a stream. The Count-Min sketch is a probabilistic data structure for estimating the frequencies of elements in streaming data.
4. Heavy hitters, i.e. the most frequent elements in a stream. A Count-Min sketch combined with a min-heap can be used to track the heavy hitters efficiently (third sketch below, together with the Count-Min sketch).
5. Order statistics (mean, median, quantiles, etc.). Sample uniformly from the stream using reservoir sampling, then apply a selection technique such as the QuickSelect algorithm to the sample (fourth sketch below).
6. Queries in dynamic and random graphs for social network analysis.
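To make these concrete, here is a minimal Bloom filter sketch in Python for problem 1 above. The bit-array size m, the number of hash functions k, and the salted-MD5 hashing are illustrative assumptions, not tuned values; a real deployment would size m and k from the expected number of elements and the target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k          # m bits, k hash functions (illustrative)
        self.bits = bytearray(m // 8)

    def _hashes(self, item):
        # Derive k hash positions from salted MD5 digests.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._hashes(item))

bf = BloomFilter()
bf.add("approximation algorithms")
print(bf.might_contain("approximation algorithms"))  # True
print(bf.might_contain("quantum mechanics"))         # almost surely False
```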
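Here is a toy HyperLogLog sketch in the same spirit. It follows the register-plus-harmonic-mean scheme of the Flajolet et al. paper linked below, but the precision p = 10 and the SHA-1-based 64-bit hash are assumptions for illustration, and the small- and large-range corrections from the paper are omitted.

```python
import hashlib

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, item):
        # 64-bit hash: the first p bits choose a register, the rest give the rank.
        x = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - self.p)
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Normalized harmonic mean of 2^register values.
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m * z

hll = HyperLogLog()
for i in range(100000):
    hll.add(f"user-{i}")
print(int(hll.estimate()))  # roughly 100000, typically within a few percent
```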
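For frequencies and heavy hitters, here is a minimal Count-Min sketch with a lazily updated min-heap on top. The width and depth follow the standard sizing w = ⌈e/ϵ⌉ and d = ⌈ln(1/δ)⌉; the default ϵ and δ and the salted-MD5 row hashes are illustrative assumptions.

```python
import hashlib
import heapq
import math

class CountMinSketch:
    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)          # width controls the error eps
        self.d = math.ceil(math.log(1 / delta))   # depth controls confidence 1 - delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, item, row):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, item):
        for row in range(self.d):
            self.table[row][self._hash(item, row)] += 1

    def count(self, item):
        # Never undercounts; may overcount because of hash collisions.
        return min(self.table[row][self._hash(item, row)] for row in range(self.d))

def heavy_hitters(stream, k):
    cms = CountMinSketch()
    heap = []   # min-heap of (estimated count, item); may hold stale entries
    best = {}   # latest estimate for the items currently tracked
    for item in stream:
        cms.add(item)
        est = cms.count(item)
        if item in best or len(best) < k or est > heap[0][0]:
            best[item] = est
            heapq.heappush(heap, (est, item))
        while len(best) > k:                # evict until only k items are tracked
            est0, it0 = heapq.heappop(heap)
            if best.get(it0) == est0:       # entry is current, so really evict it
                del best[it0]
    return sorted(best.items(), key=lambda kv: -kv[1])

words = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(heavy_hitters(words, k=2))  # [('a', 4), ('b', 2)]
```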
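And for order statistics, a sketch of reservoir sampling with a sample median on top. The sample size k = 1000 and the synthetic log-normal "view count" stream are assumptions for illustration, and a full sort of the sample stands in for the QuickSelect step mentioned above.

```python
import random

def reservoir_sample(stream, k=1000):
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (n + 1),
            # which keeps the sample uniform over the stream so far.
            j = random.randint(0, n)
            if j < k:
                reservoir[j] = item
    return reservoir

def approx_median(stream, k=1000):
    sample = sorted(reservoir_sample(stream, k))  # QuickSelect would avoid the full sort
    return sample[len(sample) // 2]

# e.g. a year's worth of synthetic daily video view counts arriving as a stream
views = (random.lognormvariate(8, 2) for _ in range(1_000_000))
print(approx_median(views))
```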
Probabilistic data structures and algorithms for streaming big data are characterized by the fact that they use far less memory, O(log N), than traditional data structures, O(N), and their running times are also sub-polynomial. But since they are approximation algorithms meant to be used only with large streaming data, they are measured in terms of (ϵ, δ): the result has an error rate of ϵ with a confidence probability of 1 − δ. For a fixed ϵ, a lower value of δ means the approximation is more reliable and should be preferred.
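As a quick worked example of how (ϵ, δ) translates into memory, here is the standard Count-Min sketch sizing arithmetic; the targets ϵ = δ = 0.001 are arbitrary illustrative values.

```python
import math

# Standard Count-Min sizing: width = ceil(e / eps), depth = ceil(ln(1 / delta)).
eps, delta = 0.001, 0.001
width = math.ceil(math.e / eps)         # 2719 counters per row
depth = math.ceil(math.log(1 / delta))  # 7 rows
print(width, depth, width * depth)      # ~19,000 counters, independent of N
```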
Here are some resources on the internet that I found useful:
1. https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
2. https://en.wikipedia.org/wiki/Streaming_algorithm
3. http://stackoverflow.com/questions/14403383/bloom-filter-usage
4. http://www.quora.com/What-are-the-best-applications-of-Bloom-filters
5. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
6. http://stefanheule.com/papers/edbt13-hyperloglog.pdf
7. http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf
8. http://people.cs.umass.edu/~mcgregor/papers/13-graphsurvey.pdf
9. http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf