Landscape of Big Data: Context and Case Studies
The first in our Landscape of Big Data series, this post describes some of the power and potential of big data analytics through two examples.
The first example covers research that used publicly available stock and flight data to assess accuracy and trustworthiness; the second looks at whether and how taxis can serve as probes for predicting traffic in Singapore.
Data Trustworthiness
Not all big data problems are the same, so the first step in categorizing your data is to assess it against the 3V's: Volume, Velocity, and Variety. To get a sense of how variety can affect your analyses, and of a methodology for thinking about data, it is useful to look at the SUNY and AT&T Labs research described in the article "Truth Finding on the Deep Web: Is the Problem Solved?" (Li, Dong, Meng, Lyons, and Srivastava).
The team used publicly available stock and flight data to consider the following questions (a toy version of the consistency check is sketched just after the list):
- Were the data consistent?
- Were they accurate?
- Did the majority of sources provide the correct data?
- Was there an authoritative source that could be trusted?
- Were sources sharing and copying data from each other?
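To make the consistency question concrete, here is a minimal sketch in Python. The sources and quotes are made up for illustration, not taken from the study; it simply checks whether sources agree on a value and how strong the majority is.

```python
from collections import Counter

# Hypothetical per-source quotes for one stock attribute; the study drew
# such values from dozens of deep-web finance and flight sources.
quotes = {
    "source_a": 14.50,
    "source_b": 14.50,
    "source_c": 14.47,
    "source_d": 14.50,
    "source_e": 13.90,
}

def majority_value(values):
    """Return the most common value and the fraction of sources reporting it."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / len(values)

value, support = majority_value(quotes.values())
print(f"Majority value: {value} (reported by {support:.0%} of sources)")
for source, reported in quotes.items():
    if reported != value:
        print(f"Inconsistent: {source} reports {reported}")
```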
Their analysis uncovered high inconsistency in the data, many low-quality sources, and difficulty in identifying any single source that users could rely on. They also found data sharing between sources, often of low-quality data. Even well-known, authoritative sources such as Google Finance and Orbitz, while generally accurate, are not perfect and therefore cannot be recommended as the "only" source users need to care about.
They also applied data fusion methods to these sources to see whether current techniques could resolve the value conflicts and find the truth. Think of data fusion as a set of algorithms you run against your data sets to learn more about their characteristics. There are many different data fusion methods for determining trustworthiness, which are described in detail in the article. The authors conclude that, using these methods, you can identify correct values 96% of the time on average. But they go on to say that it is essential to have accurate information on source trustworthiness; otherwise fusion accuracy can suffer. To see how critical trustworthiness is, consider how inaccurate information that has been copied or duplicated by reference can appear more authoritative simply because it occurs more frequently; decisions based on it could have real negative consequences.
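To illustrate the flavor of these techniques, here is a small sketch of trustworthiness-weighted fusion. It is not one of the paper's specific algorithms; the sources, claims, and update rule are simplified assumptions for illustration. Conflicting values are resolved by weighted vote, and each source's weight is then re-estimated from how often it agrees with the fused result.

```python
from collections import defaultdict

# claims[item] = {source: value}; toy data, not from the study
claims = {
    "AAPL_volume": {"s1": 100, "s2": 100, "s3": 90},
    "GOOG_volume": {"s1": 250, "s2": 240, "s3": 250},
    "MSFT_volume": {"s1": 80,  "s2": 75,  "s3": 80},
}

weights = {s: 1.0 for s in ("s1", "s2", "s3")}  # start with equal trust

for _ in range(5):  # a few rounds are enough for this toy example
    # 1. Fuse: per item, pick the value backed by the highest total source weight
    fused = {}
    for item, by_source in claims.items():
        votes = defaultdict(float)
        for source, value in by_source.items():
            votes[value] += weights[source]
        fused[item] = max(votes, key=votes.get)
    # 2. Re-estimate trust: fraction of a source's claims matching the fused values
    for source in weights:
        agree = sum(1 for item, by_source in claims.items()
                    if by_source.get(source) == fused[item])
        total = sum(1 for by_source in claims.values() if source in by_source)
        weights[source] = agree / total if total else 0.0

print("fused values:", fused)
print("estimated source trust:", weights)
```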
We can learn many things from the article. It provides excellent detail about the statistical methods the team used to measure the data and test its accuracy, and about how they probed the strengths and limitations of data fusion today. Most importantly, it offers a cautionary tale to keep in mind whenever you evaluate data sets: you need to understand the quality and accuracy of your data and, if there are discrepancies, how you will resolve them.
Predictive Potential: Singapore's Taxi and Traffic Patterns
Professor Daniela Rus studied 26,000 taxis and 10,000 road sensors in Singapore over one month, giving her 33 GB of data with which to model congestion across the country. In her analysis, she seeks to determine whether taxis, which make up only a fraction of the vehicles on the road, are a good sample of overall traffic, and whether that sample is unbiased.
To start, she compares the taxi and sensor data aggregated in 15-minute intervals, plotting the two patterns or "signals" on the same graph, with the taxi signal in red and the sensor signal in blue. When the signals are overlaid, you can see that the data match most closely during the congested periods of the day.
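As a rough illustration of that comparison, the sketch below uses hypothetical counts (not the study's data) to aggregate two signals into 15-minute bins, normalize them, and measure how closely they track each other.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamped observations for one road segment; the real study
# used a month of data from ~26,000 taxis and ~10,000 loop sensors.
rng = np.random.default_rng(0)
times = pd.date_range("2023-01-01", periods=24 * 4 * 7, freq="15min")
base = 50 + 40 * np.sin(np.linspace(0, 14 * np.pi, len(times))) ** 2
taxi_counts = pd.Series(base + rng.normal(0, 5, len(times)), index=times)
sensor_counts = pd.Series(base * 6 + rng.normal(0, 20, len(times)), index=times)

# Resample both signals into 15-minute bins and z-score them so their shapes
# can be compared even though taxis are only a fraction of all vehicles.
def zscore(s):
    return (s - s.mean()) / s.std()

aligned = pd.DataFrame({
    "taxi": zscore(taxi_counts.resample("15min").mean()),
    "sensor": zscore(sensor_counts.resample("15min").mean()),
}).dropna()

print("correlation between taxi and sensor signals:",
      round(aligned["taxi"].corr(aligned["sensor"]), 3))
```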
The next question is whether it is possible to predict general traffic patterns using the taxi and loop-sensor data. To get the answer, you need to sample one signal at different time intervals to determine which time model has the most predictive power. In other words, which interval allows you to learn the patterns in a correct, repeatable, and generalizable way? She starts with 12-hour intervals, but it is not until she uses 6- and 4-hour intervals that the data start to match well. At a granularity of every 15 minutes, instead of delivering greater accuracy, you begin to pick up random noise, so you have to be careful not to "overfit the data." You may recall from your statistics class that overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. In this case, it is the coarser 4- to 6-hour intervals that deliver a decent fit and show that the taxis serve as excellent surrogates for general traffic data.
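One way to think about the granularity question is to score each candidate interval on held-out data. The sketch below is illustrative only: the traffic pattern, noise level, and interval choices are invented, not taken from the study. A model that is just a per-bin average is learned from one noisy day and scored against a second day; bins that are too coarse miss the shape of the peaks, while bins fine enough to chase noise stop generalizing.

```python
import numpy as np

rng = np.random.default_rng(1)
minutes = np.arange(0, 24 * 60, 15)                              # a reading every 15 min
pattern = 100 + 80 * np.exp(-((minutes - 8 * 60) / 90) ** 2) \
              + 60 * np.exp(-((minutes - 18 * 60) / 120) ** 2)   # morning/evening peaks
train_day = pattern + rng.normal(0, 30, len(minutes))
test_day = pattern + rng.normal(0, 30, len(minutes))

for width in (12 * 60, 6 * 60, 4 * 60, 60, 15):                  # bin width in minutes
    bins = minutes // width
    # "model" = mean traffic per bin, learned from the training day only
    model = {b: train_day[bins == b].mean() for b in np.unique(bins)}
    predicted = np.array([model[b] for b in bins])
    rmse = np.sqrt(np.mean((predicted - test_day) ** 2))
    print(f"{width:4d}-minute bins: held-out RMSE = {rmse:.1f}")
```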
As she moves through her analysis, Professor Rus shows that the taxi data is a biased sample of overall traffic, but one whose bias is consistent and therefore correctable. It then becomes possible to determine the probability of the various traffic "states" using a Markov chain, a model that lets you compute the probability of each state in the system. This type of analysis allows for accurate prediction of congestion at different hours of the day across the entire country of Singapore, which may give transportation managers better information for managing traffic and even empower individual drivers to consider alternative routes.
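As a minimal sketch of the Markov-chain idea (the congestion states and observed sequence below are made up, not Professor Rus' data), you can estimate transition probabilities from an observed sequence of states and then derive each state's long-run probability.

```python
import numpy as np

# Discretized road conditions and a toy observed sequence of states
states = ["free_flow", "slow", "congested"]
observed = ["free_flow", "free_flow", "slow", "congested", "congested",
            "slow", "free_flow", "free_flow", "slow", "slow", "congested",
            "slow", "free_flow"]

# Count observed transitions between consecutive states
index = {s: i for i, s in enumerate(states)}
counts = np.zeros((len(states), len(states)))
for current, nxt in zip(observed, observed[1:]):
    counts[index[current], index[nxt]] += 1

# Row-normalize the counts to get transition probabilities P[i, j]
P = counts / counts.sum(axis=1, keepdims=True)

# The stationary distribution is the eigenvector of P transposed for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

print("transition matrix:\n", P.round(2))
for state, p in zip(states, stationary):
    print(f"long-run probability of {state}: {p:.2f}")
```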
Conclusion
The purpose of providing these two examples was to challenge you to begin thinking about the diversity of data sets and the types of questions you may wish to ask about your data.
Do not fall prey to thinking that merely by assembling large amounts of data you will derive more accurate insights faster. Without first establishing the trustworthiness of your sources, you may end up making decisions based on inaccurate data - the opposite of what big data is supposed to deliver.
Using big data for predictive purposes demands a careful analysis of the fit between the data set you are using as a proxy and the phenomenon you intend to predict. It is important to understand the limitations of these predictions and to build in double checks when major decisions will be made based on the data.