Landscape of Big Data: Context and Case Studies

The first of our Landscape of Big Data Series, this posting attempts to describe some of the power and potential of Big Data Analytics via a description of two examples.

The first one describes research on publicly available stock and flight data to determine accuracy and trustworthiness, and the second example looks at whether and how taxis may serve as probes that enable prediction of traffic in Singapore.

Data Trustworthiness

Not all big data problems are the same, so the first step in categorizing your data is to assess it against the 3 V's: Volume, Velocity, and Variety. To give you a sense of how variety can affect your analyses, and of the methodology that may be used in thinking about data, it is useful to refer to the SUNY and AT&T Labs research described in the article "Truth Finding on the Deep Web: Is the Problem Solved?" (Li, Dong, Meng, Lyons, and Srivastava).

The team used publicly available stock and flight data to ask:

  1. Were the data consistent across sources?
  2. Were the data accurate?
  3. Did the majority of sources provide the correct data?
  4. Was there an authoritative source that could be trusted?
  5. Were sources sharing and copying data from each other?

Their analysis uncovered a high degree of inconsistency in the data, many low-quality sources, and difficulty in identifying a single source that users could rely on. They also found that sources copied data from one another, and that the copied data were often of low quality. Even well-known authoritative sources such as Google Finance and Orbitz, while generally highly accurate, are not perfect and therefore cannot be recommended as the "only" source users need to care about.

They also applied data fusion methods to these sources to see whether current techniques could resolve the value conflicts and find the truth. Think of data fusion as a set of algorithms that you run against your data sets to reconcile conflicting values and learn more about the sources that reported them. The article describes many different data fusion methods for determining trustworthiness in detail. The authors conclude that, using these methods, you can identify correct values 96% of the time on average. But they go on to say that it is essential to have accurate information on source trustworthiness; otherwise fusion accuracy can be harmed. To get a sense of how critical it is to understand trustworthiness, you need only think about how inaccurate information that is duplicated by reference or copied could appear more authoritative simply because it shows up more often. Decisions based on such information could have a negative impact.
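To make the copying problem concrete, here is a minimal sketch of two simple fusion strategies: a plain majority vote and a vote weighted by source trust. It is not one of the methods evaluated in the article; the source names, prices, and trust scores are all invented for illustration.

```python
from collections import defaultdict

def majority_vote(claims):
    """Return the value reported by the largest number of sources."""
    counts = defaultdict(int)
    for value in claims.values():
        counts[value] += 1
    return max(counts, key=counts.get)

def weighted_vote(claims, trust, default_trust=0.5):
    """Return the value whose supporting sources have the highest total trust."""
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += trust.get(source, default_trust)
    return max(scores, key=scores.get)

# Hypothetical closing prices for one stock, as reported by five made-up sources.
# Three low-quality sources have copied the same wrong value from each other.
claims = {
    "source_a": 101.25,
    "source_b": 101.25,
    "copier_1": 99.80,
    "copier_2": 99.80,
    "copier_3": 99.80,
}

# Assumed trust scores; in practice these would have to be estimated.
trust = {
    "source_a": 0.95,
    "source_b": 0.90,
    "copier_1": 0.30,
    "copier_2": 0.30,
    "copier_3": 0.30,
}

naive_answer = majority_vote(claims)           # the copied value wins on frequency
trusted_answer = weighted_vote(claims, trust)  # the trusted sources outweigh the copiers
```

In this toy case the majority vote picks the copied value because it appears three times, while the trust-weighted vote picks the value from the two more reliable sources. The outcome depends entirely on the trust scores, which is why inaccurate trust estimates can hurt fusion accuracy.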

We can learn many things from the article. It provides excellent detail about the statistical methods the team used to measure the data and test its accuracy, and about how they tried to understand the strengths and limitations of data fusion today. Most importantly, it contains a cautionary tale to keep in mind whenever you evaluate data sets: you need to understand the quality and accuracy of your data and, if there are discrepancies, how you will resolve them.

Predictive Potential: Singapore's Taxi and Traffic Patterns

Professor Daniela Rus' study of the 26,000 taxis and 10,000 road sensors in Singapore over one month provided her with 33 GB of data with which to model congestion in the country. In her analysis, she seeks to determine whether taxis are a good sample, given that there are many other cars on the road, and whether they are an unbiased one.

To start with, she compares the taxi and sensor data aggregated in 15-minute intervals. The image below shows how she was able to plot these two separate patterns or "signals" on a graph, with the taxi data signal in red and the sensor data signal in blue. When they are laid on top of each other, you can see that the two signals match most closely during the congested times of day.
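The comparison described above can be sketched as follows: count events in 15-minute bins, then measure how closely the two binned signals track each other. The timestamps and sensor counts are invented, and using a Pearson correlation as the measure of agreement is an assumption of this sketch, not something the study is stated to have done.

```python
from collections import Counter

BIN_SECONDS = 15 * 60  # 15-minute aggregation interval

def counts_per_bin(timestamps, n_bins):
    """Count events per 15-minute bin; timestamps are seconds from the start."""
    counts = Counter(int(t // BIN_SECONDS) for t in timestamps)
    return [counts.get(i, 0) for i in range(n_bins)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical taxi pings on one road segment during one hour (seconds).
taxi_pings = [30, 400, 950, 1000, 1700, 1900, 2000, 2600, 2650, 2690, 3500]

# Hypothetical vehicle counts from a road sensor, already in 15-minute bins.
sensor_counts = [22, 29, 51, 12]

taxi_counts = counts_per_bin(taxi_pings, n_bins=4)
agreement = pearson(taxi_counts, sensor_counts)
```

A correlation close to 1 would indicate that the taxi signal rises and falls with the sensor signal even though the absolute counts differ.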

The next question is whether it is possible to predict general traffic patterns using the taxi and loop (road sensor) data. To get the answer, you need to sample one signal at different time intervals to determine which interval has the most predictive power. In other words, which time interval allows you to learn the patterns in a correct, repeatable, and generalizable way? She starts with 12-hour intervals, but it is not until she uses 6- and 4-hour intervals that the data start to match better. At a granularity of 15 minutes, instead of greater accuracy you begin to get random noise, so you have to be careful not to "overfit the data." You may recall from your statistics class that overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. In this case, it is the coarser 4-to-6-hour intervals that deliver a decent fit and indicate that the taxis serve as excellent surrogates for general traffic data.
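The trade-off between intervals that are too coarse and too fine can be illustrated with a toy example. The sketch below fits a profile to one "day" by averaging within bins of a given width, then checks how well that profile predicts a second day. The numbers are invented and are not from the study; the two days share the same underlying pattern but have different noise.

```python
def bin_means(series, width):
    """Replace each sample with the mean of its bin of `width` samples."""
    fitted = []
    for start in range(0, len(series), width):
        chunk = series[start:start + width]
        fitted.extend([sum(chunk) / len(chunk)] * len(chunk))
    return fitted

def mse(predicted, observed):
    """Mean squared error between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

# Toy traffic levels: the underlying pattern is 10, 10, 30, 30 repeated,
# with noise of +/-4 that differs between the two days.
train_day = [14, 6, 34, 26, 14, 6, 34, 26]
test_day = [6, 14, 26, 34, 6, 14, 26, 34]

# Width 1 is the finest granularity, width 8 collapses the day to one average.
errors = {
    width: mse(bin_means(train_day, width), test_day)
    for width in (1, 2, 8)
}
best_width = min(errors, key=errors.get)
```

Here the finest bins memorize the noise of the training day and predict the second day poorly, the single coarse bin misses the pattern altogether, and the intermediate width gives the lowest error.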

As she moves through her analysis, Professor Rus shows that the taxi data are a biased sample of overall traffic, but that the bias is consistent and therefore correctable. It is then possible to determine the probability of the various traffic "states." This can be done using a Markov chain, a probabilistic model in which the next state depends only on the current state, which allows you to compute the probability of each state in the system. This type of analysis allows congestion to be predicted at different hours of the day across all of Singapore, which may give transportation managers better information for traffic management and even empower individual drivers to consider alternative routes.
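As a rough illustration of the Markov chain idea, the sketch below estimates transition probabilities from an observed sequence of states and then projects the state probabilities forward. The two state names and the observed sequence are hypothetical; the source does not say how states were defined in the study.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequence):
    """Estimate P(next state | current state) from consecutive observations."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for state, following in counts.items()
    }

def step(distribution, transitions):
    """Advance a probability distribution over states by one time interval."""
    next_distribution = defaultdict(float)
    for state, prob in distribution.items():
        for nxt, p in transitions.get(state, {}).items():
            next_distribution[nxt] += prob * p
    return dict(next_distribution)

# Hypothetical states of one road segment over consecutive time intervals.
observed = ["free", "free", "congested", "congested",
            "free", "free", "free", "congested"]

transitions = estimate_transitions(observed)

# Start from a road known to be free-flowing and look two intervals ahead.
now = {"free": 1.0}
one_ahead = step(now, transitions)
two_ahead = step(one_ahead, transitions)
```

With this sequence, a free-flowing road has an estimated 40% chance of being congested in the next interval and about a 44% chance two intervals ahead. A real model would need many more states and far more data than this toy sequence.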

Conclusion

The purpose of providing these two examples was to challenge you to begin thinking about the diversity of data sets and the types of questions you may wish to ask about your data.

Do not fall prey to thinking that merely by assembling large amounts of data, you will derive more accurate insights faster. Without first establishing the trustworthiness of your sources, you may end up making decisions based on inaccurate data, which is the opposite of what big data is supposed to deliver!

Using big data for predictive purposes demands a careful analysis of the "data fit" between the data set you are using for predictive purposes and the one you intend to predict. It is important to understand the limitations of these predictions, and build in double checks when major decisions are being made based on the data.
