One Common Mistake Data Analysts Make and How to Avoid It
Image credit: BMJ 2015;350:h2750


One common mistake inexperienced analysts make when analysing data is assigning a special role to time. In my line of work, I see this a lot. Signs of chronophrenia, a term I just made up, include an unusual focus on calendar time: from line plots of multiple variables on the y-axis against time on the x-axis, to frequencies or averages by some meaningful time cut-off: monthly, quarterly, yearly and so on.

Often the analyst attempts to visually assess whether a shift over time exists. The hope is that the shift can be tied to a special cause known to the analyst (for instance, a change introduced at a known point in time) and that this would represent evidence for or against a hypothesis.
There are a number of drawbacks to this approach. Here I will list a few common ones:

  • How do we know whether a change over time is due to some special cause that happens at a known time, or whether it is due to normal variation?
  • Time does not possess magical properties: in most fields, time is seldom on the causal pathway. In other words, time does not cause anything. Yes, there are some exceptions; however, at best, time is a proxy measure for something we are unable to measure. Really, think about it: how many times can we think of time having caused something? Death? Not really. Trauma and the inability of cells to reproduce reliably are some of the underlying causes. An old engine breaking down? Think fatigue, structural failure, wear. Changing jobs? Think of complex social issues.

    Once you do fully understand a process, time plays no role
    Cleves et al.
  • Suppose we introduce some change at a known point in time. The analyst proceeds to compare, often visually, whether a slope change or a shift exists. This approach is limited unless we can somehow freeze everything else: other factors keep changing over the same period, so a visual shift cannot be attributed to the intervention alone.
  • Time is an open invitation to slice and dice the data until some interesting result is found. Think about it: we can always find a motivation for looking at things daily, weekly, monthly, quarterly, or yearly. These are, in fact, meaningful measures to a lot of businesses. Eventually, you are guaranteed to find something interesting or even statistically significant. But that does not mean it's true. Don't take my word for it: Tyler Vigen assembled a very humorous collection of spurious correlations, enough material to fill a whole book. You don't have to buy his book, though I would encourage you to do so; some examples are available on Vigen's website.
  • Visualization pioneer Edward Tufte has a very effective visual demonstration of streak-guessing over time (adapted from the original). See what happens when we randomize the time order.
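The slice-and-dice trap above is easy to demonstrate for yourself. Here is a minimal simulation sketch (all numbers are invented for illustration): generate many series of pure noise, regress each against time, and count how many look "significantly" trending. Roughly 5% will, even though no trend exists by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, n_points = 1000, 100
t = np.arange(n_points)

# 1000 series of pure noise: no time trend exists by construction.
noise = rng.standard_normal((n_series, n_points))

# Pearson correlation of each series with time, and the usual t-statistic.
t_std = (t - t.mean()) / t.std()
r = (noise - noise.mean(axis=1, keepdims=True)) @ t_std / (n_points * noise.std(axis=1))
t_stat = r * np.sqrt((n_points - 2) / (1 - r**2))

# At alpha = 0.05 (two-sided critical t ~ 1.98 for n = 100), about 5% of
# pure-noise series will appear to be "significantly" trending over time.
false_positives = int(np.sum(np.abs(t_stat) > 1.984))
print(f"{false_positives} of {n_series} noise series show a 'significant' time trend")
```

Multiply this by every daily/weekly/monthly/quarterly cut an analyst is tempted to try, and spurious "findings" become all but guaranteed.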

Enough ranting against time. After all, time-to-event analysis is one of my areas of expertise. Let me play devil's advocate and list some arguments in favour of time:

  • When we do not understand a process, time is often a good proxy for something we are unable to measure.
  • We can smooth over time to detect trends. I'm a fan of LOESS for its flexibility.
  • Survival analysis (time-to-event analysis) is mostly concerned with the rate at which things happen: is a certain group reaching an event faster than another? Survival analysis can effectively deal with survivorship bias.
  • When we introduce a change at a given time point, there are methods that try to deal with it, such as interrupted time series or change point detection (these can go by different names). For an overview, see Kontopantelis. Change point detection is an active area of research.
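To make the interrupted time series idea concrete, here is a minimal segmented-regression sketch in numpy (the data, change point, and effect size are all invented for illustration): we model an intercept, the pre-existing trend, and a step term for the intervention, rather than eyeballing a shift.

```python
import numpy as np

rng = np.random.default_rng(42)
n, change_point = 60, 30
t = np.arange(n)

# Simulated monthly metric: baseline trend plus a level shift of +5
# after an intervention at month 30 (numbers are made up).
y = 10 + 0.2 * t + 5 * (t >= change_point) + rng.normal(0, 1, n)

# Segmented (interrupted time series) regression design matrix:
# column 1: intercept, column 2: trend, column 3: post-intervention step.
X = np.column_stack([np.ones(n), t, (t >= change_point).astype(float)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, trend, level_shift = coef
print(f"estimated level shift at the intervention: {level_shift:.2f}")
```

The estimated step lands close to the true +5 because the trend is modelled explicitly; a richer model would add a post-intervention slope term and account for autocorrelation in the errors.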


If you are embarking on an analysis, don't jump on time as your first go-to measure. Think carefully about the problem and try first to identify all factors that may affect the response of interest. Explore those first: plot them against the response, plot them against each other. Try to learn as much as you can about your problem without resorting to time immediately. When you do look at time, remember there are challenges unique to time that complicate things (autocorrelation and censoring, to name a couple).
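Autocorrelation, in particular, is cheap to check before trusting any time plot. A quick sketch (hypothetical data): a random walk, which is nothing but cumulated noise, is almost perfectly autocorrelated, violating the independence assumption behind standard regression and behind eyeball trend detection.

```python
import numpy as np

rng = np.random.default_rng(7)

def lag1_autocorr(x):
    """Lag-1 autocorrelation: correlation of a series with itself shifted by one step."""
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

white_noise = rng.standard_normal(500)
random_walk = np.cumsum(white_noise)  # cumulative sum of the very same noise

print(f"white noise lag-1 autocorrelation: {lag1_autocorr(white_noise):.2f}")  # near 0
print(f"random walk lag-1 autocorrelation: {lag1_autocorr(random_walk):.2f}")  # near 1
```

A lag-1 autocorrelation near 1 is a warning that successive points carry almost no independent information, so apparent trends and shifts are far less trustworthy than they look.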

Remember: "Once you do fully understand a process, time plays no role" ... or almost.

Thomas Speidel, P.Stat., is a Statistician working as Data Scientist for Suncor Energy in Calgary, Alberta. He spent nearly ten years working in cancer research before moving to the "sandbox" of the energy industry. Thomas is often seen writing and commenting on issues of statistical literacy on LinkedIn, Twitter, and several blogs.

Techniques have been developed to deal with this issue and save the day without getting rid of time, e.g. https://google.github.io/CausalImpact/CausalImpact.html / http://research.google.com/pubs/pub41854.html.

Good article. Implicit assumption of time as a causal factor when there's not really much reason to believe it to be one is an extremely common error.

Great post -- I think this is partly because most of the focus in forecasting classes is on techniques like ARIMA which really just look at the effect of "time." I think it's harder to find techniques in common use that allow us to look at the effect of other variables OVER time. What do you think of techniques like ARIMAX (dynamic regression)?

@Thomas: My favourite is something like, "Sales increased by 50% in January, but fell by only 33% in February." - the sales version of Simpson's paradox

Thomas- I'm a Developmental Psychologist. Time IS the variable of utmost importance. This would be true in many cases of Pediatrics and Geriatrics as well. These are called Developmental Sciences.
