One Common Mistake Data Analysts Make and How to Avoid It
Image credit: BMJ 2015;350:h2750


One common mistake inexperienced analysts make when analysing data is assigning a special role to time. In my line of work, I see this a lot. Signs of chronophrenia, a term I just made up, include an unusual focus on calendar time: from line plots of multiple variables on the y-axis against time on the x-axis, to frequencies or averages by some meaningful time cut-off: monthly, quarterly, yearly and so on.

Often the analyst attempts to visually assess whether a shift over time exists. The hope is that the shift can be tied to a special cause known to the analyst (for instance, a change introduced at a known point in time) and that this would represent evidence for or against a hypothesis.
There are a number of drawbacks to this approach. Here I will list a few common ones:

  • How do we know whether a change over time is due to some special cause that happens at a known time, or whether it is due to normal variation?
  • Time does not possess magical properties: in most fields, time is seldom on the causal pathway. In other words, time does not cause anything. Yes, there are some exceptions; however, at best, time is a proxy measure for something we are unable to measure. Really, think about it: how many times can we think of time having caused something? Death? Not really. Trauma and the inability of cells to reproduce reliably are some of the underlying causes. An old engine breaking down? Think fatigue, structural failure, wear. Changing jobs? Think of complex social issues.

    Once you do fully understand a process, time plays no role
    Cleves et al.
  • Suppose we introduce some change at a known point in time. The analyst proceeds to compare, often visually, whether a slope change or a shift exists. This approach is limited unless we can somehow freeze everything else: other factors keep changing over the same period, so a visual shift cannot be attributed to the intervention alone.
  • Time is an open invitation to slice and dice the data until some interesting result is found. Think about it: we can always find a motivation for looking at things daily, weekly, monthly, quarterly, or yearly. These are, in fact, meaningful measures to a lot of businesses. Eventually, you are guaranteed to find something interesting or even statistically significant. But that does not mean it's true. Don't take my word for it: Tyler Vigen assembled a very humorous collection of spurious correlations, enough material to fill a whole book. You don't have to buy his book, though I would encourage you to do so; some examples are available on Vigen's website.
  • Visualization pioneer Edward Tufte has a very effective visual demonstration of streak-guessing over time (adapted from the original). See what happens when we randomize the time order.
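The slice-and-dice trap above is easy to demonstrate for yourself. Here is a minimal simulation sketch (all numbers are invented for illustration): generate many series of pure noise, regress each against time, and count how many look "significantly" trending. Roughly 5% will, even though no trend exists by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, n_points = 1000, 100
t = np.arange(n_points)

# 1000 series of pure noise: no time trend exists by construction.
noise = rng.standard_normal((n_series, n_points))

# Pearson correlation of each series with time, and the usual t-statistic.
t_std = (t - t.mean()) / t.std()
r = (noise - noise.mean(axis=1, keepdims=True)) @ t_std / (n_points * noise.std(axis=1))
t_stat = r * np.sqrt((n_points - 2) / (1 - r**2))

# At alpha = 0.05 (two-sided critical t ~ 1.98 for n = 100), about 5% of
# pure-noise series will appear to be "significantly" trending over time.
false_positives = int(np.sum(np.abs(t_stat) > 1.984))
print(f"{false_positives} of {n_series} noise series show a 'significant' time trend")
```

Multiply this by every daily/weekly/monthly/quarterly cut an analyst is tempted to try, and spurious "findings" become all but guaranteed.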

Enough ranting against time. After all, time-to-event analysis is one of my areas of expertise. Let me play devil's advocate and list some arguments in favour of time:

  • When we do not understand a process, time is often a good proxy for something we are unable to measure.
  • We can smooth over time to detect trends. I'm a fan of LOESS for its flexibility.
  • Survival analysis (time-to-event analysis) is mostly concerned with the rate at which things happen: is a certain group reaching an event faster than another? Survival analysis can effectively deal with survivorship bias.
  • When we introduce a change at a given time point, there are methods that try to deal with it, such as interrupted time series or change point detection (these can go by different names). For an overview, see Kontopantelis. Change point detection is an active area of research.
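To make the interrupted time series idea concrete, here is a minimal segmented-regression sketch in numpy (the data, change point, and effect size are all invented for illustration): we model an intercept, the pre-existing trend, and a step term for the intervention, rather than eyeballing a shift.

```python
import numpy as np

rng = np.random.default_rng(42)
n, change_point = 60, 30
t = np.arange(n)

# Simulated monthly metric: baseline trend plus a level shift of +5
# after an intervention at month 30 (numbers are made up).
y = 10 + 0.2 * t + 5 * (t >= change_point) + rng.normal(0, 1, n)

# Segmented (interrupted time series) regression design matrix:
# column 1: intercept, column 2: trend, column 3: post-intervention step.
X = np.column_stack([np.ones(n), t, (t >= change_point).astype(float)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, trend, level_shift = coef
print(f"estimated level shift at the intervention: {level_shift:.2f}")
```

The estimated step lands close to the true +5 because the trend is modelled explicitly; a richer model would add a post-intervention slope term and account for autocorrelation in the errors.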


If you are embarking on an analysis, don't jump on time as your first go-to measure. Think carefully about the problem and try first to identify all factors that may affect the response of interest. Explore those first: plot them against the response, plot them against each other. Try to learn as much as you can about your problem without resorting to time immediately. When you do look at time, remember there are challenges unique to time that complicate things (autocorrelation and censoring, to name a couple).
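Autocorrelation, in particular, is cheap to check before trusting any time plot. A quick sketch (hypothetical data): a random walk, which is nothing but cumulated noise, is almost perfectly autocorrelated, violating the independence assumption behind standard regression and behind eyeball trend detection.

```python
import numpy as np

rng = np.random.default_rng(7)

def lag1_autocorr(x):
    """Lag-1 autocorrelation: correlation of a series with itself shifted by one step."""
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

white_noise = rng.standard_normal(500)
random_walk = np.cumsum(white_noise)  # cumulative sum of the very same noise

print(f"white noise lag-1 autocorrelation: {lag1_autocorr(white_noise):.2f}")  # near 0
print(f"random walk lag-1 autocorrelation: {lag1_autocorr(random_walk):.2f}")  # near 1
```

A lag-1 autocorrelation near 1 is a warning that successive points carry almost no independent information, so apparent trends and shifts are far less trustworthy than they look.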

Remember: "Once you do fully understand a process, time plays no role" ... or almost.

Thomas Speidel, P.Stat., is a Statistician working as Data Scientist for Suncor Energy in Calgary, Alberta. He spent nearly ten years working in cancer research before moving to the "sandbox" of the energy industry. Thomas is often seen writing and commenting on issues of statistical literacy on LinkedIn, Twitter, and several blogs.

Techniques have been developed to deal with this issue and save the day without getting rid of time, e.g. https://google.github.io/CausalImpact/CausalImpact.html / http://research.google.com/pubs/pub41854.html.

Good article. Implicit assumption of time as a causal factor when there's not really much reason to believe it to be one is an extremely common error.

Great post -- I think this is partly because most of the focus in forecasting classes is on techniques like ARIMA which really just look at the effect of "time." I think it's harder to find techniques in common use that allow us to look at the effect of other variables OVER time. What do you think of techniques like ARIMAX (dynamic regression)?

@Thomas: My favourite is something like, "Sales increased by 50% in January, but fell by only 33% in February." - the sales version of Simpson's paradox

Thomas- I'm a Developmental Psychologist. Time IS the variable of utmost importance. This would be true in many cases of Pediatrics and Geriatrics as well. These are called Developmental Sciences.
