Cowboy Data Science

Meme from Steve Temple

The opposite of "cowboy coding" is having a solid process, which includes unit testing. Data science has matured to the point that we should adopt unit testing as standard practice and start to expect it.

The challenges are:

  • The first phase of any data science project is data exploration, and unit testing is inappropriate for that phase. The problem is that the line between exploration and the next phase, analysis, is blurry. Code developed to support exploration usually makes its way into analysis, and when subsequent questions are asked, this once-exploratory code gets tweaked and retweaked, with a real risk of breaking original assumptions and introducing bugs. Overcoming this challenge is mostly a matter of discipline: changing the attitude from "I'm a data scientist, not a unit tester" to "Can I validly claim I am still in the exploratory phase?"
  • Many tools do not currently have unit test frameworks, such as Excel, Hive, and all those various command line tools. Other interactive tools such as IPython Notebook and the Spark Shell REPL that support data exploration well do have unit test frameworks available for the underlying languages (Python and Scala, respectively), but as mentioned above, the momentum of once having started exploration in one of these tools leads data scientists down the road of never implementing unit tests (until possibly the phase of integration with true production code). R and Cascading are notable exceptions in the Data Science universe that do have unit testing frameworks available.
  • Even if tools do have unit testing frameworks, putting them under a common build or other test execution/automation framework is challenging. For example, for R code, one could use a JVM implementation of R in order to get it executing under the Maven umbrella, but that introduces its own set of complications.
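One low-friction way to apply that discipline: the moment an exploratory snippet starts being reused for analysis, extract it into a plain function and pin its behavior with a test. A minimal Python sketch, where `parse_log_line` is a hypothetical helper standing in for real exploration code:

```python
# Sketch: promoting a once-exploratory snippet into testable code.
# parse_log_line() is a hypothetical helper, not from any real project.

def parse_log_line(line):
    """Split 'timestamp<TAB>user<TAB>event' into a dict.

    Originally an inline notebook cell; extracting it into a function
    makes its assumption (exactly three tab-separated fields) explicit,
    so a later tweak that breaks the assumption fails loudly.
    """
    timestamp, user, event = line.rstrip("\n").split("\t")
    return {"timestamp": timestamp, "user": user, "event": event}

def test_parse_log_line():
    row = parse_log_line("2014-01-01T00:00:00\talice\tlogin\n")
    assert row == {"timestamp": "2014-01-01T00:00:00",
                   "user": "alice", "event": "login"}

test_parse_log_line()
```

The test is just a plain function with assertions, so it can run under any Python test runner once the code leaves the notebook.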

A good data science unit test loads known data, executes the transformations and algorithms, and compares the output against expected results. If, after several iterations of tweaking code to answer new questions, the actual results of various intermediate processing stages diverge from those expected results, the unit tests flag the divergence, preventing erroneous results from being reported.
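A minimal sketch of such a test in plain Python, where `clean_and_average` is a hypothetical pipeline stage standing in for real transformation code:

```python
# Sketch of a data science unit test: known input, pinned expected output.
# clean_and_average() is a hypothetical intermediate processing stage.

def clean_and_average(readings):
    """Drop None and negative sensor readings, then average the rest."""
    valid = [r for r in readings if r is not None and r >= 0]
    return sum(valid) / len(valid) if valid else 0.0

def test_clean_and_average_on_known_data():
    # Known input fixture containing the kinds of dirt real data has.
    readings = [10.0, None, 30.0, -5.0, 20.0]
    # Pinned expected result: if a later tweak silently changes the
    # cleaning rules, the actual value diverges and this test flags it.
    assert clean_and_average(readings) == 20.0

test_clean_and_average_on_known_data()
```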

Plus there are all the other usual benefits of unit tests:

  • Validates from the outset that the processing and algorithms are behaving as expected
  • Serves as documentation, showing readers of the code what the expected inputs and outputs are
  • Provides confidence during refactoring

I wonder if the best way to sell unit testing to data scientists is to focus on its ability to "reduce cognitive load" rather than to verify correctness or enable refactoring: I find that unit testing helps in diagnosing problems with complex code because it lets me eliminate distracting uncertainties and zero in on where the problems must be.


Good article. I think one of the biggest hurdles is cultural, because many data scientists come from domains where analysis work is all ad hoc and nothing ever 'goes into production' as a corporate IT department would define it. That's been a major challenge for my team; we're a collection of mathematicians, statisticians, and data analysts who have never written production code, and as such, unit testing (and even the whole notion of rigorous, QA-driven software testing) is a foreign concept. Some of my team didn't even know how to perform error handling in R, because they never needed to learn it when doing desktop analytics. In general, I think it would be helpful to have more education for aspiring data scientists on typical IT processes because, like it or not, that's the world most of them are going to live in outside of academia.


More articles by Michael Malak
