Cowboy Data Science

Meme from Steve Temple

The opposite of "cowboy coding" is having a solid process, which includes unit testing. Data science has matured to the point that we should adopt unit testing as standard practice and start to expect it.

The challenges are:

  • The first phase of any data science project is data exploration, and unit testing is inappropriate for that phase. The problem is that the line between exploration and the next phase, analysis, is blurry. Code developed to support exploration usually makes its way into analysis, and when subsequent questions are asked, this once-exploratory code gets tweaked and retweaked, with a real risk of breaking original assumptions and introducing bugs. Overcoming this challenge is mostly a matter of discipline: changing the attitude from "I'm a data scientist, not a unit tester" to "Can I validly claim I am still in the exploratory phase?"
  • Many tools do not currently have unit test frameworks, such as Excel, Hive, and all those various command line tools. Other interactive tools such as IPython Notebook and the Spark Shell REPL that support data exploration well do have unit test frameworks available for the underlying languages (Python and Scala, respectively), but as mentioned above, the momentum of once having started exploration in one of these tools leads data scientists down the road of never implementing unit tests (until possibly the phase of integration with true production code). R and Cascading are notable exceptions in the Data Science universe that do have unit testing frameworks available.
  • Even if tools do have unit testing frameworks, putting them under a common build or other test execution/automation framework is challenging. For example, for R code, one could use a JVM implementation of R in order to get it executing under the Maven umbrella, but that introduces its own set of complications.
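One low-friction way to apply that discipline: the moment an exploratory snippet starts being reused for analysis, extract it into a plain function and pin its behavior with a test. A minimal Python sketch, where `parse_log_line` is a hypothetical helper standing in for real exploration code:

```python
# Sketch: promoting a once-exploratory snippet into testable code.
# parse_log_line() is a hypothetical helper, not from any real project.

def parse_log_line(line):
    """Split 'timestamp<TAB>user<TAB>event' into a dict.

    Originally an inline notebook cell; extracting it into a function
    makes its assumption (exactly three tab-separated fields) explicit,
    so a later tweak that breaks the assumption fails loudly.
    """
    timestamp, user, event = line.rstrip("\n").split("\t")
    return {"timestamp": timestamp, "user": user, "event": event}

def test_parse_log_line():
    row = parse_log_line("2014-01-01T00:00:00\talice\tlogin\n")
    assert row == {"timestamp": "2014-01-01T00:00:00",
                   "user": "alice", "event": "login"}

test_parse_log_line()
```

The test is just a plain function with assertions, so it can run under any Python test runner once the code leaves the notebook.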

A good data science unit test loads known data, executes the transformations and algorithms, and compares the output against expected results. If, after several iterations of tweaking code to answer new questions, the actual results of various intermediate processing stages diverge from those expected results, the unit tests flag the divergence, preventing erroneous results from being reported.
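A minimal sketch of such a test in plain Python, where `clean_and_average` is a hypothetical pipeline stage standing in for real transformation code:

```python
# Sketch of a data science unit test: known input, pinned expected output.
# clean_and_average() is a hypothetical intermediate processing stage.

def clean_and_average(readings):
    """Drop None and negative sensor readings, then average the rest."""
    valid = [r for r in readings if r is not None and r >= 0]
    return sum(valid) / len(valid) if valid else 0.0

def test_clean_and_average_on_known_data():
    # Known input fixture containing the kinds of dirt real data has.
    readings = [10.0, None, 30.0, -5.0, 20.0]
    # Pinned expected result: if a later tweak silently changes the
    # cleaning rules, the actual value diverges and this test flags it.
    assert clean_and_average(readings) == 20.0

test_clean_and_average_on_known_data()
```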

Plus there are all the other usual benefits of unit tests:

  • Validates from the outset that the processing and algorithms are behaving as expected
  • Serves as documentation, showing readers of the code what the expected inputs and outputs are
  • Provides confidence during refactoring

I wonder if the best way to sell unit testing to data scientists is to focus on its ability to "reduce cognitive load" rather than to verify correctness or enable refactoring: I find that unit testing helps in diagnosing problems with complex code because it lets me eliminate distracting uncertainties and zero in on where the problems must be.


Good article. I think one of the biggest hurdles is cultural, because many data scientists come from domains where analysis work is all ad hoc and nothing ever 'goes into production' as a corporate IT department would define it. That's been a major challenge for my team; we're a collection of mathematicians, statisticians, and data analysts who have never written production code, and as such, unit testing (and even the whole notion of rigorous, QA-driven software testing) is a foreign concept. Some of my team didn't even know how to perform error handling in R, because they never needed to learn it when doing desktop analytics. In general, I think it would be helpful to have more education for aspiring data scientists on typical IT processes because, like it or not, that's the world most of them are going to live in outside of academia.


More articles by Michael Malak
