Data Science Fails


Update: see the code at https://github.com/jonneff/GirlGamers

Sometimes I look at my projects from years past and think, "Wow! That sucks!" As in, I really didn't know what I was doing, or the model didn't perform well, or both. I think this is a good thing. It means I'm getting better.

Case in point: four years ago I worked on a natural language processing (NLP) project involving about 1 TB of Reddit data. It was one of those projects you do to show your capabilities to a potential employer. The goal was to build a model that would predict whether a comment would be voted up or down. REALLY up or down: I used only the top and bottom 3% of comments ranked by up-votes or down-votes. I used Spark, which I knew well at the time, and scaled up a model that someone else had built for a much smaller amount of Reddit data. I understood the technology, I had a lot of labeled data, and I had a model that was already proven to work. What could go wrong?
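The labeling scheme described above, keeping only the extreme tails of the vote distribution and throwing away the middle, can be sketched in a few lines. This is a toy illustration, not the original pipeline; the 3% tail width matches the post, but the score distribution and helper names are made up:

```python
# Toy sketch: turn raw comment vote scores into binary labels by keeping
# only the top and bottom 3% of the score distribution, as in the post.

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already-sorted list (p in [0, 100])."""
    k = int(round(p / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def label_extremes(scores, tail=3.0):
    """Return (index, label) pairs: 1 for top-tail comments, 0 for
    bottom-tail comments. Everything in the middle is dropped."""
    s = sorted(scores)
    lo = percentile(s, tail)
    hi = percentile(s, 100 - tail)
    labeled = []
    for i, score in enumerate(scores):
        if score >= hi:
            labeled.append((i, 1))   # heavily up-voted
        elif score <= lo:
            labeled.append((i, 0))   # heavily down-voted
    return labeled

scores = list(range(-50, 51))          # fake vote-score distribution
labeled = label_extremes(scores, tail=3.0)
print(len(labeled), "of", len(scores))  # only the extremes survive
```

Note how aggressive this filter is: 94% of the comments never make it into the training set, which keeps the labels clean at the cost of throwing away most of the terabyte.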

As it turns out, a LOT could go wrong. For one thing, there are cool statistical algorithms that work well on small data sets and SHOULD scale, but don't. Time and again, I had to reinvent my approach to scale up. Spark behaved badly and was hard to debug. I had an unavoidable huge join right in the middle of processing, and that join swamped the feeble network cards on the AWS instances I was using. Fun fact: it's not just map-reduce, it's map-SHUFFLE-reduce, and the shuffle can kill you. My biggest failure was simple: I just didn't understand the bias-variance tradeoff. As a result, I had a model that drastically underfit my data. I fed it data until the cows came home and the results just got worse. I eventually got hired anyway, somehow, but not thanks to my model, which had an accuracy of just 51%.
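The "map-SHUFFLE-reduce" point is easy to see in a toy single-machine sketch (plain Python, not Spark): the shuffle is the phase that regroups every mapper's output by key, and on a real cluster that regrouping is exactly the data that has to cross the network between workers.

```python
from collections import defaultdict

# Toy map-shuffle-reduce word count. On a cluster, the shuffle phase is
# where (key, value) pairs travel to the worker that owns each key --
# the network transfer that swamped the AWS instances in the story above.

def map_phase(lines):
    # Map: emit (word, 1) for every word; no grouping happens here.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key. This is the hidden middle step,
    # and the one that moves data between machines in a real job.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into the final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

A big join is a worst case for this: both sides of the join get shuffled by the join key, so the volume crossing the network can approach the size of the inputs themselves.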

For better or worse, I'm the kind of guy who can't let things go, so I recently went back to my NLP problem to see if I could do better. For one thing, I narrowed the problem down to a single Reddit topic. What made me think that a single model could do a good job on topics as different as Girl Gamers and Politics? Instead of a logistic regression with, say, 8 variables, I structured it as a language model plus a classifier built on recurrent neural networks. I started with a pre-trained language model, used inductive transfer learning to fine-tune it on the Reddit Girl Gamers data set, then used the language model's encoder in a text classifier that I trained on the up- and down-votes. The result? A model that was about 74% accurate. Not great, but not bad by NLP standards, and a heck of a lot better than 51%. The original project took weeks; the re-do took an afternoon.
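The three-stage recipe above (pre-trained language model → fine-tune on the target corpus → reuse the encoder in a classifier) is the ULMFiT pattern. A minimal sketch with the fastai library might look like the following; the DataFrame columns, epoch counts, and learning rates are illustrative assumptions, not the settings from the actual project:

```python
# ULMFiT-style sketch with fastai v2. Assumes a DataFrame `df` with a
# 'text' column (comment text) and a binary 'label' column built from
# the up/down votes. Illustrative only -- not the original settings.
from fastai.text.all import *

# Stages 1-2: start from a pre-trained AWD-LSTM language model and
# fine-tune it on the target corpus (the Girl Gamers comments).
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm_learn = language_model_learner(dls_lm, AWD_LSTM)
lm_learn.fine_tune(4, 1e-2)
lm_learn.save_encoder('ft_encoder')

# Stage 3: build a classifier on the vote labels, load the fine-tuned
# encoder into it, then train the classifier head.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM)
clf_learn.load_encoder('ft_encoder')
clf_learn.fine_tune(4, 1e-2)
```

The key move is `save_encoder`/`load_encoder`: the classifier inherits everything the language model learned about how people write in that subreddit, so the vote labels only have to teach it the final decision, not the language itself.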

Data science is hard. Data engineering is hard. AI is hard. You can get great results on little toy problems and think you know what you're doing. Then you inherit a huge, unwashed data set with inconsistencies, incorrect labels, missing data, and stuff that is never discussed in MOOCs. The reality is that most of us are going to have a LOT of failures. So what are you gonna do about it? You can bury each failure, or you can dig up the corpse, conduct an autopsy, and try to learn from your mistakes. If you want to be successful in a difficult field, you have to be willing to fail. That's the growth mindset.

