Data Science Fails


Update: see the code at https://github.com/jonneff/GirlGamers

Sometimes I look at my projects from years past and think, "Wow! That sucks!" As in, I really didn't know what I was doing, or the model didn't perform well, or both. I think this is a good thing. It means I'm getting better.

Case in point: four years ago I worked on a natural language processing (NLP) project involving about 1 TB of Reddit data. It was one of those projects you do to show your capabilities to a potential employer. The goal was to build a model that would predict whether a comment would be voted up or down. REALLY up or down: I used only the top and bottom 3% of comments ranked by up-votes or down-votes. I used Spark, which I knew well at the time, and scaled up a model that someone else had built for a much smaller amount of Reddit data. I understood the technology, I had a lot of labeled data, and I had a model that was already proven to work. What could go wrong?
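The labeling scheme described above, keeping only the extreme tails of the vote distribution and throwing away the middle, can be sketched in a few lines. This is a toy illustration, not the original pipeline; the 3% tail width matches the post, but the score distribution and helper names are made up:

```python
# Toy sketch: turn raw comment vote scores into binary labels by keeping
# only the top and bottom 3% of the score distribution, as in the post.

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already-sorted list (p in [0, 100])."""
    k = int(round(p / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def label_extremes(scores, tail=3.0):
    """Return (index, label) pairs: 1 for top-tail comments, 0 for
    bottom-tail comments. Everything in the middle is dropped."""
    s = sorted(scores)
    lo = percentile(s, tail)
    hi = percentile(s, 100 - tail)
    labeled = []
    for i, score in enumerate(scores):
        if score >= hi:
            labeled.append((i, 1))   # heavily up-voted
        elif score <= lo:
            labeled.append((i, 0))   # heavily down-voted
    return labeled

scores = list(range(-50, 51))          # fake vote-score distribution
labeled = label_extremes(scores, tail=3.0)
print(len(labeled), "of", len(scores))  # only the extremes survive
```

Note how aggressive this filter is: 94% of the comments never make it into the training set, which keeps the labels clean at the cost of throwing away most of the terabyte.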

As it turns out, a LOT could go wrong. For one thing, there are cool statistical algorithms that work well on small data sets and SHOULD scale, but don't. Time and again, I had to reinvent my approach to scale up. Spark behaved badly and was hard to debug. I had an unavoidable huge join right in the middle of processing, and that join swamped the feeble network cards on the AWS instances I was using. Fun fact: it's not just map-reduce, it's map-SHUFFLE-reduce, and the shuffle can kill you. My biggest failure was simple: I just didn't understand the bias-variance tradeoff. As a result, I had a model that drastically underfit my data. I fed it data until the cows came home and the results just got worse. I eventually got hired anyway, somehow, but not thanks to my model, which had an accuracy of just 51%.
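The "map-SHUFFLE-reduce" point is easy to see in a toy single-machine sketch (plain Python, not Spark): the shuffle is the phase that regroups every mapper's output by key, and on a real cluster that regrouping is exactly the data that has to cross the network between workers.

```python
from collections import defaultdict

# Toy map-shuffle-reduce word count. On a cluster, the shuffle phase is
# where (key, value) pairs travel to the worker that owns each key --
# the network transfer that swamped the AWS instances in the story above.

def map_phase(lines):
    # Map: emit (word, 1) for every word; no grouping happens here.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key. This is the hidden middle step,
    # and the one that moves data between machines in a real job.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into the final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

A big join is a worst case for this: both sides of the join get shuffled by the join key, so the volume crossing the network can approach the size of the inputs themselves.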

For better or worse, I'm the kind of guy who can't let things go, so I recently went back to my NLP problem to see if I could do better. For one thing, I narrowed the problem down to a single Reddit topic. What made me think that a single model could do a good job on topics as different as Girl Gamers and Politics? Instead of a logistic regression with, say, 8 variables, I structured it as a language model plus a classifier built on recurrent neural networks. I started with a pre-trained language model, used inductive transfer learning to fine-tune it on the Reddit Girl Gamers data set, then used the language model's encoder in a text classifier that I trained on the up- and down-votes. The result? A model that was about 74% accurate. Not great, but not bad by NLP standards, and a heck of a lot better than 51%. The original project took weeks; the re-do took an afternoon.
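The three-stage recipe above (pre-trained language model → fine-tune on the target corpus → reuse the encoder in a classifier) is the ULMFiT pattern. A minimal sketch with the fastai library might look like the following; the DataFrame columns, epoch counts, and learning rates are illustrative assumptions, not the settings from the actual project:

```python
# ULMFiT-style sketch with fastai v2. Assumes a DataFrame `df` with a
# 'text' column (comment text) and a binary 'label' column built from
# the up/down votes. Illustrative only -- not the original settings.
from fastai.text.all import *

# Stages 1-2: start from a pre-trained AWD-LSTM language model and
# fine-tune it on the target corpus (the Girl Gamers comments).
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm_learn = language_model_learner(dls_lm, AWD_LSTM)
lm_learn.fine_tune(4, 1e-2)
lm_learn.save_encoder('ft_encoder')

# Stage 3: build a classifier on the vote labels, load the fine-tuned
# encoder into it, then train the classifier head.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM)
clf_learn.load_encoder('ft_encoder')
clf_learn.fine_tune(4, 1e-2)
```

The key move is `save_encoder`/`load_encoder`: the classifier inherits everything the language model learned about how people write in that subreddit, so the vote labels only have to teach it the final decision, not the language itself.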

Data science is hard. Data engineering is hard. AI is hard. You can get great results on little toy problems and think you know what you're doing. Then you inherit a huge, unwashed data set with inconsistencies, incorrect labels, missing data, and stuff that is never discussed in MOOCs. The reality is that most of us are going to have a LOT of failures. So what are you gonna do about it? You can bury each failure, or you can dig up the corpse, conduct an autopsy, and try to learn from your mistakes. If you want to be successful in a difficult field, you have to be willing to fail. That's the growth mindset.

