Five Common Mistakes Made by Data Scientists

I often think about data science as an iterative process that revolves around a business question. This question is integral in determining which data sources to use, how to prepare the data, which modeling techniques to apply, and how to measure the success of the recommended actions. Along the way, there are numerous traps waiting to trip up even the most seasoned data scientist. Below are five of the most common mistakes made in the data science process.

1. Forgetting “The Question”.

Remember, the entire process is built around the business question. Your team may spend weeks building a beautifully complex collection of models, but it will mean nothing if it fails to address the actual question. Often, disconnection from the business context results in lost time, trust, and sponsorship. This is further complicated by the fact that the business question itself is rarely fully formed when it reaches the data science team. Remain closely connected to the end users and project sponsors. These folks are your allies, guiding you away from roadblocks and providing crucial context. Every conversation with them is an opportunity to ask clarifying questions and look for the clues that will help further define the business question. Approaching the process this way greatly increases the chance of creating an appropriate and adoptable solution.

2. Starting with too much data.

When your team finds itself working through a process with billions of rows of data, it is nearly impossible to iterate quickly. Every exploration, visualization, and summary takes forever to run, and valuable time is spent watching the progress bar fill up.

Rather than using the entire dataset as a building block, look for some natural subset of the data that makes the process manageable:

  • Geographic—Can you focus on just one region, state, or city?
  • Time—Is it possible to only look at the last three months of data? Or the last year?
  • Class Variable—Maybe there is some other variable that can provide a division, such as Brand or Type.
  • Random sample—Maybe the above methods introduce bias that makes you uncomfortable using the subset as a pilot dataset. In this case, consider sampling a random x% (or stratified random x%) of the data to begin building a prototype.

Once the bulk of the exploration, cleaning, and preparation (and perhaps some sample models) is complete, the full dataset can be brought back into the process, if necessary. (Note: check that the assumptions made on the smaller subset still hold for the full data.)
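
As an illustration, here is a minimal sketch of these subsetting strategies in pandas. The DataFrame df and its columns (region, order_date, brand, churned) are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Assumes a DataFrame `df` with hypothetical columns:
# region, order_date (datetime), brand, and a target, churned.

# Geographic subset: one region only
pilot = df[df["region"] == "Southeast"]

# Time subset: last three months of data
cutoff = df["order_date"].max() - pd.DateOffset(months=3)
pilot = df[df["order_date"] >= cutoff]

# Class-variable subset: a single brand
pilot = df[df["brand"] == "BrandA"]

# Simple random 5% sample, seeded for reproducibility
pilot = df.sample(frac=0.05, random_state=42)

# Stratified random 5% sample, preserving the target's class balance
pilot = (
    df.groupby("churned", group_keys=False)
      .sample(frac=0.05, random_state=42)
)
```

Seeding the samplers keeps the pilot reproducible, and the stratified version addresses the bias concern raised in the last bullet by preserving the target's class balance in the subset.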

3. Making it too complicated.

You’ve just read an article about Support Vector Machines and are super excited to put the methodology to use. Even though your current project is better suited to a simple decision tree, you try to force the more complicated model into the process. Maybe you gain an additional 2% accuracy by using the SVM, but now you’re hopelessly behind schedule and have lost all explainability with stakeholders. Remember…it is about THE QUESTION. The best solution is often the simplest, not the most complicated (shout out to William of Ockham).
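
To make the trade-off concrete, here is a minimal sketch comparing the two models, assuming scikit-learn and a synthetic dataset in place of a real project's data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a real project dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Simple, explainable baseline: a shallow decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

# More complex alternative: an RBF-kernel SVM
svm = SVC(kernel="rbf", random_state=42).fit(X_train, y_train)

print("Tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("SVM accuracy: ", accuracy_score(y_test, svm.predict(X_test)))
```

If the accuracy gap is only a point or two, the tree's explainability (and the schedule savings) usually wins.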

4. Forgetting to say “I don’t know”.

Remember that even though you are the expert, nobody expects you to be able to answer every question on the fly. Nobody will think poorly of you for saying those three words. In fact, they will likely respect you more for saying them. Tell them that you need to research further and that you will send an update once you have found an answer. Make a note to yourself so that you can remember to follow up. Coming back to the team with a well-researched answer is infinitely more effective than stumbling through a half answer.


5. Failing to Validate Results.

Ultimately, the models we build should lead the business towards some recommended action. However, that is not the end of the data science process. Once an action has been taken, it is important to measure its impact. The team should spend a significant amount of time creating a validation plan before implementation. How will data be collected? Which measures will determine success? What are some anticipated problems and some suggested alternatives? Addressing these components of the post-modeling plan will help shape the process itself and ensure that results are meaningful.
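
As one concrete example of a success measure, here is a minimal sketch of a two-proportion z-test comparing a conversion rate before and after the recommended action. The counts are hypothetical, and a real validation plan would also account for seasonality and selection effects:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a - p_b, z, p_value

# Hypothetical numbers: conversions after vs. before the action was taken
lift, z, p = two_proportion_ztest(successes_a=260, n_a=2000,   # after
                                  successes_b=210, n_b=2000)   # before
print(f"lift={lift:.3%}, z={z:.2f}, p={p:.4f}")
```

Deciding on the test, the metric, and the required sample size before implementation is exactly the kind of planning that keeps the results meaningful.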

I hope this article has provided some insight into the most common mistakes made by data scientists, as well as some practical ways to avoid them. Which ones did I miss? Let me know in the comments!

Nicholas Pylypiw is a data scientist at Cardinal Solutions in Charlotte, NC. To learn more about how the data science process can empower your business, contact info@cardinalsolutions.com.



