The key ingredient to data science (hint: it is not Python)

What is the key skill a data scientist should have? If you are thinking of pursuing a career in data science, or you have been looking to hire a data scientist, you have undoubtedly pondered this question. Possible answers include Python, R, Hadoop, NoSQL, artificial intelligence, natural language processing, machine learning algorithms, statistics, math, data visualization, business acumen, and storytelling, to name only a few. So it is not easy to declare a winner!

In this article, I will attempt to take a different path to answering this question – starting with the why question and working back toward the skills needed for success. Without spoiling it, let me just say that the answer is not contained in the above list.

WHY are data scientists hired?

Data scientists are not hired to use programming language X or database Y; they are hired to solve hard problems. Solving a problem means finding a solution (Output) given a situation (Input). For the math enthusiasts, this can be written as: Output = F(Input). The data scientist's job is to find the function F.

WHAT problems do data scientists solve?

In my experience, problems can be classified into three categories:

Type I problems: Clear Input, Clear Output. “Clear” here means there is nothing ambiguous about what is being given or asked for.

  • Most of the problems we solved back in school are of this type.
  • Examples: What’s the average call duration in our service center? What’s the monthly revenue across all business units?

Type II problems: Clear Output, Unclear Input.

  • In their most usual form they are known as Fermi problems, the bread and butter of strategy consulting or Wall Street interviews. Google uses them too.
  • Examples: How many people have seen our latest TV ad? How much money is spent in Paris annually on recruitment activities? The output is well-defined (number of people or Euros) but the input is unclear.
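A Fermi problem is usually attacked by breaking the unknown quantity into a chain of factors you can guess at, then multiplying them. Below is a minimal sketch for the TV ad example. Every factor name and number is a made-up placeholder chosen for illustration, not real data, and `fermi_estimate` is a hypothetical helper.

```python
# Fermi-style estimate: how many people saw a TV ad?
# Every factor and number below is a hypothetical placeholder, not real data.

def fermi_estimate(factors):
    """Multiply a chain of rough factors into a single estimate."""
    estimate = 1.0
    for _name, value in factors:
        estimate *= value
    return estimate

factors = [
    ("population reachable by the channel", 10_000_000),
    ("share who watch the channel in a given week", 0.20),
    ("share of those watching during the ad slots", 0.25),
    ("share actually paying attention to the ad", 0.50),
]

viewers = fermi_estimate(factors)
print(f"Rough estimate: {viewers:,.0f} viewers")
```

The value of the exercise lies less in the final number than in the list of factors: each one is an Input you have made explicit and can now challenge or refine.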

Type III problems: Unclear Output, (Un)clear Input.

  • This is the most common type found in a business setting.
  • Examples: Can you have a look at the mid-market segment and see if there’s “anything” there? Should we focus more on online growth or on offline efficiency? 

If the problems you deal with are mostly of Type III, then you will need to add an “art” component to the science of problem solving.

HOW do data scientists solve problems? 

Any insight a data scientist may produce comes by way of statistical analysis, and will consequently be statistical in nature. For example, the statement “First-grade pupils are taller than kindergarten children”, while valid on average, will not hold for every pair of children. The notions of probability, uncertainty, and confidence level are inherent to data science.
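To make the height example concrete, here is a small simulation. It assumes, purely for illustration, that heights in each group follow a normal distribution; the means and spreads are invented, not measured.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical height distributions in cm (illustrative values only).
kindergarten = [rng.gauss(110, 5) for _ in range(10_000)]
first_grade = [rng.gauss(116, 5) for _ in range(10_000)]

mean_gap = statistics.mean(first_grade) - statistics.mean(kindergarten)

# Pair children off at random and count how often the first-grader is taller.
taller = sum(f > k for f, k in zip(first_grade, kindergarten))
share_taller = taller / len(first_grade)

print(f"Average gap: {mean_gap:.1f} cm")
print(f"First-grader taller in {share_taller:.0%} of random pairs")
```

Under these assumptions the first-graders are taller on average, yet in a noticeable share of pairs the kindergarten child is the taller one. The statement is true of the groups and uncertain for any individual pair.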

Now, for illustration, let’s assume your boss has asked you to address the Type III example given above: “Should we focus on online growth or on offline efficiency?”. How do you go about finding the solution? A possible path is outlined below:

▹   As the Output is unclear, the very first thing is to clarify it. What is your boss after? Is he or she expecting some high-level research to fuel a workshop on the topic? Or perhaps a few recommendations? Or a detailed action plan? Before you kick off an analysis, ask the right questions to get a clear picture of what the Output should look like. If you have succeeded in doing this, then congratulations: you have just transformed your Type III problem into a Type I or Type II problem, which is easier to crack.

▹   Let’s now assume the Output is linked to a profitability measure called X. The next step is to understand how X depends on various quantities (a.k.a. KPIs) related to the online and offline channels: these are your Input variables. Before you throw a correlation analysis at the problem, it may be useful to pause and verify (find experts and ask them questions!) that you have a good grasp of the people, processes and objects involved in the two types of transactions. In other words, make sure you understand your Inputs inside out.

▹   Once you have selected a set of Inputs you can move on to mapping the Input-Output relationship, i.e. finding the function F. This usually starts with an analysis of the distributions of all Input and Output variables and proceeds through a series of questions: Do the distributions make sense? Do the correlations you measure point in the direction you expected? Are there outliers, i.e. anomalous entries in these distributions? Do you understand the outliers? And so on.
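The first few of these questions can be sketched in a handful of lines. The example below is only an illustration: the data are invented, the names `online_share` and `profit_x` are hypothetical stand-ins for one Input KPI and the Output measure X, and Tukey's interquartile-range rule is just one of several common ways to flag outliers.

```python
import statistics

# Hypothetical monthly figures: one Input KPI and the Output measure X.
online_share = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.27, 0.30, 0.32]
profit_x = [1.0, 1.1, 1.3, 1.4, 1.6, 1.7, 9.0, 2.0, 2.2, 2.3]  # 9.0 looks off

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# 1. Do the distributions make sense?
print("median X:", statistics.median(profit_x))

# 2. Are there outliers?
outliers = iqr_outliers(profit_x)
print("outliers in X:", outliers)

# 3. Does the correlation point the expected way, with and without outliers?
pairs = [(s, x) for s, x in zip(online_share, profit_x) if x not in outliers]
clean_share, clean_x = zip(*pairs)
r_all = pearson(online_share, profit_x)
r_clean = pearson(clean_share, clean_x)
print(f"correlation: {r_all:.2f} (all), {r_clean:.2f} (outlier removed)")
```

In this toy data set a single anomalous month is enough to hide an otherwise strong relationship. The code can flag the entry, but whether to drop it is a question only someone who understands the Inputs can answer.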

This string of questions generated by your thought process is the essence of data science. It is sometimes called the “scientific method”, the “mathematical approach” or “logical reasoning”, and it always comes down to the art of asking the questions that guide you to the answer. It is part education, part experience, and part curiosity. Of the three, curiosity is the hardest one to acquire and arguably the key skill in data science.

WHAT is the recipe for asking the right question?

This question is relevant even beyond the realm of data science. Google’s Eric Schmidt puts curiosity among the top two qualities that predict success. But how do you learn to ask great questions?

There is no universal recipe I know of, but here are three tips to sharpen this skill: 

➀   Play the game of asking yourself: in a world of unlimited resources (people and money) at my disposal, what would I really like to know? Visualizing that Output is a great technique to gain clarity in your thought process.

➁   Introduce a question-based approach to your communication activities. If you are writing a white paper or a blog piece, or creating a PowerPoint presentation, start by announcing to the audience the questions you’ll be considering. List those questions on your first page or slide.

➂   Follow “curious” influencers on LinkedIn or Twitter and read interesting thought pieces. Books such as Dataclysm, Freakonomics, or The Signal and the Noise contain spectacular analyses built around a diverse array of great questions. For the fun side of data science, check out the FiveThirtyEight or OKCupid blogs.  

Parting thought

A centuries-old quote says: “It is easier to judge the mind of a man by his questions rather than his answers”. The author was undoubtedly not thinking of data science, but he might as well have been! The quote still rings true today. With automation rising at an unprecedented speed and machines getting smarter each day (think Artificial Intelligence), the ability to be curious about the world will continue to set us apart and help us thrive in the future.

Not reading this, but I hope it says the key ingredient is 'natural intelligence'.

Complex problem solving is of paramount importance in today's world and life. Your emphasis on understanding the type of problem, the starting point, the objectives ... and your insights on the methodology are a powerful recipe which is definitely applicable beyond data science. One can tell there is a lot of thought put into this article.

Agreed, all about domain expertise. Coding, SQL, algorithms are all just tools and are often glorified at the expense of critical thinking.

More articles by Catalin Ciobanu
