Education of a Data Scientist

Recently I was approached to help design the curriculum for a master’s degree program in data science. I accepted this invitation with alacrity since I have recruited several data scientists and have strong opinions about gaps in their education. 

Data science is a broad discipline and data scientists’ job descriptions are varied, but I am going to force them into two distinct categories. The first category is the algorithmists. These are the data scientists who design and implement algorithms, develop software libraries for machine learning and data mining, and design and implement data pipelines. These are the people who implemented, say, TensorFlow or Python’s scikit-learn library, or who implemented distributed and parallel algorithms for training deep networks. Their job titles include Applied Research Scientist, Machine Learning Engineer, Data Engineer and the like. They almost always have a master’s degree, sometimes a PhD.

The second and much larger category is the applied data scientist. These are scientists who apply techniques from machine learning, statistics, data mining and visualization to draw insights, build predictive models and run experiments. They may work for e-commerce companies implementing product recommendations, for media companies implementing content personalization, or for public health organizations building models for disease outbreak prediction or evaluating the impact of policies. Lately, there has been a surge in the application of machine learning to areas such as policing, recruiting, and education.

These two categories require different kinds of education and different skills. However, it is my belief that most university curricula cater to the algorithmists and very few provide the kind of training that the applied data scientists need. The university I am working with is trying to fill this gap.

What is the education that applied data scientists need? First, the basics: a solid understanding of statistics and probability (especially statistical inference, experiment design and probability distributions) is essential. Second, they need a solid understanding of machine learning and data mining algorithms. In their profession, they will use well-established libraries and rarely poke into their internals, but they should understand the underlying assumptions, the appropriate use and the limitations of each algorithm. They should be able to customize loss functions when appropriate.
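As an illustration of what customizing a loss function can look like, here is a minimal Python sketch of the pinball (quantile) loss, which penalizes under-prediction and over-prediction differently. The demand figures and the helper `best_constant` are hypothetical and chosen only to show how the choice of loss changes the "best" prediction.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss.

    Under-predictions are weighted by `quantile` and over-predictions by
    `1 - quantile`, so quantile > 0.5 penalizes under-prediction more.
    """
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        total += quantile * error if error >= 0 else (quantile - 1) * error
    return total / len(y_true)


def best_constant(values, quantile):
    """Hypothetical helper: the observed value that, used as a constant
    prediction for every case, gives the lowest pinball loss."""
    return min(
        values,
        key=lambda c: pinball_loss(values, [c] * len(values), quantile),
    )


# Hypothetical daily demand for a product.
demand = [8, 10, 12, 14, 30]

# Symmetric penalty: the best constant prediction is the median.
print(best_constant(demand, quantile=0.5))  # prints 12

# Running out of stock is treated as far worse than overstocking,
# so the best constant prediction moves up to the high end.
print(best_constant(demand, quantile=0.9))  # prints 30
```

The model and the data are the same in both calls; only the loss differs, and that alone changes which prediction counts as best.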

On to courses that are not usually taught in standard data science curricula. 

I advocate a course in “Decision Analysis”. Decision analysis is a well-established discipline, mostly taught in business schools, and data scientists will benefit from its concepts and tools. It is the formal discipline of decision making under uncertainty and provides a method for evaluating decision alternatives, outcomes and the impact of information. Applied data scientists would do well to understand that most ML models are a component of a larger decision-making process or system. A product recommendation system is a component of a larger e-commerce system; a click-prediction model is part of an ad delivery system; a model for scoring job candidates is part of an employee recruitment and management process. A decision analysis framework would help data scientists evaluate the impact of false positives and false negatives and weigh the benefits against the costs of obtaining new data.
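To make the false-positive versus false-negative trade-off concrete, here is a minimal Python sketch of choosing a decision threshold by expected cost. It is not a full decision analysis; the scores, labels, costs and function names are all hypothetical and serve only to illustrate the idea.

```python
def expected_cost(scored, threshold, cost_fp, cost_fn):
    """Average cost per case of acting on every score >= threshold.

    scored: list of (score, is_positive) pairs from a validation set.
    """
    total = 0.0
    for score, is_positive in scored:
        flagged = score >= threshold
        if flagged and not is_positive:
            total += cost_fp  # acted when we should not have
        elif not flagged and is_positive:
            total += cost_fn  # missed a case we should have caught
    return total / len(scored)


def best_threshold(scored, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest expected cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(scored, t, cost_fp, cost_fn),
    )


# Hypothetical validation data: (model score, true label).
validation = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
              (0.4, False), (0.3, True), (0.2, False), (0.1, False)]
candidates = [0.25, 0.5, 0.75]

# A miss is ten times as costly as a false alarm: a low threshold wins.
print(best_threshold(validation, candidates, cost_fp=1.0, cost_fn=10.0))  # prints 0.25

# A false alarm is ten times as costly as a miss: a high threshold wins.
print(best_threshold(validation, candidates, cost_fp=10.0, cost_fn=1.0))  # prints 0.75
```

The model's scores never change here. What changes is the cost attached to each kind of error, which comes from the surrounding system and not from the model itself.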

A course in “Understanding Data”, which encompasses a thorough understanding of data provenance, exploratory data analysis and data visualization, is essential. Everyone knows the garbage in, garbage out principle, but few pay sufficient attention to it. Few data scientists realize that most big data sets are biased. Modelers need to understand that machine-learned models are often embedded in a human-machine system and influence behavior, and that data generated by that behavior is biased by this influence. There are many other ways in which the data generated may be biased or otherwise unsuitable for model training (I will be discussing this in a separate post). A key component of Understanding Data is Exploratory Data Analysis (EDA), which seems to be a lost art today.

A course on Research Design will teach the budding data scientist how to construct proper descriptive as well as explanatory research questions, design data collection strategies (such as surveys), and design and run experiments. They will learn how to draw appropriate conclusions from research studies. Strong research skills go hand in hand with the ability to communicate research findings and convince a scientific audience of them. I tell young data scientists that data science is essentially telling stories with data. These are stories built on a strong inductive argument and supported by illustrative visuals. And, like all good communicators, they should be able to anticipate questions.

I haven’t mentioned computational skills, but that is a given. A data scientist should be a strong programmer, especially in Python. She should also be adept at working with databases of various types, including relational and NoSQL.

A good data scientist should be committed to the domain she is focused on. A data scientist working on public health questions should identify as a public health scientist rather than as a machine learning engineer. A data scientist building predictive models for e-commerce should be interested in marketing and e-commerce. Only then will she know to ask the important questions and make sense of the data.

Since applied data science is all about practice, applied data science curricula should include project-based training. My recommendation is that each student be assigned two mentors on a project: one a specialist in the techniques to be deployed in the project and the other a specialist in the project domain. In addition, during the project phase of the program the student should be encouraged to take appropriate courses in the project domain that introduce the concepts and methods of the discipline.

Let me end with a story. Recently a team of researchers presented their work on automating the classification of crimes as gang-related or not at a conference on Artificial Intelligence, Ethics, and Society (http://www.sciencemag.org/news/2018/02/artificial-intelligence-could-identify-gang-crimes-and-ignite-ethical-firestorm). They used a technique called a “Partially Generative Neural Network”. The researchers argued that their model could produce faster and better results than unaided human judgement. The audience questioned the presenter aggressively about the data and about what the impact of someone being mislabeled as a gang member would be. The presenter wasn’t sure. Cornered by these questions, he threw up his hands and said, “I’m just an engineer!” A well-designed data science curriculum should seek to prevent this type of response from its graduates.

Excellent article!  Too many times we see people with the right vocabulary and mild understanding of data science leading people without understanding of data science.

Good classification and narration Prabhakar.

Excellent recommendations! Loved the outlines of the Decision Analysis and Understanding Data courses that you laid out. Definite gaps in the existing education curricula.

