Bridging the Gap between Machine Learning Education and Practice

This past semester I taught a graduate-level course in Machine Learning. The students were all working toward a master's degree in Communication and came to the class with a limited range of computer science experience. As this was my first foray into teaching the topic, I leaned heavily on established approaches to demonstrating Machine Learning practices. Setting aside the need to switch from in-person to online teaching halfway through the semester, I found that those established approaches didn't offer key lessons necessary for implementing a Machine Learning project in the real world. This article captures those shortfalls.

A little background: my course followed a typical graduate-level syllabus for teaching Machine Learning fundamentals. Where it deviated from the norm was the final assignment, in which each student found a real-world Machine Learning project and analyzed it across a range of criteria, including performance and utility. I felt that the capacity to critically examine real-world ML projects would serve these students better than simply picking up additional, useful ML programming techniques (most of which can be acquired independently).

What my students found in examining real-world Machine Learning projects is that data quality and domain knowledge are central to successful ML project outcomes. However, many ML syllabi rely on standard, tailored datasets (e.g., the MNIST dataset, the Iris flower dataset, the UCI heart disease dataset) that have been optimized to give proven results. Without guidance, students may not appreciate the level of effort it takes to gather and prepare data for effective ingestion by different ML algorithms. Drawn from dozens of analyzed ML projects, the class's central conclusion was that the exceptional model performance seen in traditional teaching examples can only be achieved in the real world when significant effort is devoted to preparing datasets to a high level of completeness, coupled with a more-than-passing understanding of the input dataset's characteristics. In addition to tailoring input data to the requirements of different ML algorithms (some of which can only use scaled, numerical values), my students learned that 80% of the effort in building a successful ML project can easily be spent on data preparation before a modeler should even think about the choice of solution method or algorithm.
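The kind of preparation described above can be sketched in a few lines. The example below uses invented toy numbers and shows two of the most common steps: mean-imputing a missing value and standardizing features so that scale-sensitive algorithms can ingest them.

```python
import numpy as np

# Toy feature matrix (age, income) with a missing value and very different scales.
X = np.array([
    [25.0,  50_000.0],
    [32.0,  np.nan],      # missing income
    [47.0, 120_000.0],
])

# 1. Impute missing entries with the column mean (a common baseline).
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# 2. Standardize each column to zero mean / unit variance so that
#    "age" and "income" contribute comparably to scale-sensitive models.
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)

print(X_scaled.mean(axis=0))  # ~ [0, 0]
print(X_scaled.std(axis=0))   # ~ [1, 1]
```

Real projects layer many more steps on top of this (outlier handling, encoding categorical features, leakage checks), which is exactly where the 80% of effort goes.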

Gartner recently released a report, "Practical Insights From Active Data Science Teams and Mature Machine Learning Strategists," which identified data quality and cultural resistance as the two primary obstacles to ML adoption; to a great extent, my class's conclusions represent the other side of that coin. In a rush to build ML models, modelers often used datasets that were incomplete (many missing values) or misunderstood (feature definitions poorly articulated or missing), or attempted to jam the same dataset into different algorithms without considering each algorithm's unique needs and operating characteristics. Such practices produced model performance significantly lower than what students experience with the traditional teaching projects.
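The point about jamming the same dataset into different algorithms can be made concrete with a small invented example. A distance-based method such as k-nearest neighbors is dominated by whichever feature has the largest scale unless the data are standardized first; the same raw dataset fed to a distance-based model and to a tree-based model (which is scale-indifferent) behaves very differently.

```python
import numpy as np

# Three candidate neighbors with features on very different scales:
# age (tens) and income (tens of thousands).
points = np.array([
    [30.0, 50_000.0],   # A: almost the same age as the query
    [55.0, 50_500.0],   # B: similar income, very different age
    [40.0, 90_000.0],   # C: different on both
])
query = np.array([31.0, 50_400.0])

# Raw Euclidean distance: income dominates, so B looks nearest.
d_raw = np.linalg.norm(points - query, axis=1)

# Standardize both features first; age now carries equal weight and A is nearest.
sigma = points.std(axis=0)
d_scaled = np.linalg.norm((points - query) / sigma, axis=1)

print(d_raw.argmin())     # 1 (B)
print(d_scaled.argmin())  # 0 (A)
```

The "nearest" point flips depending on preprocessing, which is why the same dataset cannot simply be reused across algorithms without considering each one's operating characteristics.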

These shortfalls raised reasonable questions about the validity of the models' results, especially when evaluating individual models against a held-out test dataset showed significant drop-offs in performance (a sure sign of over-fitting). The class concluded that relevant features need to be properly defined, datasets need to be tailored to different computational approaches, and results must be vetted against the common sense built on domain knowledge of the topic being modeled. From a real-world implementation perspective, the findings suggest that the best ML projects rest on solid data analysis and engineering, include relevant domain knowledge on the project team, and reflect a willingness to work with a finite set of ML algorithms once the data are ready. That approach will yield the best results and the highest return on ML investment.
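The train/test drop-off the class treated as an over-fitting signal is easy to reproduce. The sketch below uses toy data (not from any student project): a far-too-flexible polynomial fits noisy linear data almost perfectly on the training split, yet its error on held-out data is worse than a simple model's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear data: y = 2x + noise, split into train and test halves.
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.3, 40)
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def fit_and_errors(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = fit_and_errors(1)    # matches the true model
train_hi, test_hi = fit_and_errors(12)   # far too flexible for 20 points

# The flexible model memorizes the training set (lower train error)
# but generalizes worse: a large train/test gap is the drop-off
# that flags over-fitting.
print(train_lo, test_lo)
print(train_hi, test_hi)
```

Spotting this gap requires holding out a test set that the model never sees during fitting, which is part of the data discipline the class's conclusions call for.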


More articles by Stephen Minnig
