Bridging the Gap between Machine Learning Education and Practice

This past semester I taught a graduate-level course in Machine Learning. The students were all working toward a master's degree in Communication and came to the class with a limited range of computer science experience. As this was my first foray into teaching the topic, I leaned heavily on established approaches to demonstrating Machine Learning practices. Setting aside the need to switch from in-person to online teaching halfway through the semester, I found that those established approaches didn't offer key lessons necessary for implementing a Machine Learning project in the real world. This article captures those shortfalls.

A little background: my course followed a typical graduate-level syllabus for teaching Machine Learning fundamentals. Where it deviated from the norm was the final assignment, in which each student found a real-world Machine Learning project and analyzed it across a range of criteria, including performance and utility. I felt that the capacity to critically examine real-world ML projects would serve these students better than simply picking up additional, useful ML programming techniques (most of which can be acquired independently).

What my students found in examining real-world Machine Learning projects is that data quality and domain knowledge are central to successful ML project outcomes. However, many ML syllabi rely on standard, tailored datasets (e.g., the MNIST dataset, the Iris flower dataset, the UCI heart disease dataset) that have been optimized to give proven results. Without guidance, students may not appreciate the level of effort it takes to gather and prepare data for effective ingestion by different ML algorithms. Drawn from dozens of analyzed ML projects, the class's central conclusion was that the exceptional model performance seen in traditional teaching examples can only be achieved in the real world when significant effort is devoted to preparing datasets to a high level of completeness, coupled with a more-than-passing understanding of the input dataset's characteristics. In addition to tailoring input data to the requirements of different ML algorithms (some of which can only use scaled, numerical values), my students learned that 80% of the effort in building a successful ML project can easily be spent on data preparation before a modeler should even think about the choice of solution method or algorithm.
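The kind of preparation described above can be sketched in a few lines. The example below uses invented toy numbers and shows two of the most common steps: mean-imputing a missing value and standardizing features so that scale-sensitive algorithms can ingest them.

```python
import numpy as np

# Toy feature matrix (age, income) with a missing value and very different scales.
X = np.array([
    [25.0,  50_000.0],
    [32.0,  np.nan],      # missing income
    [47.0, 120_000.0],
])

# 1. Impute missing entries with the column mean (a common baseline).
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# 2. Standardize each column to zero mean / unit variance so that
#    "age" and "income" contribute comparably to scale-sensitive models.
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)

print(X_scaled.mean(axis=0))  # ~ [0, 0]
print(X_scaled.std(axis=0))   # ~ [1, 1]
```

Real projects layer many more steps on top of this (outlier handling, encoding categorical features, leakage checks), which is exactly where the 80% of effort goes.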

Gartner recently released a report, "Practical Insights From Active Data Science Teams and Mature Machine Learning Strategists," which identified data quality and cultural resistance as the two primary obstacles to ML adoption; to a great extent, my class's conclusions represent the other side of that coin. In a rush to build ML models, modelers often used datasets that were incomplete (many missing values) or misunderstood (feature definitions poorly articulated or missing), or attempted to jam the same dataset into different algorithms without considering each algorithm's unique needs and operating characteristics. Such practices produced model performance significantly lower than what students experience with the traditional teaching projects.
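The point about jamming the same dataset into different algorithms can be made concrete with a small invented example. A distance-based method such as k-nearest neighbors is dominated by whichever feature has the largest scale unless the data are standardized first; the same raw dataset fed to a distance-based model and to a tree-based model (which is scale-indifferent) behaves very differently.

```python
import numpy as np

# Three candidate neighbors with features on very different scales:
# age (tens) and income (tens of thousands).
points = np.array([
    [30.0, 50_000.0],   # A: almost the same age as the query
    [55.0, 50_500.0],   # B: similar income, very different age
    [40.0, 90_000.0],   # C: different on both
])
query = np.array([31.0, 50_400.0])

# Raw Euclidean distance: income dominates, so B looks nearest.
d_raw = np.linalg.norm(points - query, axis=1)

# Standardize both features first; age now carries equal weight and A is nearest.
sigma = points.std(axis=0)
d_scaled = np.linalg.norm((points - query) / sigma, axis=1)

print(d_raw.argmin())     # 1 (B)
print(d_scaled.argmin())  # 0 (A)
```

The "nearest" point flips depending on preprocessing, which is why the same dataset cannot simply be reused across algorithms without considering each one's operating characteristics.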

These shortfalls raised reasonable questions about the validity of the models' results, especially when evaluating individual models against a held-out test dataset showed significant drop-offs in performance (a sure sign of over-fitting). The class concluded that relevant features need to be properly defined, datasets need to be tailored to different computational approaches, and results must be vetted against the common sense built on domain knowledge of the topic being modeled. From a real-world implementation perspective, the findings suggest that the best ML projects rest on solid data analysis and engineering, include relevant domain knowledge on the project team, and reflect a willingness to work with a finite set of ML algorithms once the data are ready. That approach will yield the best results and the highest return on ML investment.
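The train/test drop-off the class treated as an over-fitting signal is easy to reproduce. The sketch below uses toy data (not from any student project): a far-too-flexible polynomial fits noisy linear data almost perfectly on the training split, yet its error on held-out data is worse than a simple model's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear data: y = 2x + noise, split into train and test halves.
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.3, 40)
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def fit_and_errors(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = fit_and_errors(1)    # matches the true model
train_hi, test_hi = fit_and_errors(12)   # far too flexible for 20 points

# The flexible model memorizes the training set (lower train error)
# but generalizes worse: a large train/test gap is the drop-off
# that flags over-fitting.
print(train_lo, test_lo)
print(train_hi, test_hi)
```

Spotting this gap requires holding out a test set that the model never sees during fitting, which is part of the data discipline the class's conclusions call for.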


More articles by Stephen Minnig
