What's in a Model?

Thomas Speidel

Published Apr 9, 2017

A key concept in the world of Data Science is that of a model. A model is simply a generalization of reality expressed in mathematical form.

The overwhelming majority of models in data science and machine learning are probabilistic, because the true state of information available is either limited, unknown or subject to error, and thus, uncertain.
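To make this concrete, here is a minimal sketch of a probabilistic model: a straight line plus an error term, fitted by least squares. The data values and the function name `fit_line` are made up for illustration. The point is only that the model reports a residual standard deviation alongside the fitted line, which is how it expresses uncertainty.

```python
import math

# Made-up illustrative data (not from any real study).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

def fit_line(x, y):
    """Least-squares fit of y = a + b*x + error; returns (a, b, sigma)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation, using n - 2 degrees of freedom
    sigma = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return a, b, sigma

a, b, sigma = fit_line(x, y)
print(f"y = {a:.2f} + {b:.2f} * x, residual sd = {sigma:.2f}")
```

The line is the generalization; the residual standard deviation is the part that acknowledges the data are subject to error.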

Probabilistic models date back to the work of Gauss in the 1800s, and their use grew rapidly in the 1900s. Soon, many areas of science began to utilize models to try to explain problems empirically.

These sub-branches go by names such as biostatistics, bioinformatics, psychometrics, econometrics, chemometrics, signal processing, astrostatistics, statistical physics and so on.

A side effect of having many areas of science doing similar work is a varied and sometimes confusing terminology. For instance, seemingly similar models go by different names in different fields. The term algorithm is often employed by those trained in computer science.

How do we choose a model?

One cannot simply apply some model to some data. Data preparation aside, numerous aspects need to be considered, and a great deal of experimentation is required. Here are some common points to evaluate:

Domain context on the problem we are trying to solve
The scale of the response: categorical, ordinal, interval, ratio
Objective: predict or explain
The amount of data relative to the number of variables
The ultimate usage of the model
The assumptions being made by a given model
The presence of outliers and extreme data points
Computational complexity
The amount of missing data
Censoring, truncation, repeated measurements, autocorrelation etc.

Evaluation of these aspects can guide the analyst towards a subset of candidate models.
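As a rough sketch of how a few of the points above narrow the field, the snippet below maps the scale of the response, plus two of the listed complications, to some commonly used model families. The lookup table and the function `shortlist` are hypothetical and far from exhaustive. In practice the choice also depends on every other item in the list.

```python
# Hypothetical lookup: response scale -> a few commonly used model families.
CANDIDATES = {
    "categorical": ["logistic regression", "multinomial regression"],
    "ordinal": ["proportional odds (ordinal logistic) regression"],
    "interval": ["linear regression"],
    "ratio": ["linear regression", "gamma regression"],
}

def shortlist(scale, censored=False, repeated_measures=False):
    """Return an illustrative list of candidate model families."""
    if scale not in CANDIDATES:
        raise ValueError(f"unknown response scale: {scale!r}")
    if censored:
        # Censored responses call for time-to-event models instead.
        return ["Cox proportional hazards regression",
                "parametric survival regression"]
    models = list(CANDIDATES[scale])
    if repeated_measures:
        # Repeated measurements violate independence between observations.
        models.append("mixed-effects (multilevel) model")
    return models

print(shortlist("ordinal"))
print(shortlist("ratio", censored=True))
print(shortlist("interval", repeated_measures=True))
```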

Just how many models are there?

Lots. I'm not aware of a comprehensive list of models. Max Kuhn, author of Applied Predictive Modeling, has compiled a list of some 234 models. It is not comprehensive and is admittedly more focused on machine learning models. For instance, there are no time-to-event models or geospatial models. Yet, it gives a glimpse into the world of available models.

I have reproduced Kuhn's list in this Google spreadsheet from his original. Kuhn also provides similarity among models which he nicely illustrates here with a network graph. Each dot represents a model and each network represents a family.

TensorFlow, Deep Learning and Tomorrow's Models

The much publicized deep learning, along with libraries for it such as TensorFlow, is simply another class of models. Deep learning is a type of artificial neural network, and neural networks have been around since the 1940s and 1950s.

These and some other modern modelling approaches tend to be data hungry and computationally demanding. The combination of these aspects makes them well suited to more sophisticated computational architectures such as parallel and GPU execution.

Availability of Models in Modern Software

The overwhelming majority of models are based on work that has been published in academic journals or conference proceedings. It is not unusual for the authors themselves to provide an implementation of the model in the form of a package or library in a programming language such as R, Python, Matlab or C++.

Popular models, such as logistic regression, are implemented in most statistical software such as R, SAS, SPSS and Stata, in the general-purpose language Python, and via APIs such as H2O, to name a few. However, defaults and model arguments usually differ across software.
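One well-known instance of differing defaults: scikit-learn's LogisticRegression applies an L2 penalty unless told otherwise, whereas R's glm fits an unpenalized maximum likelihood model. The sketch below does not call either library. It is a small pure-Python gradient descent, with made-up data and a hypothetical function `fit_logistic`, meant only to show that the same model with and without a penalty yields different coefficients. The penalty strength used is arbitrary.

```python
import math

# Made-up, non-separable toy data: one predictor, binary outcome.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

def fit_logistic(x, y, l2=0.0, lr=0.1, steps=5000):
    """Fit a one-predictor logistic regression by gradient descent.

    Minimizes the summed negative log-likelihood plus (l2 / 2) * slope**2.
    The intercept is not penalized; l2=0.0 gives the plain maximum
    likelihood fit.
    """
    intercept, slope = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(intercept + slope * xi)))
            g0 += p - yi            # gradient w.r.t. intercept
            g1 += (p - yi) * xi     # gradient w.r.t. slope
        g1 += l2 * slope            # penalty contribution
        intercept -= lr * g0 / n
        slope -= lr * g1 / n
    return intercept, slope

mle = fit_logistic(x, y)               # unpenalized fit
penalized = fit_logistic(x, y, l2=1.0) # penalized fit
print(f"unpenalized slope: {mle[1]:.3f}")
print(f"penalized slope:   {penalized[1]:.3f}")
```

The penalized slope is shrunk towards zero, so two analysts running "logistic regression" in different software can report different estimates from identical data unless they align the arguments.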

More modern modelling methods are often available in only a handful of languages or APIs, the most common being Python and R.

About the Author

Thomas Speidel, P.Stat., is a Canadian statistician. He spent ten years working in cancer research before moving to the energy industry. Thomas is often seen writing and commenting on issues of statistical literacy on LinkedIn, Twitter and several blogs, and is a co-founder of About Data Analysis, a LinkedIn group.

LinkedIn: ca.linkedin.com/in/speidel/en

Twitter: @ThomasSpeidel

About.me: http://about.me/Thomas.Speidel

Boris Shmagin 9y

There is a definition for model: “A model is merely your reflection of reality &, like probability, it describes neither you nor the world, but only a relationship between you & that world” Dennis Lindley, from Philosophy of Statistic, 2000


Denis Avrilionis 9y

Great article! It is never enough to repeat that a model exists in order to represent some specific aspects of the real world. Without a clear objective a model is useless. Beyond data science, creating models as a way to grasp complex problems or to explore solutions is the right approach towards building new tools.


Maurizio Sanarico 9y

Good review. With regard to models, of course it is practically impossible to list all of the existing models, moreover, their number is increasing without rest.
