What's in a Model?
A key concept in data science is that of a model. A model is simply a generalization of reality expressed in mathematical form.
The overwhelming majority of models in data science and machine learning are probabilistic, because the information available is limited, unknown or subject to error, and thus uncertain.
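To make the idea of a probabilistic model concrete, here is a minimal sketch in Python. It assumes a straight-line relationship with Gaussian noise; the function names, parameter values and sample size are all made up for illustration. The data are simulated from a known "reality" and the line is then estimated by ordinary least squares.

```python
import random

def simulate(n, intercept, slope, noise_sd, seed=0):
    """Generate data from y = intercept + slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Estimate intercept and slope by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative values only: the "true" intercept is 2.0 and the slope is 0.5.
xs, ys = simulate(n=200, intercept=2.0, slope=0.5, noise_sd=1.0)
intercept_hat, slope_hat = fit_line(xs, ys)
print(f"estimated intercept={intercept_hat:.2f}, slope={slope_hat:.2f}")
```

The estimates land close to the true values but not exactly on them. That gap is the uncertainty a probabilistic model is meant to describe.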
The origins of probabilistic models date back to the work of Gauss in the early 1800s, and the field grew rapidly in the 1900s. Soon, many areas of science began to use models to explain problems empirically.
The resulting sub-branches go by names such as biostatistics, bioinformatics, psychometrics, econometrics, chemometrics, signal processing, astrostatistics, statistical physics and so on.
A side effect of having many areas of science doing similar work is a varied and sometimes confusing terminology. For instance, seemingly similar models go by different names in different fields, and those trained in computer science often use the term algorithm instead.
How do we choose a model?
One cannot simply apply some model to some data. Data preparation aside, numerous aspects need to be considered, and a great deal of experimentation is required. Here are some common points to evaluate:
- Domain context on the problem we are trying to solve
- The scale of the response: categorical, ordinal, interval, ratio
- Objective: predict or explain
- The amount of data relative to the number of variables
- The ultimate usage of the model
- The assumptions being made by a given model
- The presence of outliers and extreme data points
- Computational complexity
- The amount of missing data
- Censoring, truncation, repeated measurements, autocorrelation etc.
Evaluation of these aspects can guide the analyst towards a subset of candidate models.
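As a rough illustration of how a few of these points can narrow the field, the sketch below maps the scale of the response, the objective and the presence of censoring to a shortlist of candidates. The function name and the shortlists are hypothetical and deliberately incomplete. In practice this choice rests on judgment and experimentation, not on a lookup table.

```python
# Hypothetical shortlists for illustration only; real choices depend on
# context, assumptions, data volume and the other points listed above.
EXPLANATORY = {
    "categorical": ["logistic regression"],
    "ordinal": ["proportional odds model"],
    "interval": ["linear regression"],
    "ratio": ["linear regression", "gamma regression"],
}
PREDICTIVE_EXTRAS = {
    "categorical": ["random forest"],
    "ordinal": [],
    "interval": ["random forest"],
    "ratio": ["random forest"],
}

def candidate_models(response_scale, objective, censored=False):
    """Return an illustrative shortlist of candidate models."""
    if objective not in ("predict", "explain"):
        raise ValueError(f"unknown objective: {objective}")
    if censored:
        # Time-to-event data call for survival models whatever the scale.
        return ["Kaplan-Meier estimator", "Cox proportional hazards model"]
    if response_scale not in EXPLANATORY:
        raise ValueError(f"unknown response scale: {response_scale}")
    shortlist = list(EXPLANATORY[response_scale])
    if objective == "predict":
        # Flexible, less interpretable models become reasonable candidates.
        shortlist += PREDICTIVE_EXTRAS[response_scale]
    return shortlist

print(candidate_models("categorical", "predict"))
print(candidate_models("ratio", "explain", censored=True))
```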
Just how many models are there?
Lots. I'm not aware of a comprehensive list of models. Max Kuhn, author of Applied Predictive Modeling, has compiled a list of some 234 models. It is not a comprehensive list and it's admittedly more focused on machine learning models. For instance, there are no time-to-event models or geospatial models. Yet, it gives a glimpse into the world of available models.
I have reproduced Kuhn's list in this Google spreadsheet from his original. Kuhn also provides a measure of similarity among models, which he nicely illustrates with a network graph. Each dot represents a model and each network represents a family.
TensorFlow, Deep Learning and Tomorrow's Models
The much-publicized TensorFlow is a software library for building deep learning models, and deep learning is simply another class of models. Specifically, deep learning is a type of artificial neural network, and neural networks have been around since at least the 1950s.
These and some other modern modelling approaches tend to be data hungry and computationally demanding. The combination of these aspects makes them well suited to more sophisticated computational architectures such as parallel and GPU execution.
Availability of Models in Modern Software
The overwhelming majority of models are based on work that has been published in academic journals or conference proceedings. It is not unusual for the very authors to provide the implementation of the model in the form of a package/library in a programming language such as R, Python, Matlab or C++.
Popular models, such as logistic regression, are implemented in most statistical software, such as R, SAS, SPSS and Stata, in general-purpose languages such as Python, and via APIs such as H2O. However, defaults and model arguments usually differ across software.
More modern modelling methods are often available in only a handful of languages or APIs, the most common being Python and R.
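Differing defaults can change the answer. For instance, scikit-learn's LogisticRegression applies an L2 penalty by default, whereas R's glm fits by unpenalized maximum likelihood. The sketch below is a hand-rolled gradient descent on a small made-up dataset, not either library's actual implementation. It shows how the same model yields different coefficients with and without such a penalty.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, l2=0.0, step=0.05, iterations=5000):
    """Fit a one-feature logistic regression by gradient descent.

    Minimizes the negative log-likelihood plus 0.5 * l2 * slope**2.
    The intercept is left unpenalized.
    """
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        grad_intercept, grad_slope = 0.0, 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(intercept + slope * x) - y
            grad_intercept += error
            grad_slope += error * x
        grad_slope += l2 * slope  # gradient of the L2 penalty term
        intercept -= step * grad_intercept
        slope -= step * grad_slope
    return intercept, slope

# Made-up data for illustration; the classes overlap, so the
# unpenalized maximum likelihood estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1]

_, slope_mle = fit_logistic(xs, ys, l2=0.0)  # no penalty
_, slope_pen = fit_logistic(xs, ys, l2=1.0)  # assumed penalty strength of 1
print(f"unpenalized slope: {slope_mle:.3f}")
print(f"penalized slope:   {slope_pen:.3f}")
```

The penalized slope is shrunk towards zero. An analyst moving between packages without checking the defaults could report different effect sizes from identical data.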
About the Author
Thomas Speidel, P.Stat., is a Canadian statistician. He spent ten years working in cancer research before moving to the energy industry. Thomas is often seen writing and commenting on issues of statistical literacy on LinkedIn, Twitter and several blogs, and is a co-founder of About Data Analysis, a LinkedIn group.
LinkedIn: ca.linkedin.com/in/speidel/en
Twitter: @ThomasSpeidel
About.me: http://about.me/Thomas.Speidel
There is a definition for model: “A model is merely your reflection of reality &, like probability, it describes neither you nor the world, but only a relationship between you & that world” (Dennis Lindley, The Philosophy of Statistics, 2000)
Great article! It is never enough to repeat that a model exists in order to represent some specific aspects of the real world. Without a clear objective a model is useless. Beyond data science, creating models as a way to grasp complex problems or to explore solutions is the right approach towards building new tools.
Good review. With regard to models, of course it is practically impossible to list all of the existing ones; moreover, their number keeps increasing.