What's in a Model?
https://www.garudax.id/pulse/support-vector-machine-credit-scoring-rohit-dhankar

A key concept in the world of data science is that of a model. A model is simply a generalization of reality expressed in mathematical form.

The overwhelming majority of models in data science and machine learning are probabilistic, because the information available about the true state of the world is limited, unknown or subject to error, and thus uncertain.
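
To make this concrete, here is a minimal sketch of a probabilistic model in Python. Everything in it is an assumption made for illustration: the "reality" is a made-up straight line with Gaussian noise, and the names (TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD) are hypothetical. The point is only that the fitted model has two parts: a systematic part (the line) and a statement about uncertainty (the residual spread).

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "reality": y depends linearly on x, plus random error.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 2.0, 0.5, 1.0
x = [i / 10 for i in range(200)]
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, NOISE_SD) for xi in x]

# Ordinary least squares estimates for a straight line.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# The residual spread is the model's statement about what it cannot explain.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_sd = (sum(r ** 2 for r in residuals) / (len(x) - 2)) ** 0.5

print(f"intercept={intercept:.2f} slope={slope:.2f} residual_sd={residual_sd:.2f}")
```

The estimates will not match the values used to simulate the data exactly, which is the whole idea: the model generalizes from limited, noisy information.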

Probabilistic models date back to the work of Gauss in the 1800s, and their use grew rapidly in the 1900s. Soon, many areas of science began to use models to try to explain problems empirically.

These sub-branches go by the names of biostatistics, bioinformatics, psychometrics, econometrics, chemometrics, signal processing, astrostatistics, statistical physics and so on.

A side effect of having many areas of science doing similar work is a varied and sometimes confusing terminology. For instance, seemingly similar models go by different names in different fields. The term algorithm is often employed by those who trained in computer science.


How do we choose a model?

One cannot simply apply some model to some data. Data preparation aside, numerous aspects need to be considered, and a great deal of experimentation is usually required. Here are some common points to evaluate:

  • Domain context on the problem we are trying to solve
  • The scale of the response: categorical, ordinal, interval, ratio
  • Objective: predict or explain
  • The amount of data relative to the number of variables
  • The ultimate usage of the model
  • The assumptions being made by a given model
  • The presence of outliers and extreme data points
  • Computational complexity
  • The amount of missing data
  • Censoring, truncation, repeated measurements, autocorrelation etc.

Evaluation of these aspects can guide the analyst towards a subset of candidate models.
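
As a rough illustration of how two of these points (the scale of the response and the presence of censoring) can narrow the field, here is a small Python sketch. The catalogue is a deliberately tiny, hand-picked assumption and the names Candidate, CATALOGUE and shortlist are hypothetical; a real shortlist would weigh all of the points above, not just two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    response_scales: frozenset  # scales of the response the model targets
    handles_censoring: bool     # can it use censored observations?


# Illustrative catalogue only (an assumption, far from a complete list).
CATALOGUE = [
    Candidate("linear regression", frozenset({"interval", "ratio"}), False),
    Candidate("logistic regression", frozenset({"categorical"}), False),
    Candidate("proportional odds regression", frozenset({"ordinal"}), False),
    Candidate("Cox proportional hazards", frozenset({"ratio"}), True),
]


def shortlist(response_scale: str, censored: bool = False) -> list[str]:
    """Return catalogue entries compatible with the response scale.

    If the data are censored, keep only models that handle censoring.
    """
    return [
        c.name
        for c in CATALOGUE
        if response_scale in c.response_scales
        and (c.handles_censoring or not censored)
    ]


print(shortlist("ordinal"))
print(shortlist("ratio", censored=True))
```

The remaining candidates would still need to be compared through experimentation; the filter only removes models that are a poor match from the start.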


Just how many models are there?

Lots. I'm not aware of a comprehensive list of models. Max Kuhn, author of Applied Predictive Modeling, has compiled a list of some 234 models. It is not a comprehensive list and it's admittedly more focused on machine learning models. For instance, there are no time-to-event models or geospatial models. Yet, it gives a glimpse into the world of available models.

I have reproduced Kuhn's list in this Google spreadsheet from his original. Kuhn also provides similarity among models which he nicely illustrates here with a network graph. Each dot represents a model and each network represents a family.

Tensorflow, Deep Learning and Tomorrow's Models

The much-publicized deep learning, along with the models built using libraries such as TensorFlow, is simply another class of models. For instance, deep learning is a type of artificial neural network (neural networks have been around since at least the 1950s).

These and some other modern modelling approaches tend to be data hungry and computationally demanding. The combination of these aspects makes them well suited to more sophisticated computational architectures such as parallel and GPU execution.


Availability of Models in Modern Software

The overwhelming majority of models are based on work that has been published in academic journals or conference proceedings. It is not unusual for the very authors to provide the implementation of the model in the form of a package/library in a programming language such as R, Python, Matlab or C++.

Popular models, such as logistic regression, are implemented in most statistical software, such as R, SAS, SPSS and Stata, in general-purpose languages such as Python, and through platforms such as H2O, to name a few. However, defaults and model arguments usually differ across software.
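
One well-known example of differing defaults: at the time of writing, scikit-learn's LogisticRegression applies an L2 penalty by default (C=1.0), whereas R's glm with a binomial family fits by plain maximum likelihood. The sketch below does not call either library. It is a small pure-Python logistic regression, written for illustration under the assumption of a made-up toy dataset, that shows how the same data yield different coefficients depending on whether a penalty is switched on. The function name fit_logistic and its l2_penalty argument are hypothetical.

```python
import math


def fit_logistic(x, y, l2_penalty=0.0, iterations=25):
    """Fit y ~ intercept + slope * x by Newton's method.

    l2_penalty is a ridge penalty on the slope only; 0.0 means plain
    maximum likelihood.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Gradient of the penalized negative log-likelihood.
        g0 = sum(pi - yi for pi, yi in zip(p, y))
        g1 = sum((pi - yi) * xi for pi, yi, xi in zip(p, y, x)) + l2_penalty * b1
        # Hessian entries (2 x 2, symmetric).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x)) + l2_penalty
        det = h00 * h11 - h01 * h01
        # Newton update: b <- b - H^{-1} g.
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data (invented for this sketch); the classes overlap, so the
# maximum likelihood estimate exists.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

mle_intercept, mle_slope = fit_logistic(x, y)                    # no penalty
pen_intercept, pen_slope = fit_logistic(x, y, l2_penalty=1.0)    # ridge penalty

print(f"unpenalized slope: {mle_slope:.3f}")
print(f"penalized slope:   {pen_slope:.3f}")
```

The penalized slope is shrunk towards zero. Neither answer is wrong, but an analyst who moves between tools without checking the defaults may end up comparing two different models.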

More modern modelling methods are often available in only a handful of languages or APIs, the most common being Python and R.


About the Author

Thomas Speidel, P.Stat., is a Canadian statistician. He spent ten years working in cancer research before moving to the energy industry. Thomas is often seen writing and commenting on issues of statistical literacy on LinkedIn, Twitter and several blogs, and is a co-founder of About Data Analysis, a LinkedIn group.

LinkedIn: ca.linkedin.com/in/speidel/en 

Twitter: @ThomasSpeidel

About.me: http://about.me/Thomas.Speidel



There is a definition for model: “A model is merely your reflection of reality &, like probability, it describes neither you nor the world, but only a relationship between you & that world” Dennis Lindley, from The Philosophy of Statistics, 2000

Great article! It is never enough to repeat that a model exists in order to represent some specific aspects of the real world. Without a clear objective a model is useless. Beyond data science, creating models as a way to grasp complex problems or to explore solutions is the right approach towards building new tools.

Good review. With regard to models, of course it is practically impossible to list all of the existing models, moreover, their number is increasing without rest.
