Underfitting and Overfitting in Machine Learning
When we think about data science, we come to realize there are few truly mind-boggling ideas, just many simple building blocks joined together. A neural network may seem extremely advanced, but it is only a combination of many small ideas. Rather than trying to learn everything at once when we want to develop a model, it is more productive and less frustrating to work through one block at a time. This ensures we have a strong grasp of the basics and avoid many of the common mistakes that hold others up. Besides, each piece opens up new ideas, allowing us to continually build up knowledge until we can create a useful machine learning system and, just as importantly, understand how it works.
What is a Model?
Before talking about underfitting vs. overfitting, we need to talk about models. So what is a model? A model is simply a system for mapping inputs to outputs. For example, if we want to predict house prices, we could build a model that takes in the size of a house and outputs a price. A model represents a theory about a problem: there is some connection between the size of a house and its price, and we build a model to learn that relationship. Models are useful because we can use them to predict the outputs for new data points given their inputs.
A model learns relationships between the inputs, called features, and the outputs, called labels, from a training dataset. During training, the model is given both the features and the labels and learns how to map the former to the latter. A trained model is then evaluated on a testing set, where we give it only the features and it makes predictions. We compare the predictions with the known labels of the testing set to calculate the accuracy. Models can take many shapes, from simple regressions to deep neural networks, but all supervised models are based on the fundamental idea of learning relationships between inputs and outputs from training data.
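As a concrete illustration, here is a minimal sketch of this train-then-test workflow, assuming a made-up house-price dataset (all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical toy dataset: house sizes (square metres) as the feature,
# prices (in thousands) as the label, with some noise added.
rng = np.random.default_rng(0)
sizes = rng.uniform(50, 200, 30)                    # features
prices = 3.0 * sizes + 40 + rng.normal(0, 10, 30)   # labels with noise

# Split into a training set (first 20 examples) and a test set (last 10).
train_x, test_x = sizes[:20], sizes[20:]
train_y, test_y = prices[:20], prices[20:]

# "Training" here is fitting a straight line to the training data.
slope, intercept = np.polyfit(train_x, train_y, deg=1)

# Evaluate on the held-out test set the model has never seen.
predictions = slope * test_x + intercept
mae = np.mean(np.abs(predictions - test_y))
print(f"learned slope = {slope:.2f}, test MAE = {mae:.2f}")
```

The learned slope should land close to the true value of 3, and the test error reflects how well the model generalizes to houses it was not trained on.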
What are Training and Testing Data?
In machine learning, the study and construction of algorithms that can learn from and make predictions on data is a common task. A training dataset is a set of examples used for learning, that is, to fit the parameters of the model (e.g., the coefficient relating house size to price). In any real-world process, whether natural or man-made, the data does not exactly fit the trend: there is always noise, from variables in the relationship we cannot measure.
A test dataset is a dataset that is independent of the training dataset but follows the same probability distribution. A test set is a set of examples used only to assess the performance of a fully specified model.
Now that we know what a model and the training and testing datasets are, let us delve into the bigger problem. Suppose we are designing a machine learning model. A model is said to be a good machine learning model if it generalizes properly to any new input data from the problem domain. This lets us make predictions on future data, data the model has never seen.
Underfitting
A statistical model or machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that the model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases the rules of the model are too simple and too rigid to capture the pattern, and the model will probably make a lot of wrong predictions. Underfitting can be avoided by gathering more data and by using a more expressive model, for example by adding relevant features.
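A minimal sketch of underfitting, assuming a toy quadratic dataset: a straight line cannot capture the curve, so its error stays high no matter how carefully it is fit, while a model of the right shape does much better:

```python
import numpy as np

# Toy non-linear data: a quadratic trend with a little noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.3, 50)

# Fit a straight line (degree 1) and a quadratic (degree 2) to the same data.
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quad_fit = np.polyval(np.polyfit(x, y, deg=2), x)

linear_mse = np.mean((y - linear_fit) ** 2)
quad_mse = np.mean((y - quad_fit) ** 2)
print(f"linear model MSE: {linear_mse:.2f}   quadratic model MSE: {quad_mse:.2f}")
```

The straight line underfits: its mean squared error is dominated by the trend it cannot represent, not by the noise.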
Overfitting
A statistical model is said to be overfitted when a function is fit too closely to a set of data points. It occurs when we build a model that explains the training dataset very closely but fails to generalize when applied to the test dataset. Overfitting is encouraged by high-degree polynomial functions, not having enough data, and having too many features (a high-dimensional input). One way to avoid overfitting is to use a linear algorithm if we have linear data, or to constrain parameters such as the maximal depth if we are using a decision tree.
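A minimal sketch of overfitting, assuming a toy sine dataset (all values invented): a high-degree polynomial drives the training error toward zero while the test error blows up, whereas a modest degree keeps the two comparable:

```python
import numpy as np

# Toy data: a sine trend with noise.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(3 * x) + rng.normal(0, 0.2, 30)

# Interleaved train/test split: every other point is held out.
train_x, test_x = x[::2], x[1::2]
train_y, test_y = y[::2], y[1::2]

results = {}
for degree in (3, 14):
    coeffs = np.polyfit(train_x, train_y, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, train_x) - train_y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, test_x) - test_y) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

With 15 training points, a degree-14 polynomial can pass through all of them, memorizing the noise; its test error on the held-out points is far worse than its training error.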
Examples :
The graph on the left-hand side is underfit and has high bias, whereas the graph on the right-hand side is overfit and has high variance; the graph in the center is just right. The thetas in the graphs are the parameters for the features. The figure above is an example of linear regression with one feature.
The graph on the left-hand side is underfit, whereas the one on the right is overfit. The center graph is just right and has high accuracy. The figure above is an example of a logistic regression model with binary classes.
How to avoid overfitting:
The commonly used methodologies are :
- Cross Validation : A standard way to estimate out-of-sample prediction error is to use 5-fold cross-validation.
- Early stopping : Early-stopping rules give guidance on how many iterations can be run before the learner begins to overfit.
- Pruning : Pruning is used extensively when building tree-based models. It simply removes the nodes that add little predictive power for the problem at hand.
- Regularization : It introduces a cost term in the objective function for bringing in more features. It therefore pushes the coefficients of many variables toward zero, reducing the model's complexity.
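The cross-validation idea above can be sketched as follows, using polynomial degree as the complexity knob on a toy sine dataset (the data, degrees, and helper name are all illustrative):

```python
import numpy as np

# Toy data for model selection by 5-fold cross-validation.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + rng.normal(0, 0.2, 50)

def cv_mse(degree, folds=5):
    """Average out-of-fold MSE for a polynomial fit of the given degree."""
    idx = np.arange(len(x))
    errors = []
    for fold in range(folds):
        test_mask = idx % folds == fold           # every 5th point held out
        coeffs = np.polyfit(x[~test_mask], y[~test_mask], deg=degree)
        pred = np.polyval(coeffs, x[test_mask])
        errors.append(np.mean((pred - y[test_mask]) ** 2))
    return np.mean(errors)

scores = {d: cv_mse(d) for d in (1, 3, 12)}
best = min(scores, key=scores.get)
print("cross-validation MSE per degree:", scores, "-> pick degree", best)
```

Degree 1 underfits and gets a high out-of-fold error, so cross-validation steers us away from it; the score is computed only on points each fold never trained on, which is what makes it an out-of-sample estimate.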
Summary
This article discussed the problems of overfitting and underfitting when building models with machine learning algorithms. In a future article I will cover how to avoid these problems in more detail.
Overfitting can also be reduced by using penalized quality estimators to evaluate your model. When using maximum likelihood estimation, the corrected Akaike Information Criterion (AICc) or the Bayesian Information Criterion (BIC) are valuable tools!
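As a hedged sketch of that suggestion: for Gaussian least-squares fits, AIC can be computed as n·ln(RSS/n) + 2k, and AICc adds the small-sample correction 2k(k+1)/(n−k−1), with lower scores better. The toy data and degrees below are invented for illustration:

```python
import numpy as np

# Truly linear toy data: a complex model can only "improve" on it by
# fitting noise, which the AICc penalty should flag.
rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 25)
y = 1.5 * x + rng.normal(0, 0.3, 25)

def aicc(degree):
    """AICc of a least-squares polynomial fit (Gaussian-error form)."""
    k = degree + 1                       # number of fitted coefficients
    resid = y - np.polyval(np.polyfit(x, y, deg=degree), x)
    rss = np.sum(resid ** 2)
    n = len(y)
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

print(f"AICc degree 1: {aicc(1):.1f}   AICc degree 10: {aicc(10):.1f}")
```

The degree-10 fit reduces the residual sum of squares, but its parameter penalty dominates, so AICc correctly prefers the simple linear model.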
For time series datasets, walk-forward validation may be the better choice. Overfitting is a sword of Damocles.
There is a fundamental flaw in the reasoning about overfitting in this article: having a lot of (training) data does not inherently harm the model and cause overfitting, but excessive model complexity (training the model to an excessive degree) can!