Linear Regression
Whenever one embarks on the journey of Machine Learning and reaches the algorithms stage, Linear Regression is the first algorithm one lands on.
Overview Of Regression:
Linear Regression is used to make statistical predictions about the relationship between independent and dependent variables. The model works on the principle of the 'least squares line', a principle elaborated later in the article. Here we mainly examine two factors:
- Which variables in particular are significant predictors of the outcome variable
- How well the regression line fits, i.e. how accurately it can be used for prediction
Since it is regression, the outcome will be numerical in nature, i.e. quantitative rather than qualitative or categorical.
Going by classical statistics, we can frame this as a hypothesis test. Simple Linear Regression is y = mx + c, where y is the dependent variable and x is the independent variable on which the output y depends; m is the slope and c is the intercept. In practice the model also includes an error term ε for the part of y it cannot predict. Under the Null Hypothesis, x has no impact on y, i.e. x is not significant for predicting y; the Alternative Hypothesis is the opposite. Variations of this equation lead to Linear Regression being subdivided into three types – Simple, Multiple and Polynomial Linear Regression.
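Written out explicitly, a minimal formalisation of the model and the test just described (the error term ε is made explicit here; this is standard notation, not taken from the original figure):

```latex
y = mx + c + \varepsilon, \qquad
H_0: m = 0 \;\;(\text{$x$ does not predict $y$}), \qquad
H_1: m \neq 0 \;\;(\text{$x$ is a significant predictor of $y$})
```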
The equation y = mx + c can be plotted on a graph as shown. The green dots show the actual values and the blue crosses the predicted ones. The dotted red lines represent the error, or residual, between actual and predicted values. The yellow line shows the regression line, sometimes called the least squares line.
There are multiple ways to calculate the errors. The OLS (Ordinary Least Squares) approach chooses the line with the smallest sum of individual squared errors. For simple regression this line can be computed directly in closed form; iterative optimisers such as gradient descent reach the same line by repeatedly updating the model parameters to shift the line from its previous position and reduce the squared error. Either way, the regression line with the least error is the one the model chooses.
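As a concrete illustration, here is a minimal sketch in R on made-up data (the data, seed and variable names are hypothetical, invented for this example); it computes the closed-form OLS slope and intercept by hand and checks them against R's built-in lm():

```r
set.seed(42)
x <- runif(50, 0, 10)               # hypothetical independent variable
y <- 3 + 2 * x + rnorm(50, sd = 2)  # hypothetical dependent variable with noise

# Closed-form OLS estimates: slope m and intercept c
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c <- mean(y) - m * mean(x)

# lm() minimises the same sum of squared errors
fit <- lm(y ~ x)
coef(fit)  # intercept and slope; should match c and m above
```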
Error Calculations:
Elaboration on the types of errors – in the figure below, ŷₙ represents the points on the regression line, yₙ is the actual outcome, and ȳ is the average of all yₙ.
The line with the lowest sum of squared errors (SSE) has to be chosen.
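In this notation, the error decomposition used in the rest of the article is (standard definitions, stated here for reference):

```latex
SSE = \sum_n (y_n - \hat{y}_n)^2, \qquad
SSR = \sum_n (\hat{y}_n - \bar{y})^2, \qquad
SST = \sum_n (y_n - \bar{y})^2 = SSR + SSE
```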
Steps to implement Linear Regression (a minimal R walk-through follows the list):
- Specify the dependent and independent variable(s)
- Check for linearity
- Check Alternative approaches if variables are not linear
- Estimate the Model
- Test the fit of the model using the coefficient of determination (R²)
- Perform a joint hypothesis test on the coefficients
- Perform hypothesis tests on the individual regression coefficients
- Check for variations of the assumptions of regression analysis
- Interpret the results
- Predict the values
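A hedged sketch of these steps in R, using hypothetical data (the data frame, columns and numbers below are invented for illustration):

```r
# Steps 1-2: specify variables and check linearity
df <- data.frame(x = runif(100, 0, 10))
df$y <- 5 + 1.5 * df$x + rnorm(100)   # invented linear relationship
plot(df$x, df$y)                       # eyeball linearity

# Steps 4-7: estimate the model and test it
fit <- lm(y ~ x, data = df)
summary(fit)                           # R^2, F-test, coefficient t-tests

# Step 8: check the regression assumptions via residual plots
par(mfrow = c(2, 2)); plot(fit)

# Step 10: predict new values
predict(fit, newdata = data.frame(x = c(2, 4, 6)))
```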
Summary of a Linear Regression output in R & interpretation:
The summary output of a Linear Regression in R contains the following quantities:
- Std. Error of Regression: represents the average distance that the observed values fall from the regression line; it should be as small as possible. The standard error of a regression, S(y,x), is a measure of variability around the regression line and can be interpreted much like a standard deviation.
- p-value: if the p-value is greater than 0.05, which occurs when the t-statistic is less than about 2 in absolute value, the coefficient may be significant only by accident. To conclude that y depends on the variable, the p-value should be less than 0.05.
- Residual Standard Error: essentially the RMSE adjusted for the model's degrees of freedom (it divides the SSE by the residual degrees of freedom rather than by the number of observations).
- Coefficient of determination (known as R²): used to determine how closely a regression model 'fits' or explains the relationship between the independent variable X and the dependent variable Y. R² varies between 0 and 1, and the closer it is to 1 the better the regression model. As a rough rule of thumb, around 0.8 is often considered good, though the threshold depends on the domain.
- Adjusted R²: the same as R² but adjusted for the complexity of the model, i.e. the number of parameters. Unlike plain R², it can decrease (and even drop below 0) when predictors that add no explanatory power are included. For R² itself, 0 ≤ R² ≤ 1 and R² = SSR/SST = 1 − SSE/SST.
- F-Statistic: used to test whether the model outperforms 'noise' as a predictor. Its p-value compares the full model with an intercept-only model; the smaller this p-value, the stronger the evidence that the model as a whole is useful.
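Each of these quantities can also be pulled out of the summary programmatically; a minimal sketch, assuming `fit` is the lm() model from the earlier example:

```r
s <- summary(fit)
s$coefficients    # estimates, std. errors, t-statistics and p-values
sigma(fit)        # residual standard error
s$r.squared       # R^2, the coefficient of determination
s$adj.r.squared   # adjusted R^2
s$fstatistic      # F-statistic with its degrees of freedom
```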
Assumptions for Linear Regression
1. Linear functional form
2. Independent observations
3. Normality of the residuals (errors)
4. Homoscedasticity (constant variance) of the residuals
5. No multicollinearity
6. No autocorrelation
7. No outlier distortion
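Each assumption can be probed in R. A minimal diagnostic sketch, assuming `fit` is an lm() model (the multicollinearity check needs at least two predictors, and the `lmtest` and `car` packages must be installed):

```r
par(mfrow = c(2, 2)); plot(fit)  # linearity, homoscedasticity, outliers
shapiro.test(residuals(fit))     # normality of residuals
lmtest::bptest(fit)              # Breusch-Pagan test for heteroscedasticity
lmtest::dwtest(fit)              # Durbin-Watson test for autocorrelation
car::vif(fit)                    # variance inflation factors (multicollinearity)
```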
Applications for Linear Regression:
- Economic Growth – Can predict the GDP of a country/state
- Product Price
- Housing Sales
- Match Scores