Linear Regression
Whenever one embarks on the journey of Machine Learning and reaches the algorithms stage, Linear Regression is the first algorithm one lands on.
Overview Of Regression:
Linear Regression is used to make statistical predictions about the relationship between independent and dependent variables. The model works on the principle of the 'least squares line', a principle elaborated later in the article. Here we mainly examine two factors:
- Which variables in particular are significant predictors of the outcome variable
- How well the regression line fits, i.e. how accurately it can be used for prediction
Since it is regression, the outcome will be numerical in nature, i.e. quantitative rather than qualitative or categorical.
Going by classical statistics, we can frame this as a hypothesis test. Simple Linear Regression is y = mx + c, where y is the dependent variable and x is the independent variable on which the output y depends; m is the slope and c is the intercept. In practice the model also includes an error term ε for the part of y it cannot predict. Under the Null Hypothesis, x has no impact on y, i.e. x is not significant for predicting y; the Alternative Hypothesis is the opposite. Variations of this equation lead to Linear Regression being subdivided into three types – Simple, Multiple and Polynomial Linear Regression.
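Written out explicitly, a minimal formalisation of the model and the test just described (the error term ε is made explicit here; this is standard notation, not taken from the original figure):

```latex
y = mx + c + \varepsilon, \qquad
H_0: m = 0 \;\;(\text{$x$ does not predict $y$}), \qquad
H_1: m \neq 0 \;\;(\text{$x$ is a significant predictor of $y$})
```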
The equation y = mx + c can be plotted on a graph as shown. The green dots show the actual values and the blue crosses the predicted ones. The dotted red lines represent the error, or residual, between actual and predicted values. The yellow line shows the regression line, sometimes called the least squares line.
There are multiple ways to calculate the errors. The OLS (Ordinary Least Squares) approach chooses the line with the smallest sum of individual squared errors. For simple regression this line can be computed directly in closed form; iterative optimisers such as gradient descent reach the same line by repeatedly updating the model parameters to shift the line from its previous position and reduce the squared error. Either way, the regression line with the least error is the one the model chooses.
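As a concrete illustration, here is a minimal sketch in R on made-up data (the data, seed and variable names are hypothetical, invented for this example); it computes the closed-form OLS slope and intercept by hand and checks them against R's built-in lm():

```r
set.seed(42)
x <- runif(50, 0, 10)               # hypothetical independent variable
y <- 3 + 2 * x + rnorm(50, sd = 2)  # hypothetical dependent variable with noise

# Closed-form OLS estimates: slope m and intercept c
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c <- mean(y) - m * mean(x)

# lm() minimises the same sum of squared errors
fit <- lm(y ~ x)
coef(fit)  # intercept and slope; should match c and m above
```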
Error Calculations:
Elaboration on the types of errors – in the figure below, ŷₙ represents the points on the regression line, yₙ is the actual outcome, and ȳ is the average of all yₙ.
The line with the lowest sum of squared errors (SSE) has to be chosen.
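In this notation, the error decomposition used in the rest of the article is (standard definitions, stated here for reference):

```latex
SSE = \sum_n (y_n - \hat{y}_n)^2, \qquad
SSR = \sum_n (\hat{y}_n - \bar{y})^2, \qquad
SST = \sum_n (y_n - \bar{y})^2 = SSR + SSE
```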
Steps to implement Linear Regression (a minimal R walk-through follows the list):
- Specify the dependent and independent variable(s)
- Check for linearity
- Check Alternative approaches if variables are not linear
- Estimate the Model
- Test the fit of the model using the coefficient of determination (R²)
- Perform a joint hypothesis test on the coefficients
- Perform hypothesis tests on the individual regression coefficients
- Check for variations of the assumptions of regression analysis
- Interpret the results
- Predict the values
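A hedged sketch of these steps in R, using hypothetical data (the data frame, columns and numbers below are invented for illustration):

```r
# Steps 1-2: specify variables and check linearity
df <- data.frame(x = runif(100, 0, 10))
df$y <- 5 + 1.5 * df$x + rnorm(100)   # invented linear relationship
plot(df$x, df$y)                       # eyeball linearity

# Steps 4-7: estimate the model and test it
fit <- lm(y ~ x, data = df)
summary(fit)                           # R^2, F-test, coefficient t-tests

# Step 8: check the regression assumptions via residual plots
par(mfrow = c(2, 2)); plot(fit)

# Step 10: predict new values
predict(fit, newdata = data.frame(x = c(2, 4, 6)))
```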
Summary of a Linear Regression output in R & interpretation:
The summary output of a Linear Regression in R contains the following quantities:
- Std. Error of Regression: represents the average distance that the observed values fall from the regression line; it should be as small as possible. The standard error of a regression, S(y,x), is a measure of variability around the regression line and can be interpreted much like a standard deviation.
- p-value: if the p-value is greater than 0.05, which occurs when the t-statistic is less than about 2 in absolute value, the coefficient may be significant only by accident. To conclude that y depends on the variable, the p-value should be less than 0.05.
- Residual Standard Error: essentially the RMSE adjusted for the model's degrees of freedom (it divides the SSE by the residual degrees of freedom rather than by the number of observations).
- Coefficient of determination (known as R²): used to determine how closely a regression model 'fits' or explains the relationship between the independent variable X and the dependent variable Y. R² varies between 0 and 1, and the closer it is to 1 the better the regression model. As a rough rule of thumb, around 0.8 is often considered good, though the threshold depends on the domain.
- Adjusted R²: the same as R² but adjusted for the complexity of the model, i.e. the number of parameters. Unlike plain R², it can decrease (and even drop below 0) when predictors that add no explanatory power are included. For R² itself, 0 ≤ R² ≤ 1 and R² = SSR/SST = 1 − SSE/SST.
- F-Statistic: used to test whether the model outperforms 'noise' as a predictor. Its p-value compares the full model with an intercept-only model; the smaller this p-value, the stronger the evidence that the model as a whole is useful.
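Each of these quantities can also be pulled out of the summary programmatically; a minimal sketch, assuming `fit` is the lm() model from the earlier example:

```r
s <- summary(fit)
s$coefficients    # estimates, std. errors, t-statistics and p-values
sigma(fit)        # residual standard error
s$r.squared       # R^2, the coefficient of determination
s$adj.r.squared   # adjusted R^2
s$fstatistic      # F-statistic with its degrees of freedom
```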
Assumptions for Linear Regression
1. Linear functional form
2. Independent observations
3. Normality of the residuals (errors)
4. Homoscedasticity (constant variance) of the residuals
5. No multicollinearity
6. No autocorrelation
7. No outlier distortion
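Each assumption can be probed in R. A minimal diagnostic sketch, assuming `fit` is an lm() model (the multicollinearity check needs at least two predictors, and the `lmtest` and `car` packages must be installed):

```r
par(mfrow = c(2, 2)); plot(fit)  # linearity, homoscedasticity, outliers
shapiro.test(residuals(fit))     # normality of residuals
lmtest::bptest(fit)              # Breusch-Pagan test for heteroscedasticity
lmtest::dwtest(fit)              # Durbin-Watson test for autocorrelation
car::vif(fit)                    # variance inflation factors (multicollinearity)
```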
Applications for Linear Regression:
- Economic Growth – Can predict the GDP of a country/state
- Product Price
- Housing Sales
- Match Scores