Linear Regression in Machine Learning

Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task.

Regression models a target prediction value based on independent variables.

It is mostly used for finding out the relationship between variables and for forecasting. Regression models differ based on the kind of relationship they assume between the dependent and independent variables and on the number of independent variables being used.

[Figure: scatter plot of salary vs. work experience with the best-fit regression line]

Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x). So, this regression technique finds a linear relationship between x (input) and y (output). Hence the name Linear Regression.

In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best fit line for our model.

Hypothesis function: y = θ1 + θ2·x

While training the model, we are given:

x: input training data (univariate – one input variable(parameter))

y: labels to data (supervised learning)

When training the model – it fits the best line to predict the value of y for a given value of x. The model gets the best regression fit line by finding the best θ1 and θ2 values.

θ1: intercept

θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally using our model for prediction, it will predict the value of y for the input value of x.
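As a minimal sketch (the θ values below are hypothetical, purely for illustration), prediction is just evaluating the fitted line:

```python
# Minimal prediction sketch: once theta1 (intercept) and theta2 (slope) are learned,
# predicting y for a new x is just evaluating the line.
def predict(x, theta1, theta2):
    return theta1 + theta2 * x

# Hypothetical learned values: base salary 25000, +9500 per year of experience
print(predict(5.0, theta1=25000.0, theta2=9500.0))
```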

Cost Function (J):

By achieving the best-fit regression line, the model aims to predict the y value such that the error difference between the predicted value and the true value is minimal. So, it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).

Minimization objective: minimize J

Cost: J = √((1/n) Σᵢ (predᵢ − yᵢ)²)

The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y).
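A small sketch of this cost function (assuming `y_true` and `y_pred` are arrays of the same length):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Squared Error between true and predicted values
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

print(rmse([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))
```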

Ordinary Least Squares (OLS):

Ordinary Least Squares regression (OLS) is more commonly named linear regression (simple or multiple depending on the number of explanatory variables).

The OLS method corresponds to minimizing the sum of squared differences between the observed and predicted values.

OLS objective: minimize Σᵢ (yᵢ − ŷᵢ)²
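As an illustrative sketch (one of several ways to solve the OLS problem, on hypothetical toy data), the coefficients can be obtained from the normal equation:

```python
import numpy as np

# Toy data (hypothetical): y is roughly 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [intercept, slope], close to [2, 3]
```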

Linear Regression using the scikit-learn library:

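Since the original snippet appeared only as an image, here is a minimal scikit-learn sketch along the same lines (the data, variable names and train/test split are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical univariate data: salary (in thousands) vs. years of experience
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([30, 35, 42, 48, 55, 61, 68, 75], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(model.intercept_, model.coef_)  # theta1 (intercept) and theta2 (coefficient of x)
print(model.score(X_test, y_test))    # 'score' is the R-squared value
```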

In the above code, 'score' is the R-squared value; we can also compute R-squared using sklearn.metrics.

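For example (a tiny sketch with hypothetical true and predicted values; it reports the same quantity as `model.score()`):

```python
from sklearn.metrics import r2_score

y_test = [48.0, 61.0]   # hypothetical true values
y_pred = [47.2, 62.1]   # hypothetical predictions
print(r2_score(y_test, y_pred))
```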

Assumptions for linear regression:

  1. A linear relationship between the independent variables and the dependent variable.
  2. Low or no multicollinearity between the independent variables.
  3. All variables are multivariate normal (this assumption can best be checked with a histogram or a Q-Q plot).
  4. No autocorrelation of the residuals.
  5. Homoscedasticity (the distribution of residuals is the same across all predicted values).

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables. A scatter plot of residual values vs predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution and if there is a specific pattern, the data is heteroscedastic.

[Figure: three residuals-vs-predicted scatter plots, from homoscedastic (left) to heteroscedastic (middle and right)]

The leftmost graph shows no definite pattern, i.e. constant variance among the residuals. The middle graph shows a specific pattern where the error increases and then decreases with the predicted values, violating the constant-variance rule. The rightmost graph also exhibits a specific pattern, where the error decreases with the predicted values, depicting heteroscedasticity.
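A quick way to produce this check (a sketch on hypothetical data, assuming matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical data with roughly constant noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1.0, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals vs. predicted values: no clear pattern suggests homoscedasticity
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```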

How to choose the best regression model:

Adjusted R-squared and Predicted R-squared: Generally, you choose the models that have higher adjusted and predicted R-squared values. These statistics are designed to avoid a key problem with regular R-squared - it increases every time you add a predictor and can trick you into specifying an overly complex model.

  • The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it can decrease with poor-quality predictors.
  • The predicted R-squared is a form of cross-validation and it can also decrease. Cross-validation determines how well your model generalizes to other data sets by partitioning your data (a small sketch follows below).
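As a rough sketch on hypothetical data (n observations, p predictors), adjusted R-squared can be computed from ordinary R-squared, and cross-validated R-squared stands in for the "predicted R-squared" idea here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: n = 50 observations, p = 3 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, 0.5, -1.0]) + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
n, p = X.shape

# Adjusted R-squared penalizes every extra predictor
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)

# Cross-validated R-squared: how well the model generalizes to held-out folds
print(cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())
```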

P-values for the predictors: In regression, low p-values indicate terms that are statistically significant. “Reducing the model” refers to the practice of including all candidate predictors in the model, and then systematically removing the term with the highest p-value one-by-one until you are left with only significant predictors.

The null hypothesis for the OLS model is H0: the variables are not correlated (there is no association between the two variables).
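One common way to inspect these p-values is the OLS summary from statsmodels (a sketch on hypothetical data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: only the first of two candidate predictors truly matters
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] + rng.normal(0, 1.0, size=100)

X_const = sm.add_constant(X)       # add the intercept term
model = sm.OLS(y, X_const).fit()

print(model.pvalues)               # a low p-value means the term is statistically significant
# "Reducing the model": drop the predictor with the highest p-value and refit, repeatedly.
```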

Gradient Descent:

The loss is the error in our predictions that results from the current values of m and c. Our goal is to minimize this error to obtain the most accurate values of m and c.

We will use the Mean Squared Error function to calculate the loss. There are three steps in this function:

  1. Find the difference between the actual y and predicted y value(y = mx + c), for a given x.
  2. Square this difference.
  3. Find the mean of the squares for every value in X.
E = (1/n) Σᵢ₌₁ⁿ (yᵢ − ȳᵢ)²

Here yᵢ is the actual value and ȳᵢ is the predicted value. Let's substitute the value of ȳᵢ:

E = (1/n) Σᵢ₌₁ⁿ (yᵢ − (m·xᵢ + c))²

So we square the error and find the mean, hence the name Mean Squared Error. Now that we have defined the loss function, let's get into the interesting part: minimizing it and finding m and c.
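The same loss in code (a minimal sketch, with ȳᵢ written as m·x + c as above):

```python
import numpy as np

def loss(x, y, m, c):
    # Mean Squared Error with the line m*x + c substituted for the prediction
    y_hat = m * x + c
    return np.mean((y - y_hat) ** 2)

print(loss(np.array([1.0, 2.0]), np.array([3.1, 4.9]), m=2.0, c=1.0))
```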

Gradient descent is an iterative optimization algorithm to find the minimum of a function. Here that function is our Loss Function.

Understanding Gradient Descent

[Figure: gradient descent illustrated as descending a valley to its lowest point]

Imagine a valley and a person with no sense of direction who wants to get to the bottom of the valley. He goes down the slope and takes large steps when the slope is steep and small steps when the slope is less steep. He decides his next position based on his current position and stops when he gets to the bottom of the valley which was his goal.

Let’s try applying gradient descent to m and c and approach it step by step:

  1. Initially let m = 0 and c = 0. Let L be our learning rate. This controls how much the value of m changes with each step. L could be a small value like 0.0001 for good accuracy.
  2. Calculate the partial derivative of the loss function with respect to m, and plug in the current values of x, y, m and c in it to obtain the derivative value D.
Dₘ = ∂E/∂m = (−2/n) Σᵢ₌₁ⁿ xᵢ·(yᵢ − ȳᵢ)

Dₘ is the value of the partial derivative with respect to m. Similarly, let's find the partial derivative with respect to c, Dc :

Dc = ∂E/∂c = (−2/n) Σᵢ₌₁ⁿ (yᵢ − ȳᵢ)

3. Now we update the current value of m and c using the following equation:

m = m − L·Dₘ
c = c − L·Dc

4. We repeat this process until our loss function is a very small value or ideally 0 (which means 0 error or 100% accuracy). The value of m and c that we are left with now will be the optimum values.

Now going back to our analogy, m can be considered the current position of the person. D is equivalent to the steepness of the slope and L can be the speed with which he moves. Now the new value of m that we calculate using the above equation will be his next position, and L×D will be the size of the steps he will take. When the slope is steeper (D is more) he takes longer steps and when it is less steep (D is less), he takes smaller steps. Finally, he arrives at the bottom of the valley which corresponds to our loss = 0.

Now with the optimum value of 'm' and 'c' our model is ready to make predictions!
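Putting the steps together, here is a minimal gradient-descent sketch (the data, learning rate and number of iterations are hypothetical choices):

```python
import numpy as np

# Hypothetical univariate data roughly following y = 3x + 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.2, 9.9, 13.1, 16.0, 19.1])

m, c = 0.0, 0.0      # step 1: initialise
L = 0.01             # learning rate
n = len(x)

for _ in range(10000):
    y_hat = m * x + c
    D_m = (-2.0 / n) * np.sum(x * (y - y_hat))   # step 2: partial derivative w.r.t. m
    D_c = (-2.0 / n) * np.sum(y - y_hat)         #         partial derivative w.r.t. c
    m = m - L * D_m                              # step 3: update m
    c = c - L * D_c                              #         update c

print(m, c)          # step 4: after enough iterations, close to the optimum
```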

If the learning rate is kept constant and is too large, we can encounter oscillation problems (the update will keep jumping from one side of the minimum to the other).

Stochastic Gradient Descent:

When the data size is large, gradient descent takes a lot of time to calculate the optimized values.

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

While in GD you have to run through ALL the samples in your training set to do a single update of a parameter in a particular iteration, in SGD you use ONLY ONE training sample, or a SUBSET of training samples, to do the update in a particular iteration. If you use a SUBSET, it is called Mini-batch Stochastic Gradient Descent.

Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long because in every iteration when you are updating the values of the parameters, you are running through the complete training set. On the other hand, using SGD will be faster because you use only one training sample and it starts improving itself right away from the first sample.

SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, the close approximation that you get in SGD for the parameter values is enough, because the parameters reach the neighbourhood of the optimal values and keep oscillating there.

We can implement SGD regression using scikit-learn.

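Since the original snippet was an image, here is a comparable sketch using scikit-learn's SGDRegressor (the data, scaling step and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: y is roughly 4 + 3x
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(1000, 1))
y = 4.0 + 3.0 * X[:, 0] + rng.normal(0, 1.0, size=1000)

# SGD is sensitive to feature scale, so standardise the input first
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
model.fit(X, y)

print(model.predict([[5.0]]))  # prediction for a new point
```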

Real-World Cases:

Impact of Imbalanced Data:

Linear regression is not a classifier; its output is a real value, so imbalanced data does not have the impact on linear regression that it has on classification models.

Feature Engineering/Feature Transformation:

Transform the features in such a way that the relationship with the target becomes (approximately) linear, for example with log or polynomial transformations.
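For example (a sketch assuming a quadratic relationship in the raw feature), PolynomialFeatures can make the relationship linear in the transformed features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with a quadratic relationship
rng = np.random.default_rng(4)
X = rng.uniform(0, 5, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# Transform x into [x, x^2] so that a linear model can capture the relationship
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, y)
print(model.score(X, y))
```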

Feature Importance:

We can get feature importance from L1 regularization (Lasso), but we have to ensure that there is no multicollinearity.
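A small sketch (hypothetical data, with features standardised so the coefficient magnitudes are comparable):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical data: only the first two of four features matter
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

X_std = StandardScaler().fit_transform(X)     # put features on a comparable scale
lasso = Lasso(alpha=0.1).fit(X_std, y)
print(lasso.coef_)   # coefficients driven to (near) zero indicate unimportant features
```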

Impact of Missing Values:

Linear Regression is impacted by missing values; they need to be handled first, for example with pandas or an imputer model.
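For example (a sketch using scikit-learn's SimpleImputer; mean imputation is just one option):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # replace each NaN with its column mean
print(imputer.fit_transform(X))
```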

Impact of Outliers:

Logistic Regression is not heavily impacted by outliers, as the sigmoid function limits their influence.

Linear Regression is impacted by outliers: the squared error becomes very large for them. We need to train the model, find the weights W, and find the points that lie far away using the residual y − ŷ.

Remove those points which have a high residual value (y − ŷ).
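A rough sketch of that idea (hypothetical data; the residual threshold is an assumption and should be chosen more carefully in practice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a few injected outliers
rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1.0, size=100)
y[:3] += 40.0

model = LinearRegression().fit(X, y)
residuals = np.abs(y - model.predict(X))

# Keep only points whose residual is within a (hypothetical) threshold, then refit
mask = residuals < 3.0 * residuals.std()
model_clean = LinearRegression().fit(X[mask], y[mask])
print(model.coef_, model_clean.coef_)
```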

Time and Space Complexity:

Time Complexity: p - number of features, n - number of datapoints

Train time: O(n·p² + p³) when solving the normal equations; roughly O(n·p) per pass when trained with gradient descent

Test time: O(p)

Space Complexity:

Train and Test: O(p)

Interview Questions:

1. Linear Regression is a supervised machine learning algorithm? Ans: Yes

2. Linear Regression is mainly used for Regression? Ans: Yes

3. It is possible to design a Linear regression algorithm using a neural network? Ans: Yes

4. Which of the following methods do we use to find the best fit line for data in Linear Regression? Ans: A

A) Least Square Error

B) Maximum Likelihood

C) Logarithmic Loss

D) Both A and B

5. Which of the following evaluation metrics can be used to evaluate a model while modeling a continuous output variable? Ans: D

A) AUC-ROC

B) Accuracy

C) Log loss

D) Mean-Squared-Error

Since linear regression gives continuous output values, we use the mean squared error metric to evaluate model performance. The remaining options are used for classification problems.

6. Lasso Regularization can be used for variable selection in Linear Regression? Ans: A

A) TRUE

B) FALSE

True. In the case of lasso regression we apply an absolute (L1) penalty, which makes some of the coefficients exactly zero.

7. Which of the following is true about Residuals? Ans: A

A) Lower is better

B) Higher is better

C) A or B depending on the situation

D) None of these

Residuals refer to the error values of the model. Therefore lower residuals are desired.

8. Suppose that we have N independent variables (X1, X2, ..., Xn) and the dependent variable is Y. Now imagine that you are applying linear regression by fitting the best-fit line using least square error on this data. Ans: B

You found that the correlation coefficient for one of its variables (Say X1) with Y is -0.95.

Which of the following is true for X1?

A) The relation between the X1 and Y is weak

B) Relation between the X1 and Y is strong

C) Relation between the X1 and Y is neutral

D) Correlation can’t judge the relationship

The absolute value of the correlation coefficient denotes the strength of the relationship. Since absolute correlation is very high it means that the relationship is strong between X1 and Y.

9. You are given two variables V1 and V2 that follow the two characteristics below. Which of the following options is correct for the Pearson correlation between V1 and V2? Ans: D

1. If V1 increases then V2 also increases

2. If V1 decreases then the behavior of V2 is unknown

A) Pearson correlation will be close to 1

B) Pearson correlation will be close to -1

C) Pearson correlation will be close to 0

D) None of these

We cannot comment on the correlation coefficient by using only statement 1. We need to consider both of these two statements. Consider V1 as x and V2 as |x|. The correlation coefficient would not be close to 1 in such a case.

10. Suppose the Pearson correlation between V1 and V2 is zero. In such a case, is it right to conclude that V1 and V2 do not have any relation between them? Ans: False

The Pearson correlation coefficient between two variables can be zero even when they have a relationship between them. If the correlation coefficient is zero, it just means that there is no linear relationship between them. We can take examples like y = |x| or y = x².

11. Which of the following offsets do we use in linear regression's least-square line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable. Ans: A

[Figure: vertical vs. perpendicular offsets from a fitted line]

A) Vertical offset

B) Perpendicular offset

C) Both, depending on the situation

D) None of the above

We always consider residuals as vertical offsets. We calculate the direct differences between the actual value and the Y labels. Perpendicular offset is useful in the case of PCA.

Question 12 - 14:

Suppose you have fitted a complex regression model on a dataset. Now, you are using Ridge regression with penalty x.

12. Choose the option which describes bias in the best manner? Ans: B

A) In case of very large x; bias is low

B) In case of very large x; bias is high

C) We can’t say about bias

D) None of these

If the penalty is very large it means the model is less complex, therefore the bias would be high.

13. What will happen when you apply a very large penalty? Ans: B

A) Some of the coefficients will become absolute zero

B) Some of the coefficients will approach zero but not become absolute zero

C) Both A and B depending on the situation

D) None of these

In lasso, some of the coefficient values become exactly zero, but in the case of Ridge, the coefficients only become close to zero, not zero.

14. What will happen when you apply a very large penalty in the case of Lasso? Ans: A

A) Some of the coefficients will become zero

B) Some of the coefficients will approach zero but not become absolute zero

C) Both A and B depending on the situation

D) None of these

As already discussed, lasso applies an absolute penalty, so some of the coefficients will become zero.

15. What will happen when you fit degree 4 polynomial in linear regression? Ans: A

A) There are high chances that degree 4 polynomial will overfit the data

B) There are high chances that degree 4 polynomial will underfit the data

C) Can’t say

D) None of these

Since a degree 4 polynomial is more complex than the degree 3 model that already fits the data, it will also fit the training data perfectly, i.e. overfit it. In such a case, the training error will be zero but the test error may not be zero.

16. What will happen when you fit degree 2 polynomial in linear regression? Ans: B

A) There are high chances that the degree 2 polynomial will overfit the data

B) There are high chances that the degree 2 polynomial will underfit the data

C) Can’t say

D) None of these

If a degree 3 polynomial fits the data perfectly, it is highly likely that a simpler model (a degree 2 polynomial) will underfit the data.

17. In terms of bias and variance. Which of the following is true when you fit degree 2 polynomial? Ans: C

A) Bias will be high, variance will be high

B) Bias will be low, variance will be high

C) Bias will be high, variance will be low

D) Bias will be low, variance will be low

Since a degree 2 polynomial is less complex than degree 3, the bias will be high and the variance will be low.

18. What do you expect will happen with bias and variance as you increase the size of training data? Ans: D

 A) Bias increases and Variance increases

B) Bias decreases and Variance increases

C) Bias decreases and Variance decreases

D) Bias increases and Variance decreases

E) Can’t Say

As we increase the size of the training data, the bias would increase while the variance would decrease.
