Linear Regression - Definition, Uses and Implementation
by Merna Matta

Predictive analytics has become a key driver of progress across diverse industries, owing to its power to project future outcomes from historical data and analytical techniques. Statistical models are the foundation of predictive analytics: they can be designed and developed to discover relationships between various behaviors and patterns. For instance, they can help businesses attract, retain, and nurture their valued customers, and they can be used to detect and halt various types of criminal behavior before any serious damage is inflicted. This article presents the most basic and most commonly used model in predictive analytics: linear regression.


Definition

Linear regression (LR) examines the relationship between two variables by fitting a linear equation to observed data. One variable is considered the dependent (output) variable Y, and the other the explanatory (input) variable X. The dependent variable Y is the focus of the question in a study or experiment, while the explanatory variable X is one that explains changes in Y; it can be anything that might affect the response variable Y.


Now, the best-fitting straight (linear) line that describes how an output Y changes as an input X changes is called the regression line.
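As a toy illustration (not part of the movie data discussed later), NumPy's polyfit can recover the slope and intercept of such a regression line from a handful of noisy points:

```python
import numpy as np

# Toy data: Y grows roughly linearly with X (true slope ~2, intercept ~1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# A degree-1 polyfit returns the least-squares slope and intercept.
slope, intercept = np.polyfit(x, y, 1)
print(f"regression line: Y = {slope:.2f}*X + {intercept:.2f}")
```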




Uses 

Overall, linear regression looks into two things:

  1. Is a set of predictor variables X useful in predicting an outcome variable Y? 
  2. Which variables X in particular are significant predictors of the outcome Y, and in what way do they affect it? 

Answering the questions above leads to three major uses for regression analysis: (1) determining the strength of predictors, (2) forecasting an effect, and (3) trend forecasting. First, linear regression can be used to identify the strength of the effect that the independent variable(s) X have on the outcome Y. For instance, how strong is the relationship between sales and marketing spending, age and income, or height and weight? Second, it can be used to forecast the impact of changes; that is, regression analysis helps us understand how much the dependent variable Y changes with a change in one or more independent variables X. A typical question is, “How much of an increase in sales will I gain if I spend $1,000 more on social campaigns and advertising?” Third, regression analysis predicts trends and future values. It can be used to find approximate values, point estimates, and/or confidence intervals. A typical question is, “What will the price of crude oil be in 6 months?”


Implementation


The first step is to import the required libraries and the data set.

The data set, TMDB 5000 Movies, was imported in CSV format, and it has a number of features, including:

  • The genres of the movie
  • The original language and the spoken language of the movie
  • The vote average of the movie
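The loading step, which the original post showed as a screenshot, can be sketched as follows. Since the real Kaggle CSV is not bundled here, a miniature stand-in is parsed from a string; the column names and the filename tmdb_5000_movies.csv are assumptions based on the Kaggle export, and in practice you would call pd.read_csv on the downloaded file.

```python
import io
import pandas as pd

# Miniature stand-in for the TMDB 5000 Movies CSV; in practice, use
# pd.read_csv("tmdb_5000_movies.csv") on the file downloaded from Kaggle.
csv_text = """title,genres,original_language,vote_average,budget
Avatar,Action,en,7.2,237000000
Spectre,Action,en,6.3,245000000
Frozen,Animation,en,7.3,150000000
"""
movies = pd.read_csv(io.StringIO(csv_text))
print(movies.shape)  # (3, 5)
```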

Understanding the data set


  • Using the info() method gives very notable information about the data set. The Movies data set includes 4803 entries. "homepage" has a high number of missing values, which makes it a less valuable input for the machine learning model, and "tagline" is about 25% missing. 13 of the 20 columns hold object (non-numeric) values.
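This inspection step can be sketched on a tiny stand-in frame (the real call is simply info() on the loaded data set; the rows below are made up to mimic the mostly-missing "homepage" column):

```python
import pandas as pd

# Tiny stand-in frame: 'homepage' is mostly missing, as in the real data.
movies = pd.DataFrame({
    "title": ["Avatar", "Spectre", "Frozen", "Up"],
    "homepage": [None, None, None, "https://example.com"],
    "vote_average": [7.2, 6.3, 7.3, 7.9],
})
movies.info()  # prints dtypes and non-null counts per column

# Fraction of missing values per column -- useful for spotting weak features.
missing_share = movies.isna().mean()
print(missing_share["homepage"])  # 0.75
```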


  • Another useful way to learn about this data set is to generate histograms of the numerical data.

It is clear that "vote_average" has a bell-shaped histogram, which indicates a normal distribution: a probability distribution symmetric around the mean, where data near the mean occur more frequently than data far from the mean.
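The histogram step can be reproduced with pandas' DataFrame.hist(); here synthetic columns stand in for the real ones (the normally distributed "vote_average" mimics the bell shape described above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a bell-shaped 'vote_average', a skewed 'budget'.
movies = pd.DataFrame({
    "vote_average": rng.normal(loc=6.0, scale=1.0, size=4803),
    "budget": rng.exponential(scale=3e7, size=4803),
})
axes = movies.hist(bins=30)  # one histogram per numeric column
print(len(axes.ravel()))     # 2
```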


Building a Machine Learning Linear Regression Model

Notice: before we start building the model, we must ensure the data set has been cleaned, prepared, and feature-selected. For TMDB 5000 Movies, more preparation and data selection has to be done before reaching the building stage, which can be found in the attached link.


1. Splitting our data set into training data and test data

The first thing we need to do is decide which columns will be used as features to make predictions and which column is the target we are trying to predict (remember, our total data still has the features and target appended). Afterwards, we split our data into training data and test data using scikit-learn. To do this, we import the train_test_split function from scikit-learn's model_selection module.
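A minimal sketch of the split (the feature matrix and target here are placeholders; test_size=0.3 and random_state=42 are assumptions, not necessarily the values the original post used):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features/target -- substitute the prepared movie features
# and the column you are predicting (e.g. vote_average).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the rows for testing; fixing random_state makes it repeatable.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train2.shape, X_test2.shape)  # (7, 2) (3, 2)
```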


2. Building and training the model


After importing the LinearRegression estimator from sklearn.linear_model, we create a LinearRegression object, assigned to "regvar" in this tutorial. Then we train the model using scikit-learn's fit method: regvar.fit(X_train2, y_train2).

You can examine the model coefficients using print(regvar.coef_). In the same way, you can see the intercept of the regression equation using print(regvar.intercept_).
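A self-contained sketch of this step, using toy training data generated from a known relationship (y = 3·x1 + 2·x2 + 1, no noise) so the recovered coefficients and intercept are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data generated from y = 3*x1 + 2*x2 + 1 (no noise).
X_train2 = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 3.]])
y_train2 = 3 * X_train2[:, 0] + 2 * X_train2[:, 1] + 1

regvar = LinearRegression()
regvar.fit(X_train2, y_train2)

print(regvar.coef_)       # ~[3. 2.]
print(regvar.intercept_)  # ~1.0
```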


3. Making prediction using our model

Now it is time to use our model for prediction. First, we call the predict method on the "regvar" model we created earlier; at this point "y_pred_regvar" holds the predicted values for the features stored in X_test. Second, we compare the predictions in the "y_pred_regvar" array with the real values held in y_test. An easy way to do this is to plot the two arrays in a scatterplot.
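These two steps can be sketched as follows; the toy data is noiseless, so the predicted and actual values land exactly on the scatterplot's diagonal (the data and axis labels are illustrative, not from the movie data set):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless toy data: y = 2*x, so predictions should be exact.
X_train2 = np.array([[1.], [2.], [3.], [4.]])
y_train2 = np.array([2., 4., 6., 8.])
X_test = np.array([[5.], [6.]])
y_test = np.array([10., 12.])

regvar = LinearRegression().fit(X_train2, y_train2)
y_pred_regvar = regvar.predict(X_test)

# Predicted vs. actual: points on the diagonal mean perfect predictions.
plt.scatter(y_test, y_pred_regvar)
plt.xlabel("actual")
plt.ylabel("predicted")
print(y_pred_regvar)  # ~[10. 12.]
```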


4. Finally, testing the model performance

There are three main performance metrics used for evaluating regression machine learning models:

  • Mean absolute error (MAE)

via metrics.mean_absolute_error(y_test, y_pred_regvar)

  • Mean squared error (MSE)

via metrics.mean_squared_error(y_test, y_pred_regvar)

  • Root mean squared error (RMSE)

via np.sqrt(metrics.mean_squared_error(y_test, y_pred_regvar))
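The three metrics can be computed on a small hypothetical pair of actual/predicted arrays standing in for y_test and y_pred_regvar (note that RMSE is simply the square root of MSE, hence the np.sqrt):

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual vs. predicted values standing in for y_test / y_pred_regvar.
y_test = np.array([3.0, 5.0, 7.0])
y_pred_regvar = np.array([2.5, 5.0, 8.0])

mae = metrics.mean_absolute_error(y_test, y_pred_regvar)   # mean of |error|
mse = metrics.mean_squared_error(y_test, y_pred_regvar)    # mean of error^2
rmse = np.sqrt(mse)                                        # square root of MSE
print(round(mae, 4), round(mse, 4), round(rmse, 4))  # 0.5 0.4167 0.6455
```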

As a final word, linear regression comes in versions such as simple LR and multiple LR, which may require additional techniques that expand the implementation process. In the end, it depends on the data set you are analyzing and the answers you are striving to draw from your data.
