Linear Regression - Definition, Uses and Implementation
by Merna Matta

Predictive analytics has become a key driver of progress across diverse industries, owing to its power to project future outcomes from historical data and analytical techniques. Statistical models are the foundation of predictive analytics: they can be designed and developed to discover relationships between various behaviors and patterns. For instance, they can help businesses attract, retain, and nurture their valued customers, and they can be used to detect and halt various types of criminal behavior before any serious damage is inflicted. This article presents the most basic and most commonly used model in predictive analytics: linear regression.


Definition

Linear regression (LR) examines the relationship between two variables by fitting a linear equation to observed data. One variable is considered the dependent (output) variable Y, and the other the explanatory (input) variable X. The dependent variable Y is the focus of the question in a study or experiment, while the explanatory variable X is one that explains changes in Y; it can be anything that might affect the response variable Y.


Now, the best-fitting straight (linear) line that describes how an output Y changes as an input X changes is called the regression line.
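As a toy illustration (not part of the movie data discussed later), NumPy's polyfit can recover the slope and intercept of such a regression line from a handful of noisy points:

```python
import numpy as np

# Toy data: Y grows roughly linearly with X (true slope ~2, intercept ~1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# A degree-1 polyfit returns the least-squares slope and intercept.
slope, intercept = np.polyfit(x, y, 1)
print(f"regression line: Y = {slope:.2f}*X + {intercept:.2f}")
```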




Uses 

Overall, linear regression looks into two things:

  1. Is a set of predictor variables X useful in predicting an outcome variable Y? 
  2. Which variables X in particular are significant predictors of the outcome Y, and in what way do they affect it? 

Answering the questions above leads to three major uses for regression analysis: (1) determining the strength of predictors, (2) forecasting an effect, and (3) trend forecasting. First, linear regression can be used to identify the strength of the effect that the independent variable(s) X have on the outcome Y. For instance, how strong is the relationship between sales and marketing spending, age and income, or height and weight? Second, it can be used to forecast the impact of changes; that is, regression analysis helps us understand how much the dependent variable Y changes with a change in one or more independent variables X. A typical question is, “How much of an increase in sales will I gain if I spend $1,000 more on social campaigns and advertising?” Third, regression analysis predicts trends and future values. It can be used to find approximate values, point estimates, and/or confidence intervals. A typical question is, “What will the price of crude oil be in 6 months?”


Implementation


The first step is to import the required libraries and the data set.

The data set, TMDB 5000 Movies, was imported in CSV format, and it has a number of features, including:

  • The genres of the movie
  • The original language and the spoken language of the movie
  • The vote average of the movie
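The loading step, which the original post showed as a screenshot, can be sketched as follows. Since the real Kaggle CSV is not bundled here, a miniature stand-in is parsed from a string; the column names and the filename tmdb_5000_movies.csv are assumptions based on the Kaggle export, and in practice you would call pd.read_csv on the downloaded file.

```python
import io
import pandas as pd

# Miniature stand-in for the TMDB 5000 Movies CSV; in practice, use
# pd.read_csv("tmdb_5000_movies.csv") on the file downloaded from Kaggle.
csv_text = """title,genres,original_language,vote_average,budget
Avatar,Action,en,7.2,237000000
Spectre,Action,en,6.3,245000000
Frozen,Animation,en,7.3,150000000
"""
movies = pd.read_csv(io.StringIO(csv_text))
print(movies.shape)  # (3, 5)
```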

Understanding the data set


  • Using the info() method gives very notable information about the data set. The Movies data set includes 4803 entries. "homepage" has a high number of missing values, which makes it a less valuable input for the machine learning model, and "tagline" is about 25% missing. 13 of the 20 columns hold object (non-numeric) values.
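This inspection step can be sketched on a tiny stand-in frame (the real call is simply info() on the loaded data set; the rows below are made up to mimic the mostly-missing "homepage" column):

```python
import pandas as pd

# Tiny stand-in frame: 'homepage' is mostly missing, as in the real data.
movies = pd.DataFrame({
    "title": ["Avatar", "Spectre", "Frozen", "Up"],
    "homepage": [None, None, None, "https://example.com"],
    "vote_average": [7.2, 6.3, 7.3, 7.9],
})
movies.info()  # prints dtypes and non-null counts per column

# Fraction of missing values per column -- useful for spotting weak features.
missing_share = movies.isna().mean()
print(missing_share["homepage"])  # 0.75
```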


  • Another useful way to learn about this data set is to generate histograms of the numerical data.

It is clear that "vote_average" has a bell-shaped histogram, which indicates a normal distribution: a probability distribution symmetric around the mean, where data near the mean occur more frequently than data far from the mean.
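The histogram step can be reproduced with pandas' DataFrame.hist(); here synthetic columns stand in for the real ones (the normally distributed "vote_average" mimics the bell shape described above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a bell-shaped 'vote_average', a skewed 'budget'.
movies = pd.DataFrame({
    "vote_average": rng.normal(loc=6.0, scale=1.0, size=4803),
    "budget": rng.exponential(scale=3e7, size=4803),
})
axes = movies.hist(bins=30)  # one histogram per numeric column
print(len(axes.ravel()))     # 2
```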


Building a Machine Learning Linear Regression Model

Notice: before we start building the model, we must ensure the data set has been cleaned, prepared, and feature-selected. For TMDB 5000 Movies, more preparation and data selection has to be done before reaching the building stage, which can be found in the attached link.


1. Splitting our data set into training data and test data

The first thing we need to do is decide which columns will be used as features to make predictions and which column is the target we are trying to predict (remember, our total data still has the features and target appended). Afterwards, we split our data into training data and test data using scikit-learn. To do this, we import the train_test_split function from scikit-learn's model_selection module.
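A minimal sketch of the split (the feature matrix and target here are placeholders; test_size=0.3 and random_state=42 are assumptions, not necessarily the values the original post used):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features/target -- substitute the prepared movie features
# and the column you are predicting (e.g. vote_average).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the rows for testing; fixing random_state makes it repeatable.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train2.shape, X_test2.shape)  # (7, 2) (3, 2)
```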


2. Building and training the model


After importing the LinearRegression estimator from sklearn.linear_model, we create a LinearRegression object, assigned to "regvar" in this tutorial. Then we train the model using scikit-learn's fit method: regvar.fit(X_train2, y_train2).

You can examine the model coefficients using print(regvar.coef_). In the same way, you can see the intercept of the regression equation using print(regvar.intercept_).
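A self-contained sketch of this step, using toy training data generated from a known relationship (y = 3·x1 + 2·x2 + 1, no noise) so the recovered coefficients and intercept are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data generated from y = 3*x1 + 2*x2 + 1 (no noise).
X_train2 = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 3.]])
y_train2 = 3 * X_train2[:, 0] + 2 * X_train2[:, 1] + 1

regvar = LinearRegression()
regvar.fit(X_train2, y_train2)

print(regvar.coef_)       # ~[3. 2.]
print(regvar.intercept_)  # ~1.0
```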


3. Making prediction using our model

Now it is time to use our model for prediction. First, we call the predict method on the "regvar" model we created earlier; at this point "y_pred_regvar" holds the predicted values for the features stored in X_test. Second, we compare the predictions in the "y_pred_regvar" array with the real values held in y_test. An easy way to do this is to plot the two arrays in a scatterplot.
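These two steps can be sketched as follows; the toy data is noiseless, so the predicted and actual values land exactly on the scatterplot's diagonal (the data and axis labels are illustrative, not from the movie data set):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless toy data: y = 2*x, so predictions should be exact.
X_train2 = np.array([[1.], [2.], [3.], [4.]])
y_train2 = np.array([2., 4., 6., 8.])
X_test = np.array([[5.], [6.]])
y_test = np.array([10., 12.])

regvar = LinearRegression().fit(X_train2, y_train2)
y_pred_regvar = regvar.predict(X_test)

# Predicted vs. actual: points on the diagonal mean perfect predictions.
plt.scatter(y_test, y_pred_regvar)
plt.xlabel("actual")
plt.ylabel("predicted")
print(y_pred_regvar)  # ~[10. 12.]
```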


4. Finally, testing the model performance

There are three main performance metrics used for evaluating regression machine learning models:

  • Mean absolute error (MAE)

via metrics.mean_absolute_error(y_test, y_pred_regvar)

  • Mean squared error (MSE)

via metrics.mean_squared_error(y_test, y_pred_regvar)

  • Root mean squared error (RMSE)

via np.sqrt(metrics.mean_squared_error(y_test, y_pred_regvar))
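The three metrics can be computed on a small hypothetical pair of actual/predicted arrays standing in for y_test and y_pred_regvar (note that RMSE is simply the square root of MSE, hence the np.sqrt):

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual vs. predicted values standing in for y_test / y_pred_regvar.
y_test = np.array([3.0, 5.0, 7.0])
y_pred_regvar = np.array([2.5, 5.0, 8.0])

mae = metrics.mean_absolute_error(y_test, y_pred_regvar)   # mean of |error|
mse = metrics.mean_squared_error(y_test, y_pred_regvar)    # mean of error^2
rmse = np.sqrt(mse)                                        # square root of MSE
print(round(mae, 4), round(mse, 4), round(rmse, 4))  # 0.5 0.4167 0.6455
```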

As a final word, linear regression comes in versions such as simple LR and multiple LR, which may require additional techniques that expand the implementation process. In the end, it depends on the data set you are analyzing and the answers you are striving to draw from your data.
