Simple Linear Regression Model Using Python

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), only that there is some significant association between the two variables. A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trend), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, a value between -1 and 1 that indicates the strength of the linear association in the observed data for the two variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

Let’s say I give you the following puzzle:

Given the following values of X and Y, what is the value of Y when X = 5?

(1, 1), (2, 2), (4, 4), (100, 100), (20, 20)

The answer is: 5. Not very difficult, right?

Now, let’s take a look at a different example. Say you have the following pairs of X and Y. Can you calculate the value of Y when X = 5?

(1, 1), (2, 4), (4, 16), (100, 10000), (20, 400)

The answer is: 25. Was it difficult?

Let’s look at what happened in the examples above. In the first example, after looking at the given pairs, one can establish that the relationship between X and Y is Y = X. Similarly, in the second example, the relationship is Y = X*X.

In these two examples, we could work out the value of Y for a new X because the relationship between the two variables (X and Y) was easy to identify from the given pairs. Broadly, machine learning works in the same way: it infers a relationship from known examples and applies it to new inputs.
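As a rough illustration of letting the machine find these relationships, the sketch below fits a straight line to the first puzzle and a quadratic to the second using NumPy's polyfit, then evaluates each fit at X = 5. The variable names are chosen here for illustration, and choosing the polynomial degree by hand is an assumption of this sketch.

```python
import numpy as np

# X values shared by both puzzles.
x = np.array([1, 2, 4, 100, 20], dtype=float)
y_linear = x.copy()      # first puzzle:  Y = X
y_square = x ** 2        # second puzzle: Y = X*X

# Fit a degree-1 polynomial (a straight line) to the first puzzle.
slope, intercept = np.polyfit(x, y_linear, deg=1)
pred_linear = slope * 5 + intercept

# Fit a degree-2 polynomial to the second puzzle.
coeffs = np.polyfit(x, y_square, deg=2)
pred_square = np.polyval(coeffs, 5)

# Both predictions should be very close to the answers found by eye (5 and 25).
print(pred_linear, pred_square)
```

The fitted values agree with the puzzle answers up to floating-point rounding.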

Let’s see how this works in practice.

Suppose we are given a CSV file containing salary data by years of experience. We want to predict salary from the number of years of experience.

Here the dependent variable is salary and the independent variable is the number of years of experience.

The dataset has 30 data points of salary versus experience, and we will build a simple linear regression model to predict salary.

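import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the salary data into a DataFrame
dataset = pd.read_csv("Salary_Data.csv")
# Every column except the last: the explanatory variable, as a 2-D array
X = dataset.iloc[:, :-1].values
# The second column: the dependent variable, as a 1-D array
y = dataset.iloc[:, 1].values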

Assume the data described above is stored in the file Salary_Data.csv. First we load the data into a data frame using pandas. X holds the number of years of experience (the independent variable) and y holds the salary (the dependent variable).
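To make the two iloc selections concrete, here is a small sketch that reads a stand-in CSV from a string. The column names YearsExperience and Salary and all four rows are made up for illustration; the real Salary_Data.csv may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for Salary_Data.csv (column names and values are invented).
csv_text = """YearsExperience,Salary
1.0,40000
2.0,45000
3.0,52000
4.0,58000
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values   # all columns except the last -> 2-D array
y = dataset.iloc[:, 1].values     # second column only -> 1-D array

# Pearson correlation between the two columns, as a quick check of association.
r = dataset["YearsExperience"].corr(dataset["Salary"])

print(X.shape, y.shape)   # (4, 1) (4,)
print(r)
```

X is kept two-dimensional because scikit-learn estimators expect a feature matrix with one row per sample, while y can be a flat array of targets.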


Now we will split the data into a training set and a test set. The training set is used by the machine to learn and build a model, while the test set is used to check the model’s predictions against known outcomes. To split the data we will use the sklearn module.

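from sklearn.model_selection import train_test_split

# Hold out one third of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

Since our CSV file contains 30 data points, we want the training set to have 20 data points and the test set to have 10. That is why we pass test_size=1/3, which gives the test set one third of the data. (Older scikit-learn releases provided train_test_split in sklearn.cross_validation; current releases provide it in sklearn.model_selection.)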
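The idea behind such a split can be sketched with NumPy alone: shuffle the row indices, then hold out one third of them. This is only an illustration of the concept, not scikit-learn's implementation, and it will not select the same rows as train_test_split with random_state=0.

```python
import numpy as np

n_samples = 30
rng = np.random.default_rng(0)          # fixed seed so the split is repeatable
indices = rng.permutation(n_samples)    # shuffled row indices 0..29

n_test = int(np.ceil(n_samples / 3))    # one third of the rows -> 10
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))    # 20 10
```

Every row lands in exactly one of the two sets, so the model is never tested on a point it was trained on.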

Now that the data is split, we will use a linear regression model to train on the training set and predict the outcome for the test set.

from sklearn.linear_model import LinearRegression

# Fitting Simple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

y_pred contains the predictions for our X_test dataset.
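LinearRegression fits by ordinary least squares, which for a single explanatory variable reduces to closed-form expressions for the slope b and intercept a in Y = a + bX. The sketch below computes them by hand with NumPy on made-up numbers (the five data points are hypothetical, not taken from Salary_Data.csv) and then predicts for two unseen inputs.

```python
import numpy as np

# Hypothetical training data: years of experience and salary (invented values).
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([40000.0, 45000.0, 52000.0, 58000.0, 63000.0])

# Least-squares estimates for Y = a + bX.
x_dev = x_train - x_train.mean()
y_dev = y_train - y_train.mean()
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)   # slope
a = y_train.mean() - b * x_train.mean()          # intercept

# Predict salaries for inputs the fit has not seen.
x_test = np.array([2.5, 6.0])
y_pred = a + b * x_test

print(a, b)
print(y_pred)
```

With these numbers the slope works out to 5900 per year of experience and the intercept to 33900, so the two predictions are 48650 and 69300.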

We will plot the model’s predictions using the matplotlib module to see how well the fitted regression line follows the training set.

# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')                    # actual training points
plt.plot(X_train, regressor.predict(X_train), color = 'blue')   # fitted regression line
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Now we will plot the test set against the same regression line to check how good the predictions are.

# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')                      # actual test points
# Same fitted line as before: the model was trained only on the training set
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

The red dots are our test data and the blue line is our prediction, which matches most of the points closely. For a couple of data points, the prediction is far from the actual value.
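Judging the fit by eye can be backed up with numbers. The sketch below computes the mean absolute error and the coefficient of determination (R squared) with NumPy. The actual and predicted salaries here are invented for illustration; in the walkthrough above they would be y_test and y_pred.

```python
import numpy as np

# Hypothetical actual and predicted salaries for a test set (invented values).
y_test = np.array([40000.0, 60000.0, 80000.0, 100000.0])
y_pred = np.array([42000.0, 59000.0, 85000.0, 98000.0])

residuals = y_test - y_pred

# Mean absolute error: the average size of a miss, in salary units.
mae = np.mean(np.abs(residuals))

# R squared: share of the variance in y_test explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

For these made-up numbers the mean absolute error is 2500 and R squared is 0.983. Large individual residuals point to the test points that sit far from the line.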

