Supervised Learning: Linear Regression
https://analyticsindiamag.com/beginners-guide-to-linear-regression-in-python/

Linear regression is a fundamental statistical technique in machine learning and data analysis that models the relationship between a dependent variable and one or more independent variables. It’s a go-to method for predictive modeling, thanks to its simplicity and interpretability. In this blog, we’ll dive deep into the concepts, implementation, and practical applications of linear regression.

What is Linear Regression?

At its core, linear regression aims to fit a straight line through a set of data points in such a way that the line best represents the data. The equation of a simple linear regression line is:

y = β₀ + β₁x + ε

  • y: Dependent variable (response)
  • x: Independent variable (predictor)
  • β₀: Intercept (the value of y when x is 0)
  • β₁: Slope (the change in y for a one-unit change in x)
  • ε: Error term (the difference between the observed and predicted values)
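To make the equation concrete, here's a quick sketch that plugs hypothetical coefficients into it (the numbers are illustrative, not estimated from any dataset):

```python
# Hypothetical simple linear regression: y = beta_0 + beta_1 * x
beta_0 = 30000   # intercept: predicted y when x is 0
beta_1 = 5000    # slope: change in y per one-unit change in x

x = 4            # e.g. four years of experience
y_hat = beta_0 + beta_1 * x
print(y_hat)     # 30000 + 5000 * 4 = 50000
```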

In multiple linear regression, where there are multiple predictors, the equation extends to:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

Here, each xᵢ represents an independent variable, and each βᵢ represents the corresponding coefficient.
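Under the hood, these coefficients are typically estimated by ordinary least squares. As a minimal sketch using synthetic data where the true relationship is known (y = 2 + 3x plus noise), NumPy's least-squares solver recovers the intercept and slope:

```python
import numpy as np

# Synthetic data: true relationship is y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 0.5, size=100)

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"beta_0 ≈ {beta[0]:.2f}, beta_1 ≈ {beta[1]:.2f}")
```

The recovered values should land close to the true intercept of 2 and slope of 3; libraries like scikit-learn do essentially this fitting step for you.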

Assumptions of Linear Regression

For linear regression to be effective, certain assumptions must be met:

  1. Linearity: The relationship between the dependent and independent variables should be linear.
  2. Independence: Observations should be independent of each other.
  3. Homoscedasticity: The residuals (errors) should have constant variance at every level of x.
  4. Normality: The residuals should be normally distributed.

Violations of these assumptions can lead to unreliable results.

Implementing Linear Regression in Python

Python provides robust libraries like scikit-learn for implementing linear regression. Here’s a step-by-step guide:

Import Libraries:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt


Load Data:

Assume we have a dataset data.csv with columns YearsExperience (independent variable) and Salary (dependent variable).

data = pd.read_csv('data.csv')
X = data[['YearsExperience']]
y = data['Salary']

Split Data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train Model:

model = LinearRegression()
model.fit(X_train, y_train)

Make Predictions:

y_pred = model.predict(X_test)

Evaluate Model:

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Visualize Results:

plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()

Interpreting the Results

  • Coefficients: The coefficients β₀ and β₁ provide insights into the relationship between the variables. For example, if β₁ is 5000, it indicates that each additional year of experience is associated with a $5000 increase in salary.
  • R-squared: This value ranges from 0 to 1 and indicates how well the independent variables explain the variability of the dependent variable. An R² of 0.8 means that 80% of the variability in salary can be explained by years of experience.
  • Residuals: Analyzing residuals helps check the assumptions of homoscedasticity and normality. Patterns in residual plots suggest model issues.
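As a minimal sketch of residual analysis, here synthetic data stands in for the salary example above (the coefficients and noise level are illustrative assumptions, not values from data.csv):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the salary data: salary ≈ 30000 + 5000 * years
rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(80, 1))                          # years of experience
y = 30000 + 5000 * X.ravel() + rng.normal(0, 3000, size=80)   # salary with noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Homoscedasticity check: plotted against predictions, the residuals
# should scatter evenly around zero with no funnel shape.
print(f"Mean residual: {residuals.mean():.2f}")  # ~0 by construction for OLS
print(f"Residual std:  {residuals.std():.2f}")
```

Plotting `residuals` against `model.predict(X)` (e.g. with matplotlib) is the usual next step; a curve or funnel in that plot signals a violated assumption.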

Applications of Linear Regression

Linear regression is widely used across various domains:

  • Economics: Predicting economic indicators like GDP, inflation rates, etc.
  • Finance: Estimating stock prices, risk management, and portfolio optimization.
  • Marketing: Analyzing the impact of advertising spend on sales.
  • Healthcare: Predicting patient outcomes based on medical history.


One of the most common and practical applications of linear regression is predicting housing prices. The real estate industry relies heavily on accurate price predictions to make informed decisions. Let's walk through a detailed use case to understand how linear regression can be applied to this problem.

Problem Statement

We want to predict the price of houses based on various features such as the size of the house, number of bedrooms, location, and age of the house. The goal is to build a model that can accurately predict the price of a house given these features.

Dataset

Assume we have a dataset named housing.csv with the following columns:

  • Size: Square footage of the house
  • Bedrooms: Number of bedrooms
  • Age: Age of the house in years
  • Location: Categorical variable indicating the neighborhood
  • Price: Price of the house


Steps to Build the Model

  1. Data Preprocessing: Load the data and encode the categorical Location column as dummy variables.
  2. Splitting the Data: Hold out a test set for unbiased evaluation.
  3. Training the Model: Fit a LinearRegression model on the training set.
  4. Evaluating the Model: Compute MSE and R-squared on the test set.
  5. Making Predictions: Predict the price of a new, unseen house.

Let's implement these steps in Python using pandas and scikit-learn.

Implementation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder

# Step 1: Load and preprocess the data
data = pd.read_csv('housing.csv')

# Handle categorical variable: Location
data = pd.get_dummies(data, columns=['Location'], drop_first=True)

# Split features and target variable
X = data.drop('Price', axis=1)
y = data['Price']

# Step 2: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 4: Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Step 5: Making predictions
# The new example must have exactly the columns the model was trained on.
# Reindexing against X.columns fills any missing dummy columns with 0.
new_house = pd.DataFrame({
    'Size': [2500],
    'Bedrooms': [3],
    'Age': [15],
    'Location_Suburb': [1]  # Assuming the new house is in the suburb
})
new_house = new_house.reindex(columns=X.columns, fill_value=0)

predicted_price = model.predict(new_house)
print(f'Predicted Price: ${predicted_price[0]:.2f}')

Interpreting the Results

  • Mean Squared Error (MSE): This metric tells us how close the predicted prices are to the actual prices. A lower MSE indicates a better fit.
  • R-squared: This metric indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. An R-squared value closer to 1 implies a better fit.
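Because MSE is expressed in squared units of the target (squared dollars here), its square root, RMSE, is often easier to interpret. A quick sketch with made-up prices (not from housing.csv):

```python
from sklearn.metrics import mean_squared_error

# Illustrative actual vs. predicted prices
y_test = [300_000, 450_000, 250_000]
y_pred = [310_000, 440_000, 265_000]

mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # back on the dollar scale
print(f'MSE:  {mse:,.0f}')
print(f'RMSE: {rmse:,.0f}')  # typical error of roughly $11,900 per house
```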

Business Impact

Using a linear regression model to predict housing prices can significantly benefit various stakeholders:

  1. Real Estate Agents: Can provide clients with accurate price estimates, helping them make informed decisions.
  2. Home Buyers and Sellers: Can use the predictions to understand market value and negotiate better deals.
  3. Investors: Can identify undervalued properties and potential investment opportunities.
  4. Banks and Mortgage Lenders: Can assess property values more accurately, aiding in loan approvals and risk management.


Linear regression is a powerful and intuitive tool for predictive modeling and data analysis. While simple to implement, it provides valuable insights and a strong foundation for more complex machine learning techniques. By understanding its assumptions and applications, you can leverage linear regression to make informed decisions and predictions in various fields.


Author

Nadir Riyani is an accomplished and visionary Engineering Manager specialising in AI/ML technologies. With a wealth of experience leading high-performing engineering teams, Nadir is passionate about leveraging artificial intelligence and machine learning to drive innovation and solve complex challenges. His expertise spans software development principles, encompassing Agile, Automation, and DevOps methodologies. Nadir's commitment to engineering excellence and ability to align technical strategies with business objectives make him a valuable asset to any organization. For further inquiries, please feel free to reach out to him at riyaninadir@gmail.com.
