Supervised Learning: Linear Regression
https://analyticsindiamag.com/beginners-guide-to-linear-regression-in-python/

Linear regression is a fundamental statistical technique in machine learning and data analysis that models the relationship between a dependent variable and one or more independent variables. It’s a go-to method for predictive modeling, thanks to its simplicity and interpretability. In this blog, we’ll dive deep into the concepts, implementation, and practical applications of linear regression.

What is Linear Regression?

At its core, linear regression aims to fit a straight line through a set of data points in such a way that the line best represents the data. The equation of a simple linear regression line is:

y = β₀ + β₁x + ε

  • y: Dependent variable (response)
  • x: Independent variable (predictor)
  • β₀: Intercept (the value of y when x is 0)
  • β₁: Slope (the change in y for a one-unit change in x)
  • ε: Error term (the difference between the observed and predicted values)
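To make the equation concrete, here's a quick sketch that plugs hypothetical coefficients into it (the numbers are illustrative, not estimated from any dataset):

```python
# Hypothetical simple linear regression: y = beta_0 + beta_1 * x
beta_0 = 30000   # intercept: predicted y when x is 0
beta_1 = 5000    # slope: change in y per one-unit change in x

x = 4            # e.g. four years of experience
y_hat = beta_0 + beta_1 * x
print(y_hat)     # 30000 + 5000 * 4 = 50000
```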

In multiple linear regression, where there are multiple predictors, the equation extends to:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

Here, each xᵢ represents an independent variable, and each βᵢ represents the corresponding coefficient.
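Under the hood, these coefficients are typically estimated by ordinary least squares. As a minimal sketch using synthetic data where the true relationship is known (y = 2 + 3x plus noise), NumPy's least-squares solver recovers the intercept and slope:

```python
import numpy as np

# Synthetic data: true relationship is y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 0.5, size=100)

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"beta_0 ≈ {beta[0]:.2f}, beta_1 ≈ {beta[1]:.2f}")
```

The recovered values should land close to the true intercept of 2 and slope of 3; libraries like scikit-learn do essentially this fitting step for you.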

Assumptions of Linear Regression

For linear regression to be effective, certain assumptions must be met:

  1. Linearity: The relationship between the dependent and independent variables should be linear.
  2. Independence: Observations should be independent of each other.
  3. Homoscedasticity: The residuals (errors) should have constant variance at every level of x.
  4. Normality: The residuals should be normally distributed.

Violations of these assumptions can lead to unreliable results.

Implementing Linear Regression in Python

Python provides robust libraries like scikit-learn for implementing linear regression. Here’s a step-by-step guide:

Import Libraries:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt


Load Data:

Assume we have a dataset data.csv with columns YearsExperience (independent variable) and Salary (dependent variable).

data = pd.read_csv('data.csv')
X = data[['YearsExperience']]
y = data['Salary']

Split Data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train Model:

model = LinearRegression()
model.fit(X_train, y_train)

Make Predictions:

y_pred = model.predict(X_test)

Evaluate Model:

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Visualize Results:

plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()

Interpreting the Results

  • Coefficients: The coefficients β₀ and β₁ provide insights into the relationship between the variables. For example, if β₁ is 5000, it indicates that each additional year of experience is associated with a $5000 increase in salary.
  • R-squared: This value ranges from 0 to 1 and indicates how well the independent variables explain the variability of the dependent variable. An R² of 0.8 means that 80% of the variability in salary can be explained by years of experience.
  • Residuals: Analyzing residuals helps check the assumptions of homoscedasticity and normality. Patterns in residual plots suggest model issues.
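As a minimal sketch of residual analysis, here synthetic data stands in for the salary example above (the coefficients and noise level are illustrative assumptions, not values from data.csv):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the salary data: salary ≈ 30000 + 5000 * years
rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(80, 1))                          # years of experience
y = 30000 + 5000 * X.ravel() + rng.normal(0, 3000, size=80)   # salary with noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Homoscedasticity check: plotted against predictions, the residuals
# should scatter evenly around zero with no funnel shape.
print(f"Mean residual: {residuals.mean():.2f}")  # ~0 by construction for OLS
print(f"Residual std:  {residuals.std():.2f}")
```

Plotting `residuals` against `model.predict(X)` (e.g. with matplotlib) is the usual next step; a curve or funnel in that plot signals a violated assumption.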

Applications of Linear Regression

Linear regression is widely used across various domains:

  • Economics: Predicting economic indicators like GDP, inflation rates, etc.
  • Finance: Estimating stock prices, risk management, and portfolio optimization.
  • Marketing: Analyzing the impact of advertising spend on sales.
  • Healthcare: Predicting patient outcomes based on medical history.


One of the most common and practical applications of linear regression is predicting housing prices. The real estate industry relies heavily on accurate price predictions to make informed decisions. Let's walk through a detailed use case to understand how linear regression can be applied to this problem.

Problem Statement

We want to predict the price of houses based on various features such as the size of the house, number of bedrooms, location, and age of the house. The goal is to build a model that can accurately predict the price of a house given these features.

Dataset

Assume we have a dataset named housing.csv with the following columns:

  • Size: Square footage of the house
  • Bedrooms: Number of bedrooms
  • Age: Age of the house in years
  • Location: Categorical variable indicating the neighborhood
  • Price: Price of the house


Steps to Build the Model

  1. Data Preprocessing: Load the data and encode the categorical Location column as dummy variables.
  2. Splitting the Data: Hold out a test set for unbiased evaluation.
  3. Training the Model: Fit a LinearRegression model on the training set.
  4. Evaluating the Model: Compute MSE and R-squared on the test set.
  5. Making Predictions: Predict the price of a new, unseen house.

Let's implement these steps in Python using pandas and scikit-learn.

Implementation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder

# Step 1: Load and preprocess the data
data = pd.read_csv('housing.csv')

# Handle categorical variable: Location
data = pd.get_dummies(data, columns=['Location'], drop_first=True)

# Split features and target variable
X = data.drop('Price', axis=1)
y = data['Price']

# Step 2: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 4: Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Step 5: Making predictions
# The new example must have exactly the columns the model was trained on.
# Reindexing against X.columns fills any missing dummy columns with 0.
new_house = pd.DataFrame({
    'Size': [2500],
    'Bedrooms': [3],
    'Age': [15],
    'Location_Suburb': [1]  # Assuming the new house is in the suburb
})
new_house = new_house.reindex(columns=X.columns, fill_value=0)

predicted_price = model.predict(new_house)
print(f'Predicted Price: ${predicted_price[0]:.2f}')

Interpreting the Results

  • Mean Squared Error (MSE): This metric tells us how close the predicted prices are to the actual prices. A lower MSE indicates a better fit.
  • R-squared: This metric indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. An R-squared value closer to 1 implies a better fit.
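Because MSE is expressed in squared units of the target (squared dollars here), its square root, RMSE, is often easier to interpret. A quick sketch with made-up prices (not from housing.csv):

```python
from sklearn.metrics import mean_squared_error

# Illustrative actual vs. predicted prices
y_test = [300_000, 450_000, 250_000]
y_pred = [310_000, 440_000, 265_000]

mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # back on the dollar scale
print(f'MSE:  {mse:,.0f}')
print(f'RMSE: {rmse:,.0f}')  # typical error of roughly $11,900 per house
```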

Business Impact

Using a linear regression model to predict housing prices can significantly benefit various stakeholders:

  1. Real Estate Agents: Can provide clients with accurate price estimates, helping them make informed decisions.
  2. Home Buyers and Sellers: Can use the predictions to understand market value and negotiate better deals.
  3. Investors: Can identify undervalued properties and potential investment opportunities.
  4. Banks and Mortgage Lenders: Can assess property values more accurately, aiding in loan approvals and risk management.


Linear regression is a powerful and intuitive tool for predictive modeling and data analysis. While simple to implement, it provides valuable insights and a strong foundation for more complex machine learning techniques. By understanding its assumptions and applications, you can leverage linear regression to make informed decisions and predictions in various fields.


Author

Nadir Riyani is an accomplished and visionary Engineering Manager specialising in AI/ML technologies. With a wealth of experience leading high-performing engineering teams, Nadir is passionate about leveraging artificial intelligence and machine learning to drive innovation and solve complex challenges. His expertise spans software development principles, encompassing Agile, Automation, and DevOps methodologies. Nadir's commitment to engineering excellence and ability to align technical strategies with business objectives make him a valuable asset to any organization. For further inquiries, please feel free to reach out to him at riyaninadir@gmail.com.
