House price predictions with Python

Apipoj Piasak

Published Aug 6, 2017

When it comes to predicting something in the statistics and data science world, one of the most widely used models is Linear Regression.

Linear Regression is a supervised machine learning method that predicts a real-valued output from a given set of input samples.

This article shows how to apply Linear Regression to a real-world example using Python.

Data set

I will use the Boston housing dataset from the scikit-learn library.

First, let's import all the required libraries: Pandas, NumPy, Matplotlib and Seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Next, load the Boston dataset, which can be imported directly from scikit-learn.

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
boston = load_boston()

Convert the Boston dataset to a Pandas DataFrame.

boston_df = pd.DataFrame(boston.data)
# Name the columns using the dataset's feature names
boston_df.columns = boston.feature_names

Exploratory Data Analysis

Show some sample rows and check the data inside the Boston dataset.

boston_df.head()

# Show dataframe info, data types and field names
boston_df.info()

Features description:

Training a Linear Regression Model

As you can see, the dataset contains only the features of the houses; it does not yet include the price. To build a prediction model we need X to hold the features and their values, and y to hold the target, in this case the house price, which is stored in boston.target.

X = boston_df
y = boston.target

Train Test Split

Now let's split the data into a training set and a testing set. This is a common step when you work on a Machine Learning project. Here we split the data into two groups: 70% for the training set and the remaining 30% for the test set.

This is quite easy to do in scikit-learn with the train_test_split function from the model_selection module (older scikit-learn versions exposed it in the cross_validation module, which has since been removed).

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                      test_size=0.3, random_state=42)
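To make the split less of a black box, here is a minimal sketch of what a 70/30 shuffle-and-split does, written with NumPy only. The helper name simple_split and the toy arrays are hypothetical and made up for illustration; this shows the idea, not scikit-learn's actual implementation, and it will not reproduce the same rows as train_test_split.

```python
import numpy as np

def simple_split(X, y, test_size=0.3, seed=42):
    """Shuffle the row indices, then cut them into a test part and a train part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))  # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows with 2 features each (made up for illustration)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

Every row ends up in exactly one of the two sets, and fixing the seed makes the split repeatable, which is the same role random_state plays above.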

Now you have training and test datasets to create a Linear Regression model.

Creating and Training the Model

Import LinearRegression model from scikit-learn.

from sklearn.linear_model import LinearRegression

# Create new instance
lm = LinearRegression()

# Train/fit lm on the training data.
lm.fit(X_train, y_train)
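Conceptually, fitting a linear regression means finding the intercept and coefficients that minimise the squared error between predictions and targets (ordinary least squares). The sketch below shows that idea with NumPy on a tiny made-up dataset; the variable names and numbers are assumptions for illustration, not part of the Boston example.

```python
import numpy as np

# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise (made up for illustration)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

# Add a column of ones so the first fitted weight acts as the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])

# Solve the least-squares problem: minimise ||A w - y||^2
w, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
intercept, coefs = w[0], w[1:]

print("intercept:", intercept)
print("coefficients:", coefs)
```

Because the toy targets follow the line exactly, the solver recovers an intercept close to 1 and coefficients close to 2 and 3. With real, noisy data the fit is only the best compromise.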

Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

coefficients = pd.DataFrame(lm.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients

From the coefficient table above, and keeping in mind that the target is expressed in thousands of dollars, we can read the coefficients as follows (holding all other features fixed):

If RM (the number of rooms) increases by 1 unit, the predicted house price increases by $4,048.

If the house bounds the Charles River, the predicted price increases by $3,121.

If NOX (nitric oxides concentration) increases by 1 unit, the predicted price drops by $15,469.

If the distance from the Boston employment areas increases by 1 unit, the predicted price tends to decrease by $1,386.
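The reading of a coefficient as "price change per one-unit increase" can be checked with a small sketch. The intercept here is a hypothetical value and only two features are used; the two coefficients are the RM and river figures quoted above, expressed in thousands of dollars.

```python
import numpy as np

# Hypothetical intercept; coefficients taken from the RM and river figures above
intercept = 10.0
coefs = np.array([4.048, 3.121])  # [rooms, bounds the river], in thousands of dollars

def predict(features):
    """Linear model: intercept plus the weighted sum of the features."""
    return intercept + np.dot(features, coefs)

base = np.array([6.0, 0.0])           # 6 rooms, not on the river
one_more_room = np.array([7.0, 0.0])  # same house with one extra room

# The difference between the two predictions equals the rooms coefficient
diff = predict(one_more_room) - predict(base)
print(f"Change in predicted price: ${diff * 1000:,.0f}")
```

The intercept cancels out in the difference, which is why a coefficient can be read on its own as long as the other features stay the same.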

Predictions from our Model

Now, let's predict our test dataset using the model we have just created.

predictions = lm.predict(X_test)

# Create a scatterplot of the real test values versus the predicted values.
plt.scatter(y_test, predictions, s=5)
plt.xlabel('Real Price')
plt.ylabel('Predicted Price')
plt.title('Real vs Predicted Housing Prices')

You can see that some predicted values are far from the real price. In the bottom-left you can also see that our model predicts a negative price, which is not possible in the real world. The differences between the real and predicted values are called residuals.
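As a minimal sketch of these two ideas, the snippet below computes residuals and flags negative predictions. The arrays are made-up numbers in thousands of dollars, not values from the Boston test set.

```python
import numpy as np

# Made-up real and predicted prices, in thousands of dollars
y_true = np.array([24.0, 21.6, 34.7, 7.0])
y_pred = np.array([30.0, 25.0, 30.5, -1.5])

# Residual = real value minus predicted value
residuals = y_true - y_pred
print("residuals:", residuals)

# Flag predictions that are impossible in practice (negative prices)
negative_mask = y_pred < 0
print("negative predictions:", y_pred[negative_mask])
```

A positive residual means the model predicted too low, a negative one means it predicted too high.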

Let's check the model score on the test set. For a regression model, score returns the coefficient of determination (R²).

lm.score(X_test, y_test)

0.71092035863263514

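Our model explains about 71% of the variance in the test prices (R² ≈ 0.71). Note that this is not an accuracy in the classification sense.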
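For a regression model, the score method reports R², the coefficient of determination. Here is a short sketch of how that number is computed, using made-up values; the helper name r_squared is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up real and predicted values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

print("R^2:", r_squared(y_true, y_pred))
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the real values.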

Evaluate the prediction

# Calculate MSE and RMSE using scikit-learn's metrics module
from sklearn import metrics

print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Mean Square Error (MSE)

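MSE measures how close the fitted line is to the data points. For every data point, you take the vertical distance from the point to the fitted line (the residual), square it, and then average those squared values.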

MSE: 21.5402189439

Our MSE is about 21.5, which is good enough for a novice like me :-)

Root Mean Square Error (RMSE)

RMSE is the square root of the MSE. It is easier to interpret because it is expressed in the same unit as y, which in this project is the house price (in thousands of USD).

RMSE: 4.64114414169

This means that when we use our model to predict a house price, the typical error is around $4,641 per prediction.
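Both metrics are simple enough to calculate by hand. The sketch below does so with NumPy on made-up prices in thousands of dollars, and converts the RMSE to dollars the same way as above.

```python
import numpy as np

# Made-up real and predicted prices, in thousands of dollars
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([26.0, 20.6, 31.7, 35.4])

errors = y_true - y_pred    # residual for every data point
mse = np.mean(errors ** 2)  # average of the squared residuals
rmse = np.sqrt(mse)         # back in the unit of y

print("MSE:", mse)
print("RMSE:", rmse)
print(f"RMSE in dollars: ${rmse * 1000:,.0f}")
```

Squaring penalises large errors more than small ones, so a few badly predicted houses can raise both numbers noticeably.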

This is a sample project using Linear Regression with Python. You can find the full code here: https://github.com/apiasak/DataScientist/blob/master/2%20Realworld%20Regression/Predicted%20Boston%20Housing%20Prices.ipynb

Note: Don't get confused by the cover photo; it is actually Cambridge, not Boston :P
