House price predictions with Python
When it comes to predicting something in the statistics and data science world, one of the best-known models in use today is Linear Regression.
Linear Regression is a supervised machine learning technique that predicts a real-valued output from a given set of input samples.
This post shows how to apply Linear Regression to a real-world example using Python.
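To make the idea concrete, here is a minimal sketch of a one-feature linear regression fitted with the closed-form least-squares formulas. The data points and the fit_line helper are made up for illustration; they are not part of the Boston example that follows, where scikit-learn handles many features at once.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: number of rooms vs. price (in $1000s)
rooms = [4, 5, 6, 7, 8]
prices = [10, 15, 20, 25, 30]

slope, intercept = fit_line(rooms, prices)
print(f"price = {slope:.2f} * rooms + {intercept:.2f}")
```

Once the slope and intercept are known, a prediction for a new house is just slope * rooms + intercept.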
Data set
I will use the Boston housing dataset from the scikit-learn library.
First, let's import all the required libraries: pandas, NumPy, Matplotlib and Seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Next, import the Boston dataset directly from scikit-learn. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older version of the library.
from sklearn.datasets import load_boston
boston = load_boston()
Convert the Boston dataset to a pandas DataFrame.
boston_df = pd.DataFrame(boston.data)
# Replace the default integer column labels with the dataset's feature names
boston_df.columns = boston.feature_names
Exploratory Data Analysis
Show some sample rows and check what the Boston dataset contains.
boston_df.head()
# Show dataframe info, data types and field names
boston_df.info()
Features description (the full text ships with the dataset as boston.DESCR):
Training a Linear Regression Model
As you can see, the DataFrame contains only the features of each house; it does not yet have the price information. To create a prediction model we need X to store the features and their values, and y to store the target, which in this case is the house price held in boston.target.
X = boston_df
y = boston.target
Train Test Split
Now let's split the data into a training set and a testing set. This is a common step in any machine learning project. Here we will split the data into two groups: 70% for the training set and the remaining 30% for the test set.
In scikit-learn this is quite easy to do: we can use the train_test_split function from the model_selection module (in scikit-learn versions before 0.20 it was also available from the cross_validation module, which has since been removed).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
Now you have training and test datasets for building a Linear Regression model.
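If you are curious what a 70/30 split amounts to, here is a rough library-free sketch. The simple_train_test_split helper is hypothetical and is not how scikit-learn implements train_test_split; it only illustrates the idea of shuffling the rows with a fixed seed and cutting them into two parts. Rounding the test size up is an assumption of this sketch.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.3, seed=42):
    """Shuffle the row indices and cut them into a training and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed -> reproducible split
    # Rounding the test size up is a choice made for this sketch
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Hypothetical stand-in for the 506 rows of the Boston dataset
rows = list(range(506))
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as random_state=42 above: running the split again gives the same two groups, so results stay comparable between runs.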
Creating and Training the Model
Import the LinearRegression model from scikit-learn.
from sklearn.linear_model import LinearRegression
# Create new instance
lm = LinearRegression()
# Train/fit lm on the training data.
lm.fit(X_train, y_train)
Model Evaluation
Let's evaluate the model by checking out its coefficients and how we can interpret them.
# One row per feature, holding the coefficient the model learned for it
coefficients = pd.DataFrame(lm.coef_, index=X.columns)
coefficients.columns = ['Coefficient']
coefficients
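From the coefficient table above we can make the following interpretations, each holding all other features fixed (the target is the median house price in thousands of dollars):
If RM (average number of rooms) increases by 1 unit, the predicted house price increases by about $4,048
If the house's tract bounds the Charles River (CHAS), the predicted price increases by about $3,121
If NOX (nitric oxides concentration) increases by 1 unit, the predicted price goes down by about $15,469
If DIS (distance to the Boston employment centres) increases by 1 unit, the predicted price tends to decrease by about $1,386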
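The arithmetic behind these statements can be sketched as follows. The coefficient values are read back from the dollar figures quoted above, assuming the target is expressed in thousands of dollars; the price_change_usd helper is hypothetical and only there for illustration.

```python
# Coefficients implied by the figures in the text, in $1000s per unit of the feature
coefficients = {"RM": 4.048, "CHAS": 3.121, "NOX": -15.469, "DIS": -1.386}


def price_change_usd(feature, delta=1.0):
    """Predicted price change in USD when `feature` moves by `delta`, others held fixed."""
    return coefficients[feature] * delta * 1000


for name in coefficients:
    print(f"{name}: {price_change_usd(name):+,.0f} USD per unit")
```

Because the model is linear, the change in the prediction depends only on the coefficient and the size of the step, not on the starting values of the features.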
Predictions from our Model
Now, let's predict on our test dataset using the model we have just created.
predictions = lm.predict(X_test)
# Create a scatterplot of the real test values versus the predicted values.
plt.scatter(y_test, predictions, s=5)
plt.xlabel('Real Prices')
plt.ylabel('Predicted Prices')
plt.title('Real vs Predicted Housing Prices')
You can see that some predicted values are far from the real prices. In the bottom-left you can see that our model even predicts a negative price, which is not possible in the real world. The difference between a real value and its predicted value is called a residual.
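Here is a small sketch of how residuals are computed and how impossible predictions can be spotted. The numbers are invented for illustration and do not come from the Boston test set.

```python
# Hypothetical real and predicted prices, in $1000s
real = [24.0, 21.5, 34.75, 7.0]
predicted = [28.5, 23.0, 31.25, -1.5]

# Residual = real value minus predicted value
residuals = [r - p for r, p in zip(real, predicted)]

# A price below zero cannot happen in the real world, so flag those predictions
negative = [p for p in predicted if p < 0]

print("Residuals:", residuals)
print("Negative predictions:", negative)
```

With the real model, the same idea is y_test - predictions, which gives one residual per test row.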
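Let's check the model's score on the test set.
lm.score(X_test, y_test)
0.71092035863263514
For a regression model, score returns the coefficient of determination (R²) rather than an accuracy, so our model explains about 71% of the variance in the test-set prices.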
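For a regression model, the score is the coefficient of determination R², which compares the model's squared errors with those of always predicting the mean. A minimal by-hand sketch, using invented values and a hypothetical r2_score_by_hand helper:

```python
def r2_score_by_hand(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values, in $1000s
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(r2_score_by_hand(y_true, y_pred))
```

A value of 1 means perfect predictions, while 0 means the model does no better than predicting the mean price every time.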
Evaluate the prediction
# Calculate the error metrics with scikit-learn
from sklearn import metrics
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
Mean Square Error (MSE)
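MSE is a measure of how close a fitted line is to the data points. For every data point, take the difference between the real value and the predicted value, square it, and then average those squared differences over all points.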
MSE: 21.5402189439
Our MSE is about 21.5, which is good enough for a novice like me :-)
Root Mean Square Error (RMSE)
RMSE is the square root of MSE. It is easier to interpret because it is expressed in the same unit as y, which in this project is the house price (in thousands of USD).
RMSE: 4.64114414169
This means that when we use our model to predict a house price, the error is typically around $4,641 per prediction.
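Both metrics are simple enough to calculate by hand. The sketch below uses invented values and two hypothetical helpers, mse and rmse; the last line checks that the RMSE reported above is indeed the square root of the reported MSE.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between real and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical values, in $1000s
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print("MSE:", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))

# Square root of the MSE value reported in this post
print(math.sqrt(21.5402189439))
```

Since the target is in thousands of dollars, an RMSE of about 4.641 corresponds to roughly $4,641.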
This is a sample project using Linear Regression with Python. You can find the full code here: https://github.com/apiasak/DataScientist/blob/master/2%20Realworld%20Regression/Predicted%20Boston%20Housing%20Prices.ipynb
Note: Don't get confused by the cover photo, it is actually Cambridge, not Boston :P