Predicting Property Prices: a simple Machine Learning Linear Regression Model on Databricks and SparkML
In this article, I will show how to predict property prices using Databricks. I will pick up on a previous experiment (1) that used Spark proper and a dataset well known to Data Scientists, the famous Boston Housing dataset (2).
1. The source dataset
First not-so-good surprise: the dataset is not a CSV file, but a webpage formatted for human reading, not computer processing!
The property price is in the last column, MEDV. We are going to try to predict that column using the preceding 12 parameters (CRIM to LSTAT).
Had it been a big file, the first task would have been to clean up the data programmatically. Fortunately, it has only 506 lines of data, and I find it quicker to reformat it manually using Notepad++. The result is as follows, much more usable.
Excessive line feeds have been removed, as have redundant spaces; the fields have been separated by tabs, and a header has been provided, replacing the original explanation text block. This file is saved as boston_housing_dataset.csv and is now ready to be used. I am uploading it from my laptop to the Azure data lake using Azure Storage Explorer, into the lab-mec folder under the raw blob container.
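If you would rather script that cleanup than edit by hand, a minimal sketch could look like the following. It assumes each record in the raw page is split across two physical lines (as in the original file); the function name and the tiny inline sample are my own illustration, not part of the original workflow.

```python
def reformat_boston(raw_text: str) -> str:
    """Join each pair of physical lines into one record and
    collapse runs of whitespace into single tabs."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    records = []
    # Assumption: every record is split across two consecutive lines
    # in the raw file, so rejoin them pairwise.
    for i in range(0, len(lines), 2):
        record = " ".join(lines[i:i + 2])
        records.append("\t".join(record.split()))
    return "\n".join(records)

# Tiny synthetic example in the same two-lines-per-record layout:
raw = "0.00632  18.00   2.310\n396.90   4.98  24.00\n"
print(reformat_boston(raw))
# → 0.00632	18.00	2.310	396.90	4.98	24.00
```

A header line would still need to be prepended before saving the result as boston_housing_dataset.csv.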
Now that we have our source data ready and loaded, it's time to switch to Databricks.
First, let's mount the datalake container "raw" into "/mnt/raw" in dbfs.
Let's confirm that our boston_housing_dataset.csv file can be found in DBFS in the /mnt/raw/lab-mec folder where we put it.
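The mount-and-check steps might look like the sketch below. It only runs inside a Databricks notebook (it relies on dbutils and display), and the storage account name, tenant id, application id, and secret scope/key are all placeholders you would replace with your own values.

```python
# Placeholder credentials: replace <storage-account>, <application-id>,
# <tenant-id>, <scope> and <key> with your own values.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<scope>", "<key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the "raw" container into /mnt/raw in DBFS.
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

# Confirm the uploaded file is visible.
display(dbutils.fs.ls("/mnt/raw/lab-mec"))
```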
All good so far.
2. Preparing the data for the ML model
Let's now load the file in a dataframe "data". We ask Databricks to use tab as a delimiter, to consider the first line as a header, and to infer the schema.
Let's display the data in this dataframe "data":
All the fields are there, as we expect them.
We now need to divide the columns into 2 groups, Features and Labels. Features is the data that the prediction will be based on, Labels is the result of the prediction.
In our example, the features are the columns from CRIM to LSTAT, and the label is the last column MEDV, which contains the property price.
What we are trying to achieve is to predict the Labels from the Features.
To create a features array, we use the VectorAssembler class.
First, we import it.
Then we pass it the list of column names. Here we pass all the columns except the last one. The output column "features" is the result of assembling the feature columns.
Now we put VectorAssembler into action using the "transform" method, and load the result into the dataframe data_2
If we display data_2, we can see that an extra column "features" was added to the dataset, combining the values of the first 12 columns.
Now we are going to split the rows of the dataset into 2 subsets, "train" and "test", with a proportion of respectively 70% and 30%. The train subset will be used to train the model, the test one to predict the labels from the features.
3. The training of the model
It's now time to train our model on our data.
First we import the LinearRegression class
Then we define the variable "lr", a LinearRegression estimator configured with the feature column "features" and the label column "MEDV".
We now call the fit method on lr, passing it the train dataset, to obtain the trained model.
We can now evaluate the model's performance on the test dataset using the evaluate method.
The evaluation summary's properties give us access to a number of metrics.
4. House price predictions
Time now to make our predictions. Let's call the transform method on the test dataset.
prediction is a dataframe containing the original columns, the features column, and a new "prediction" column.
Let's focus on the MEDV column (the original price) and the prediction one (the predicted price)
How good are our price predictions? Let's display the MEDV and prediction data as a scatter plot.
Not too bad, eh?
5. Analysis of the graph (and of the regression model)
So what does the graph tell us? Well, that our regression model isn't that bad! For an average real property price of $15k (x-axis), the predicted price is also around $15k (y-axis). For $10k, it is slightly above ($10-15k) but for a real MEDV of $5k the predicted price is again centred around the $5k mark.
"A property for $15k???" Don't dream guys, remember that the dataset dates from 1978... Those were the days...
This is quite remarkable: remember that the 156 points above are part of the test dataset (the 30% of the 506 rows that we randomly chose not to be part of the training), so basically the model has never "seen" their MEDV figures before... Their y-axis value is extrapolated solely from the 12 parameters, ranging from crime rate to percentage of lower-status population.
And to start with, the remaining 350 rows the model was trained on are a really small sample...
Of course there are some outliers. For instance, one of the points has a real median price (x-axis) of $3.54k but a predicted value (y-axis) of -$3.23k, which means that someone would actually have to pay you to buy there... But perhaps that is just simply what it would take? lol
Another outlier has a real price of $16.14k and an estimated value of $22.34k. I would definitely pay a visit to the area: could it be massively undervalued? Could there be opportunities for very good deals and massive profits?
6. Conclusion
Instead of the Boston dataset, we of course want to make predictions on real property prices in the UK, using features like location, type of property, number of bedrooms, etc., but also crime rates and proximity to a station, a supermarket, a school...
References:
(1) https://towardsdatascience.com/apache-spark-mllib-tutorial-ec6f1cb336a9