Predicting Property Prices: a simple Machine Learning Linear Regression Model on Databricks and SparkML
In this article, I will show how to predict property prices using Databricks. I will pick up on a previous experiment (1) that used Spark proper and a dataset well known to Data Scientists, the famous Boston Housing dataset (2).
1. The source dataset
First not-so-good surprise: the dataset is not a CSV file, but a webpage formatted for human reading, not computer processing!
The property price is in the last column, MEDV. We are going to try to predict that column using the preceding 12 parameters (CRIM to LSTAT).
Had it been a big file, the first task would have been to clean up the data programmatically. Fortunately, it has only 506 lines of data, and I find it quicker to reformat it manually using Notepad++. The result is as follows, much more usable.
Excessive line feeds have been removed, as have redundant spaces; the fields have been separated by tabs, and a header has been provided, replacing the original explanation text block. This file is saved as boston_housing_dataset.csv and is now ready to be used. I am uploading it from my laptop to the Azure data lake using Azure Storage Explorer, into the lab-mec folder under the raw blob container.
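If you would rather script that cleanup than edit by hand, a minimal sketch could look like the following. It assumes each record in the raw page is split across two physical lines (as in the original file); the function name and the tiny inline sample are my own illustration, not part of the original workflow.

```python
def reformat_boston(raw_text: str) -> str:
    """Join each pair of physical lines into one record and
    collapse runs of whitespace into single tabs."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    records = []
    # Assumption: every record is split across two consecutive lines
    # in the raw file, so rejoin them pairwise.
    for i in range(0, len(lines), 2):
        record = " ".join(lines[i:i + 2])
        records.append("\t".join(record.split()))
    return "\n".join(records)

# Tiny synthetic example in the same two-lines-per-record layout:
raw = "0.00632  18.00   2.310\n396.90   4.98  24.00\n"
print(reformat_boston(raw))
# → 0.00632	18.00	2.310	396.90	4.98	24.00
```

A header line would still need to be prepended before saving the result as boston_housing_dataset.csv.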
Now that we have our source data ready and loaded, it's time to switch to Databricks.
First, let's mount the datalake container "raw" into "/mnt/raw" in dbfs.
Let's confirm that our boston_housing_dataset.csv file can be found in DBFS in the /mnt/raw/lab-mec folder where we put it.
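The mount-and-check steps might look like the sketch below. It only runs inside a Databricks notebook (it relies on dbutils and display), and the storage account name, tenant id, application id, and secret scope/key are all placeholders you would replace with your own values.

```python
# Placeholder credentials: replace <storage-account>, <application-id>,
# <tenant-id>, <scope> and <key> with your own values.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<scope>", "<key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the "raw" container into /mnt/raw in DBFS.
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

# Confirm the uploaded file is visible.
display(dbutils.fs.ls("/mnt/raw/lab-mec"))
```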
All good so far.
2. Preparing the data for the ML model
Let's now load the file in a dataframe "data". We ask Databricks to use tab as a delimiter, to consider the first line as a header, and to infer the schema.
Let's display the data in this dataframe "data":
All the fields are there, as we expect them.
We now need to divide the columns into 2 groups, Features and Labels. Features is the data that the prediction will be based on, Labels is the result of the prediction.
In our example, the features are the columns from CRIM to LSTAT, and the label is the last column MEDV, which contains the property price.
What we are trying to achieve is to predict the Labels from the Features.
To create a features array, we use the VectorAssembler class.
First, we import it.
Then we pass it the list of column names. Here we pass all the columns except the last one. The output column "features" is the result of assembling the feature columns.
Now we put VectorAssembler into action using the "transform" method, and load the result into the dataframe data_2
If we display data_2, we can see that an extra column "features" was added to the dataset, combining the values of the first 12 columns.
Now we are going to split the rows of the dataset into 2 subsets, "train" and "test", with a proportion of respectively 70% and 30%. The train subset will be used to train the model, the test one to predict the labels from the features.
3. The training of the model
It's now time to train our model on our data.
First we import the LinearRegression class
Then we define the variable "lr", a LinearRegression estimator configured with the feature column "features" and the label column "MEDV".
We now call the fit method on lr, passing it the train dataset, to obtain the trained model.
We can now evaluate the model's performance on the test dataset using the evaluate method.
The evaluation summary's properties give us access to a number of metrics.
4. House price predictions
Time now to make our predictions. Let's call the transform method on the test dataset.
prediction is a dataframe containing the original columns, the features column, and a new "prediction" column.
Let's focus on the MEDV column (the original price) and the prediction one (the predicted price)
How good are our price predictions? Let's display the MEDV and prediction data as a scatter plot.
Not too bad, eh?
5. Analysis of the graph (and of the regression model)
So what does the graph tell us? Well, that our regression model isn't that bad! For an average real property price of $15k (x-axis), the predicted price is also around $15k (y-axis). For $10k, it is slightly above ($10-15k) but for a real MEDV of $5k the predicted price is again centred around the $5k mark.
"A property for $15k???" Don't dream guys, remember that the dataset dates from 1978... Those were the days...
This is quite remarkable: remember that the 156 points above are part of the test dataset (the 30% of the 506 rows that we randomly chose not to be part of the training), so basically the model has never "seen" their MEDV figures before... Their y-axis value is extrapolated solely from the 12 parameters, ranging from crime rate to percentage of lower-status population.
And to start with, the remaining 350 rows the model was trained on are a really small sample...
Of course there are some outliers. For instance, one of the points has a real median price (x-axis) of $3.54k but a predicted value (y-axis) of -$3.23k, which means that someone would actually have to pay you to buy there... But perhaps that is just simply what it would take? lol
Another outlier has a real price of $16.14k and an estimated value of $22.34k. I would definitely pay a visit to the area: could it be massively undervalued? Could there be opportunities for very good deals and massive profits?
6. Conclusion
Instead of the Boston dataset, we of course want to make predictions on real property prices in the UK, using features like location, type of property, number of bedrooms, etc., but also crime rates and proximity to a station, a supermarket, a school...
References:
(1) https://towardsdatascience.com/apache-spark-mllib-tutorial-ec6f1cb336a9