Imputing Null Values with Regression

Maurie Kathan

Published Jan 2, 2019

Recently my Data Science Immersive class participated in the Ames Housing Data Kaggle Competition. The prompt was to take a training dataset and create a model to predict sale prices for a test dataset. While doing exploratory data analysis I discovered a high number of null values in the Lot Frontage feature. There were nulls in other features, but those could generally be explained as the data collector leaving a field empty rather than entering a zero. That was easy to fix in the data, and I felt confident the fix wasn’t going to influence the model.

But Lot Frontage... the nulls in this feature bugged me. Out of 2051 houses, 330 had no value for lot frontage. What does lot frontage even mean, you ask? Our data description defined it as: “Linear feet of street connected to property.” Wikipedia defines it as “Frontage is the boundary between a plot of land or a building and the road onto which the plot or building fronts.”¹ Both of these definitions indicate that it should be impossible for a plot to have no lot frontage. All property must front at least a little bit of street. It may just be the four feet of your driveway, but there should be at least a little.

But there are a few theories that could explain it.

All the null plots are on an alley
All the null plots are actually condos

But if we look into both of these theories we discover that neither of them really explains the breadth of null properties.

If we look at the alley feature in the dataset, we discover that only 8 of our null lot frontage houses have alleys, so this theory can’t explain the prevalence.
When we look at the class of houses, we find that none of them are multi-residential and 131 of them are marked as ‘1-STORY 1946 & NEWER ALL STYLES’, so they cannot all be condos.
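
The two checks above can be sketched in pandas roughly as follows. This is a minimal illustration on a made-up five-row frame, not the actual competition data; the column names 'Alley' and 'MS SubClass' are assumptions based on the Ames data dictionary, and the counts it prints are for the toy rows only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df. The column names 'Alley' and 'MS SubClass'
# are assumed from the Ames data dictionary; the values are made up.
train_df = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, np.nan, 80.0, np.nan],
    "Alley": [np.nan, "Grvl", np.nan, np.nan, np.nan],
    "MS SubClass": [20, 20, 60, 120, 20],
})

# Keep only the rows where lot frontage is missing.
null_frontage = train_df[train_df["Lot Frontage"].isnull()]

# Theory 1: how many of those rows record any alley access?
n_with_alley = null_frontage["Alley"].notnull().sum()

# Theory 2: which dwelling classes do those rows fall into?
subclass_counts = null_frontage["MS SubClass"].value_counts()

print(n_with_alley)
print(subclass_counts)
```

If most of the null rows had an alley, or sat in a single condo-style class, one of the theories would hold up. In the real data neither did.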

Since neither of these theories is supported by this dataset, we need to assume that there is some data entry issue. If we want to use this feature as a predictor in our model, we need to figure out something sensible to fill into these nulls.

What I decided to do was create a linear model that would predict the lot frontage from two other features.

But I didn’t know which features I wanted to use. I felt like Lot Area (lot size in square feet), MS Zoning (identifies the general zoning classification of the sale), Lot Shape (general shape of property), Lot Config (lot configuration), and Neighborhood were all potentially good predictors. So I created a Lasso model comparing all of these factors and lot frontage.

To create my model I first created a dataframe with only the houses with lot frontage (excluding those with null values).

not_null_frontage = train_df[train_df['Lot Frontage'].notnull()]

All of the variables except Lot Area are categorical, so I needed to dummy those in order to model with them.

frontage_math = not_null_frontage[['Lot Frontage','Lot Area','MS Zoning','Lot Shape','Lot Config','Neighborhood']]
frontage_math = pd.get_dummies(frontage_math,drop_first=True)

I then built my Lasso model and scored it.

Variables             | Coefficient 
Lot Area              | 11.720041
Lot Config_CulDSac    | -7.334338
Lot Config_Inside     | 4.910429
Neighborhood_NridgHt  | 2.925914
Lot Config_FR2        | -2.865836
MS Zoning_RL          | 2.278753
Neighborhood_BrDale   | -2.249133

When I reviewed the results of the Lasso model I found that the best predictor of lot frontage was lot area.
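
For readers who want to try a similar comparison, here is a minimal sketch of fitting a Lasso and ranking its coefficients. It runs on synthetic data rather than the Ames set, and both the scaling step and the use of LassoCV to pick the penalty are assumptions on my part, since the exact setup isn't shown above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data in which frontage is driven mostly by lot area.
rng = np.random.default_rng(0)
n = 200
lot_area = rng.uniform(5000, 15000, n)
lot_config = rng.choice(["Inside", "Corner", "CulDSac"], n)
frontage = (20 + 0.005 * lot_area
            - 10 * (lot_config == "CulDSac")
            + rng.normal(0, 3, n))
toy = pd.DataFrame({
    "Lot Frontage": frontage,
    "Lot Area": lot_area,
    "Lot Config": lot_config,
})

# Dummy the categorical column, then split predictors from the target.
toy = pd.get_dummies(toy, drop_first=True, dtype=float)
X = toy.drop(columns="Lot Frontage")
y = toy["Lot Frontage"]

# Scale so coefficient sizes are comparable (an assumed step).
X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_scaled, y)

# Rank the features by absolute coefficient size.
coefs = pd.Series(lasso.coef_, index=X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
print(lasso.score(X_scaled, y))
```

Because the predictors are standardized, the feature at the top of the ranking is the one the model leans on most, which is how a table like the one above can be read.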

I decided to create my imputing model from lot area and neighborhood, as I noticed that neighborhood was a decently good predictor and I also figured that lot area would be similar neighborhood to neighborhood. To check whether I could make a linear model with these two features, I looked at the relationship between lot frontage and lot area by neighborhood.

We notice that there is a linear relationship between the two so modeling is possible.
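
One way to put a number on that relationship, alongside a plot, is a per-neighborhood correlation. The sketch below uses synthetic data; the neighborhood labels are only example values and the slopes are made up.

```python
import numpy as np
import pandas as pd

# Synthetic data: two example neighborhoods, each with its own slope.
rng = np.random.default_rng(1)
frames = []
for name, slope in [("NAmes", 0.006), ("NridgHt", 0.008)]:
    area = rng.uniform(6000, 14000, 50)
    frames.append(pd.DataFrame({
        "Neighborhood": name,
        "Lot Area": area,
        "Lot Frontage": 10 + slope * area + rng.normal(0, 4, 50),
    }))
toy = pd.concat(frames, ignore_index=True)

# Pearson correlation between area and frontage within each neighborhood.
corr_by_hood = (
    toy.groupby("Neighborhood")[["Lot Area", "Lot Frontage"]]
       .apply(lambda g: g["Lot Area"].corr(g["Lot Frontage"]))
)
print(corr_by_hood)
```

Correlations close to 1 within each group suggest a straight-line fit is a reasonable starting point. Values near 0 would argue against it.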

I then created a linear regression model from neighborhood² and lot area. To use this model to impute the nulls I needed to create a function and apply it to all of the rows.

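# pipe_lr is a pipeline with a StandardScaler and a LinearRegression.
# features is the list of predictor columns the pipeline was fit on.
def impute_lot_frontage(row):
    # Only predict where lot frontage is missing; keep recorded values as-is.
    if pd.isnull(row['Lot Frontage']):
        # np.exp reverses a log transform, which presumes pipe_lr was fit on log(Lot Frontage).
        return np.exp(pipe_lr.predict([row[features]]))[0]
    else:
        return row['Lot Frontage']

train_df['Lot Frontage'] = train_df.apply(impute_lot_frontage, axis=1)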
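
To make the moving parts concrete, here is a minimal end-to-end sketch on a six-row toy frame. It assumes the pipeline is fit on the log of lot frontage (which is what the np.exp call suggests), uses made-up values, and fills the nulls in one vectorized step instead of a row-wise apply. It is an illustration under those assumptions, not the exact pipeline used in the competition.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows; two of them are missing lot frontage.
toy = pd.DataFrame({
    "Lot Area": [8000.0, 9500.0, 12000.0, 7000.0, 10000.0, 11000.0],
    "Neighborhood": ["NAmes", "NAmes", "NridgHt", "NAmes", "NridgHt", "NridgHt"],
    "Lot Frontage": [60.0, 70.0, 95.0, np.nan, 80.0, np.nan],
})

# Predictors: lot area plus neighborhood dummies.
X = pd.get_dummies(toy[["Lot Area", "Neighborhood"]], drop_first=True, dtype=float)
features = list(X.columns)
known = toy["Lot Frontage"].notnull()

# Fit on the rows with a recorded frontage, using a log target (assumed).
pipe_lr = make_pipeline(StandardScaler(), LinearRegression())
pipe_lr.fit(X.loc[known, features], np.log(toy.loc[known, "Lot Frontage"]))

# Predict every missing row at once and undo the log transform.
toy.loc[~known, "Lot Frontage"] = np.exp(pipe_lr.predict(X.loc[~known, features]))
print(toy)
```

A side effect of the log target is that the back-transformed predictions are always positive, which suits a length measurement.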

As a note, I recognize that lot frontage is not that important a predictor of sale price, but I thought it was an interesting feature for practicing how to impute data.

¹ https://en.wikipedia.org/wiki/Frontage
² For those considering doing this with this dataset, note that this model requires the neighborhood columns to be in the same order in the train and test data. You need to do a little bit of data munging to make this work.

#filling in missing columns in case the neighborhoods were different
#code inspired from Kate Dowdy
missing_cols = set(test_df.columns) - set(train_df.columns)
# for columns only in test, setting this to 0 for train
for c in missing_cols:
    train_df[c] = 0
missing_cols = set(train_df.columns) - set(test_df.columns)
# for columns only in train, setting this to 0 for test
for c in missing_cols:
    test_df[c] = 0

#sorting so the columns are in the same order for regression
train_df = train_df.reindex(sorted(train_df.columns), axis=1)
test_df = test_df.reindex(sorted(test_df.columns), axis=1)
