Handling Missing Data In Spark Using Python
Often data sources are incomplete, which means we will have missing data. We have a few basic options for dealing with it: drop the rows that contain nulls, or fill the missing values in.
Here I am walking through an example in a Jupyter notebook.
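For illustration, we can build a small DataFrame in the notebook. The specific Ids, names, and numbers here are assumed, but the pattern of missing entries matches the data described below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('missingdata').getOrCreate()

# Assumed sample data: Id (string), Name (string), Sales (double),
# with nulls placed as described below
df = spark.createDataFrame(
    [('emp1', 'John', None),
     ('emp2', None, None),
     ('emp3', None, 345.0),
     ('emp4', 'Cindy', 456.0)],
    ['Id', 'Name', 'Sales'],
)
df.show()
```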
We can see in the above data we have three columns: Id, Name, and Sales. If we look at it on a row basis, we have one row where we are missing both the name and the sales value, then two rows that are missing either a name or a sales value, and the last row has all the information we need.
Drop Missing Data
We can pass a threshold value (the `thresh` parameter), which means it drops rows that have fewer than that number of non-null values.
For example, I can set two as my threshold and look at the result. This means that a row needs to have at least two non-null values in order to be kept. With the data above, the second row, which had only an Id, gets dropped.
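A minimal sketch with the sample DataFrame:

```python
# Keep only rows that have at least two non-null values;
# the row with just an Id is dropped
df.na.drop(thresh=2).show()
```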
In the call `df.na.drop().show()`, the parameter `how` is set to `'any'` by default, meaning a row is dropped if it contains any null value.
However, we can also specify `how='all'`, which means a row is only dropped if all of its values are null. In our case, there is not a single row where everything is null; all of them have at least the Id column. So nothing actually gets dropped.
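A quick sketch of both variants, using the sample data from above:

```python
# how='any' (the default): drop a row if any value is null
df.na.drop(how='any').show()

# how='all': drop a row only if every value is null;
# nothing is dropped here, since every row has at least an Id
df.na.drop(how='all').show()
```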
We can also play around with the `subset` parameter, which takes an optional list of column names to consider. So if we only care about missing data in a certain column, we can specify that with `subset`.
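For example, to consider only the Sales column when deciding which rows to drop:

```python
# Drop rows based only on nulls in the Sales column
df.na.drop(subset=['Sales']).show()
```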
It dropped the rows that had no value in the Sales column.
Fill the Missing Values
Spark is actually smart enough to match up data types when filling. If we look at the schema, we have a string, a string, and a double. If we pass a string value to `fill`, it fills only the columns with string data types, like the Name column, where we have two missing values. It will not change anything in the Sales column, because that column's data type is double.
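A minimal sketch; the fill value 'No Name' is just an illustrative choice:

```python
# A string fill value only touches the string columns (Name here)
df.na.fill('No Name').show()
```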
If I pass a numeric value to `fill`, for example zero, it will only replace the nulls in the Sales column with 0. It is actually smart enough not to fill in the zero for the string columns.
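And the numeric counterpart:

```python
# A numeric fill value only touches the numeric columns (Sales here)
df.na.fill(0).show()
```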
Usually, we will specify which columns we want to fill rather than just depending on Spark to be smart enough to figure it out for us. We pass in the value we want to fill with and use the `subset` parameter. It is recommended to always declare the subset, because it makes the intent explicit and we can be sure we know what we are doing.
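For example, filling only the Name column (again with an illustrative value):

```python
# Fill nulls only in the Name column
df.na.fill('No Name', subset=['Name']).show()
```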
Suppose we want to fill in any missing Sales values with the average sales value.
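One common way to do this is to compute the mean with a small aggregation and pass it to `fill`; a sketch:

```python
from pyspark.sql.functions import mean

# Compute the average of the Sales column; collect() returns a list of Rows
mean_val = df.select(mean(df['Sales'])).collect()
mean_sales = mean_val[0][0]  # extract the number from the Row

# Fill the missing Sales values with that average
df.na.fill(mean_sales, subset=['Sales']).show()
```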
That's really all we need to know. The decision of whether we should fill something or drop something has to be evaluated on a case-by-case basis. There are no general rules on when it is better to fill and when it is better to just drop; it really depends on what algorithm we are using and on the overall objective of the data analysis. So keep that in mind whenever we work with data that has missing values.
Thanks for reading.