Handling Missing Data In Spark Using Python
Often data sources are incomplete, which means we will have missing data. We have a few basic options for dealing with it: drop the rows that contain nulls, or fill the missing values in.
Here I am walking through an example in a Jupyter notebook.
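For illustration, we can build a small DataFrame in the notebook. The specific Ids, names, and numbers here are assumed, but the pattern of missing entries matches the data described below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('missingdata').getOrCreate()

# Assumed sample data: Id (string), Name (string), Sales (double),
# with nulls placed as described below
df = spark.createDataFrame(
    [('emp1', 'John', None),
     ('emp2', None, None),
     ('emp3', None, 345.0),
     ('emp4', 'Cindy', 456.0)],
    ['Id', 'Name', 'Sales'],
)
df.show()
```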
We can see in the above data we have three columns: Id, Name, and Sales. If we look at it on a row basis, we have one row where we are missing both the name and the sales value, then two rows that are missing either a name or a sales value, and the last row has all the information we need.
Drop Missing Data
We can pass a threshold value (the `thresh` parameter), which means it drops rows that have fewer than that number of non-null values.
For example, I can set two as my threshold and look at the result. This means that a row needs to have at least two non-null values in order to be kept. With the data above, the second row, which had only an Id, gets dropped.
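A minimal sketch with the sample DataFrame:

```python
# Keep only rows that have at least two non-null values;
# the row with just an Id is dropped
df.na.drop(thresh=2).show()
```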
In the call `df.na.drop().show()`, the parameter `how` is set to `'any'` by default, meaning a row is dropped if it contains any null value.
However, we can also specify `how='all'`, which means a row is only dropped if all of its values are null. In our case, there is not a single row where everything is null; all of them have at least the Id column. So nothing actually gets dropped.
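A quick sketch of both variants, using the sample data from above:

```python
# how='any' (the default): drop a row if any value is null
df.na.drop(how='any').show()

# how='all': drop a row only if every value is null;
# nothing is dropped here, since every row has at least an Id
df.na.drop(how='all').show()
```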
We can also play around with the `subset` parameter, which takes an optional list of column names to consider. So if we only care about missing data in a certain column, we can specify that with `subset`.
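For example, to consider only the Sales column when deciding which rows to drop:

```python
# Drop rows based only on nulls in the Sales column
df.na.drop(subset=['Sales']).show()
```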
It dropped the rows that had no value in the Sales column.
Fill the Missing Values
Spark is actually smart enough to match up data types when filling. If we look at the schema, we have a string, a string, and a double. If we pass a string value to `fill`, it fills only the columns with string data types, like the Name column, where we have two missing values. It will not change anything in the Sales column, because that column's data type is double.
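A minimal sketch; the fill value 'No Name' is just an illustrative choice:

```python
# A string fill value only touches the string columns (Name here)
df.na.fill('No Name').show()
```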
If I pass a numeric value to `fill`, for example zero, it will only replace the nulls in the Sales column with 0. It is actually smart enough not to fill in the zero for the string columns.
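And the numeric counterpart:

```python
# A numeric fill value only touches the numeric columns (Sales here)
df.na.fill(0).show()
```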
Usually, we will specify which columns we want to fill rather than just depending on Spark to be smart enough to figure it out for us. We pass in the value we want to fill with and use the `subset` parameter. It is recommended to always declare the subset, because it makes the intent explicit and we can be sure we know what we are doing.
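For example, filling only the Name column (again with an illustrative value):

```python
# Fill nulls only in the Name column
df.na.fill('No Name', subset=['Name']).show()
```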
Suppose we want to fill in any missing Sales values with the average sales value.
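One common way to do this is to compute the mean with a small aggregation and pass it to `fill`; a sketch:

```python
from pyspark.sql.functions import mean

# Compute the average of the Sales column; collect() returns a list of Rows
mean_val = df.select(mean(df['Sales'])).collect()
mean_sales = mean_val[0][0]  # extract the number from the Row

# Fill the missing Sales values with that average
df.na.fill(mean_sales, subset=['Sales']).show()
```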
That's really all we need to know. The decision of whether we should fill something or drop something has to be evaluated on a case-by-case basis. There are no general rules on when it is better to fill and when it is better to just drop; it really depends on what algorithm we are using and on the overall objective of the data analysis. So keep that in mind whenever we work with data that has missing values.
Thanks for reading.