Applications of Statistics in Data Engineering - I
Hello everybody! Most of us have studied Probability and Statistics as a course during our education, but we always wondered what applications it has in a real-life setting. We were rarely taught the course with applications in the picture; like other courses, we often treated it as a means to score marks! At a later stage in our careers, we realize that if we had focused on the applications of Probability and Statistics, we would have been at the forefront of the emerging job market. Better late than never!
I will cover the coding aspects of these applications in Spark and Java. A similar process can be followed in Python and Scala as well. I am going ahead with Java because there are ample tutorials already available for Python, Scala's style of coding is very similar to Java's, and Java is more familiar to the community. In this post, I will discuss the theory behind the applications of the 'Average' in Data Engineering. In the next post, I will demonstrate each of the averages with an application in Java and Spark.
Let us start with the basic topic of the 'Average' and see its applications. The average, as we commonly call it, is the sum of all the quantities in a distribution divided by the number of quantities. But there is more to it. There are 3 types of averages
- Mean
- Median
- Mode
Mean
Mean is the simplest form of average: the sum of all the quantities in a distribution divided by the number of quantities in it. The formula varies between grouped data (data with frequencies) and ungrouped data (a simple sequence of quantities).
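To make the grouped/ungrouped distinction concrete before the next post's Spark demos, here is a minimal plain-Java sketch (class and method names are illustrative, not from any library):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MeanDemo {
    // Ungrouped mean: sum of the values divided by their count
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Grouped mean: sum of (value * frequency) divided by the total frequency
    static double groupedMean(Map<Double, Integer> frequencies) {
        double weightedSum = 0;
        int totalFrequency = 0;
        for (Map.Entry<Double, Integer> e : frequencies.entrySet()) {
            weightedSum += e.getKey() * e.getValue();
            totalFrequency += e.getValue();
        }
        return weightedSum / totalFrequency;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[]{2, 4, 6})); // 4.0

        // The same data expressed as value -> frequency pairs
        Map<Double, Integer> freq = new LinkedHashMap<>();
        freq.put(2.0, 1);
        freq.put(4.0, 1);
        freq.put(6.0, 1);
        System.out.println(groupedMean(freq)); // 4.0
    }
}
```

Both methods agree when every frequency is 1, which is exactly why the two formulas are usually taught side by side.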
Median
Median is regarded as the central value of the distribution. To find the median, we arrange the data in ascending order, irrespective of the type of distribution, and select the middle value. If the number of values is even, the median is the mean of the two middle values.
Mode
Mode is the value that occurs most often in the distribution.
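The same kind of plain-Java sketch works for the other two averages (again, the class name is illustrative; ties in the mode are resolved by first occurrence here, which is one of several reasonable conventions):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class AverageDemo {
    // Median: middle value of the sorted data; with an even count,
    // the mean of the two middle values
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n % 2 == 1) return sorted[n / 2];
        return (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Mode: the value that occurs most often
    static double mode(double[] values) {
        Map<Double, Integer> counts = new HashMap<>();
        double best = values[0];
        int bestCount = 0;
        for (double v : values) {
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) { bestCount = c; best = v; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] data = {5, 1, 3, 3, 9};
        System.out.println(median(data)); // 3.0 (sorted: 1, 3, 3, 5, 9)
        System.out.println(mode(data));   // 3.0 (occurs twice)
    }
}
```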
Applying Mean, Median and Mode in Data Engineering
Now coming to the crux of these topics: what influence do these averages have in Data Engineering?
When we are working with Machine Learning and want to build models, we need to be sure about the quality of the data. No one wants to predict incorrect outcomes or make false claims. But, as always, we rarely get data that can be fed directly to the model. Two common problems are missing values and outliers. In this post, we'll focus on missing values; we'll cover outliers in the next post.
Missing Values
We encounter missing values in raw data plenty of times. When opening a CSV file, for instance, a few columns in a row may be blank, and there can be plenty of rows that look the same. Columns with missing values could be critical features that must be fed to the ML model; without data for those columns, we might not be able to train the model appropriately, and the predictions will go haywire. On the other hand, if we eliminate those rows, we might end up with a small dataset, which will result in underfitting. To avoid falling into such traps, we must deal with missing values with care.
There are 2 types of columns that we generally deal with - Numeric and Textual. Numeric columns can accept the averages directly, but for textual data we need to perform some transformation first and then proceed with applying the appropriate average.
Dealing with Numeric Data
Numeric data can be a quantity, price, aggregation, etc. When the values in a column are quantities, we can directly apply any of the averages to fill missing values. The data semantics here must be quantities - please note the stress on quantities: the attribute must describe a measurable feature of the data. Which average to apply depends on the requirement, and also on the ratio of missing values to the number of rows in the dataset. If the ratio is small, we can go ahead and use averages; if it is high, we must look for other measures, which we will cover when we pick up those topics from Statistics later.
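As a preview of the next post's demos, here is a minimal, dependency-free Java sketch of mean imputation on a numeric column (the class and method names are mine; Spark ML offers an `Imputer` transformer for the same strategy, which we'll look at there):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MeanImputation {
    // Replace each null in a numeric column with the mean of the non-null values
    static List<Double> fillWithMean(List<Double> column) {
        double sum = 0;
        int count = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        double mean = sum / count; // assumes at least one non-null value

        List<Double> filled = new ArrayList<>();
        for (Double v : column) {
            filled.add(v == null ? mean : v);
        }
        return filled;
    }

    public static void main(String[] args) {
        // A "price" column with two blanks; the mean of 10, 20, 30 is 20
        List<Double> prices = Arrays.asList(10.0, null, 20.0, 30.0, null);
        System.out.println(fillWithMean(prices)); // [10.0, 20.0, 20.0, 30.0, 20.0]
    }
}
```

Swapping the `mean` computation for a median gives the median-imputation variant, which is less sensitive to extreme values.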
Dealing with Categorical / Textual Data
Categorical data refers to a value which belongs to a set of fixed classes. Examples could be Gender (Male / Female), Purchase Made (Yes / No), Vehicle Type (SUV / LUV / HMV), etc.
When dealing with categorical data, we cannot apply the averages to it directly. First, the categorical values are converted to numeric values. There is no special scheme that must be followed for this conversion, but we do take special care here: it is suggested not to apply Mean or Median to columns with categorical data. Consider a scenario for each case. If we apply Mean to a categorical column, the semantics of the data change and we arrive at a wrong conclusion; for example, the mean of a numerically encoded Gender column will usually not fall into either category. We face a similar issue with Median: the order of the values carries no meaning, so we cannot meaningfully arrange them in ascending order in the first place, and the count of records for each gender will pull the median towards Male or Female. The ratio of missing values to the number of rows also plays an important role here - we might end up skewing the dataset. Mode, on the other hand, remains meaningful, since it simply picks the most frequent category.
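Here is the corresponding sketch for a categorical column, filling blanks with the mode; no numeric encoding is needed for this strategy, since the mode only counts occurrences (class and method names are again illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ModeImputation {
    // Replace each null in a categorical column with the most frequent value (the mode)
    static List<String> fillWithMode(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        String mode = null;
        int bestCount = 0;
        for (String v : column) {
            if (v == null) continue;
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) { bestCount = c; mode = v; }
        }

        List<String> filled = new ArrayList<>();
        for (String v : column) {
            filled.add(v == null ? mode : v);
        }
        return filled;
    }

    public static void main(String[] args) {
        // "Male" occurs twice, "Female" once, so blanks become "Male"
        List<String> gender = Arrays.asList("Male", null, "Female", "Male", null);
        System.out.println(fillWithMode(gender)); // [Male, Male, Female, Male, Male]
    }
}
```

Note the caveat from the paragraph above still applies: if the missing-value ratio is high, mode imputation will over-represent the majority class and skew the dataset.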
Textual data refers to free-form text such as blog comments, descriptions of a feature, or open-ended feedback. Unless we are dealing with sentiment analysis, Natural Language Processing, or other text-related processing, we don't use such columns in our models. We will discuss appropriate measures for dealing with textual data in later posts.
Conclusion
The appropriate kind of average can be applied to fill in missing values based on the following factors
- The ratio of number of rows which have the missing values for a column to the total number of rows in the dataset
- The semantics of the data - whether we are dealing with categorical variables or numeric quantities
- Default values, if available