Is missing data a pain in big data analytics?
Big data is happening now, and it has taken a strong position across industries and business units. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process with low latency. Such data has one or more of the following characteristics: high volume, high velocity or high variety. The data comes from sensors, devices, video/audio, networks, log files, the web, social media and more, much of it generated in real time and at very large scale. Analyzing such big data allows analysts, researchers and business users to gain insight into relationships within the data and helps them make better and faster decisions. Using advanced analytics techniques such as text analysis, machine learning, predictive analysis, data mining, statistics and natural language processing, industries are able to hone their decision-making capability. In addition, analysis of big data also brings new products and services that address customer needs and satisfaction.
With all the immense value that big data can bring to industries, businesses, customers and people, there is an excruciating pain, missing data, that needs to be addressed. Missing data is a serious threat to data quality, as it can greatly affect the certainty of what is presented to the end user. But how should we deal with these missing values: ignore them or treat them? The answer depends on the percentage of missing values in the dataset, which variables are affected, whether those values belong to dependent or independent variables, and so on. Missing value treatment is therefore important, since the data insights, and the performance of products or services built on them, can suffer if missing values are not appropriately handled.
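Before deciding whether to ignore or treat missing values, the first step is usually to measure how much of each variable is missing. A minimal sketch with pandas, using a small made-up table (the column names and values are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps (illustrative data)
df = pd.DataFrame({
    "temperature": [21.5, np.nan, 23.1, 22.0, np.nan, 24.3],
    "humidity":    [40.0, 41.5, np.nan, 39.8, 42.1, 40.7],
    "device_id":   ["a", "b", "c", "d", "e", "f"],
})

# Percentage of missing values per variable
missing_pct = df.isna().mean() * 100
print(missing_pct)
```

A variable with a few percent missing may be safe to impute, while one that is mostly empty may be better dropped; the per-column percentages computed here are what that judgment rests on.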
There are several methods available to handle missing data. Deletion, either list-wise or pair-wise, is the simplest method, but both variants suffer from loss of information. Imputation methods, such as averaging techniques and predictive techniques, are often a wiser choice. Mean, median and mode are the popular averaging techniques used to infer missing values; approaches range from a global average for the variable to averages computed within groups. Though averaging gives a quick estimate of missing values, it artificially reduces the variation in the dataset, since all imputed observations share the same value. This can distort the statistical analysis of the dataset: depending on the percentage of observations imputed, metrics such as the mean, median and correlation may be affected.
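The variance-shrinking effect of mean imputation is easy to demonstrate. A small sketch with pandas on made-up numbers, comparing list-wise deletion against filling gaps with the column mean:

```python
import numpy as np
import pandas as pd

# Illustrative series with two missing observations
s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])

# List-wise deletion: simply drop the missing observations
dropped = s.dropna()

# Mean imputation: fill every gap with the same value, the column mean
imputed = s.fillna(s.mean())

# The mean is preserved, but the variance shrinks, because the
# imputed observations add no spread around the mean
print(dropped.var(), imputed.var())
```

Both approaches leave the mean at 11.5 here, but the imputed series has a smaller variance than the deleted one, which is exactly the distortion of downstream metrics the paragraph above warns about.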
Imputation with predictive techniques assumes that the observations are not missing at random and that the variables chosen to impute them have some relationship with the missing variable; otherwise it can yield imprecise estimates. Various methods are used to impute such missing observations: regression techniques such as auto-regressive (AR), auto-regressive moving average (ARMA) or auto-regressive integrated moving average (ARIMA) models, machine learning methods like neural networks, genetic programming and support vector machines, and data mining methods. Of all these, predictive techniques are highly desirable, since they can lead to better insights and an overall increase in the performance of the products and services rendered with big data analytics.
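As a minimal illustration of predictive imputation, the sketch below fits a simple linear regression on a related, fully observed variable and predicts the missing values from it. This is far simpler than the ARMA/ARIMA or machine learning methods named above, and the data is made up, but it shows the core idea: exploit the relationship between variables rather than a flat average.

```python
import numpy as np

# x is fully observed; y depends roughly linearly on x (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, np.nan, 8.1, np.nan, 12.2])

observed = ~np.isnan(y)

# Fit a degree-1 polynomial (linear regression) on the observed pairs
slope, intercept = np.polyfit(x[observed], y[observed], 1)

# Predict the missing y values from the fitted line
y_filled = y.copy()
y_filled[~observed] = slope * x[~observed] + intercept
print(y_filled)
```

Unlike mean imputation, each filled value differs according to its predictor, so the variation in the data is better preserved, provided the assumed relationship between the variables actually holds.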
Thus the overall analysis can be flawed if missing observations are not properly handled. The reliability of the data in big data analytics is vital in order to achieve exceptional results. Industries and business units that have started to use big data in their products and services for customers must treat missing data attentively to achieve the best performance.