Is missing data a pain in big data analytics?
Big data is happening now, and it has taken a strong position across industries and business units. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process with low latency. Such data has one or more of the following characteristics: high volume, high velocity or high variety. The data comes from sensors, devices, video/audio, networks, log files, the web, social media and more, much of it generated in real time and at very large scale. Analyzing such big data allows analysts, researchers and business users to gain insight into relationships within the data and helps them make better and faster decisions. Using advanced analytics techniques such as text analysis, machine learning, predictive analysis, data mining, statistics and natural language processing, industries are able to hone their decision-making capability. In addition, analysis of big data also brings new products and services that address customer needs and satisfaction.
With all the immense value that big data can bring to industries, businesses, customers and people, there is an excruciating pain, missing data, that needs to be addressed. Missing data is a serious threat to data quality, as it can greatly affect the certainty of what is presented to the end user. But how should we deal with these missing values: ignore them or treat them? The answer depends on the percentage of missing values in the dataset, which variables are affected, whether those values belong to dependent or independent variables, and so on. Missing value treatment is therefore important, since the data insights, and the performance of products or services built on them, can suffer if missing values are not appropriately handled.
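Before deciding whether to ignore or treat missing values, the first step is usually to measure how much of each variable is missing. A minimal sketch with pandas, using a small made-up table (the column names and values are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps (illustrative data)
df = pd.DataFrame({
    "temperature": [21.5, np.nan, 23.1, 22.0, np.nan, 24.3],
    "humidity":    [40.0, 41.5, np.nan, 39.8, 42.1, 40.7],
    "device_id":   ["a", "b", "c", "d", "e", "f"],
})

# Percentage of missing values per variable
missing_pct = df.isna().mean() * 100
print(missing_pct)
```

A variable with a few percent missing may be safe to impute, while one that is mostly empty may be better dropped; the per-column percentages computed here are what that judgment rests on.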
There are several methods available to handle missing data. Deletion, either list-wise or pair-wise, is the simplest method, but both variants suffer from loss of information. Imputation methods, such as averaging techniques and predictive techniques, are often a wiser choice. Mean, median and mode are the popular averaging techniques used to infer missing values; approaches range from a global average for the variable to averages computed within groups. Though averaging gives a quick estimate of missing values, it artificially reduces the variation in the dataset, since all imputed observations share the same value. This can distort the statistical analysis of the dataset: depending on the percentage of observations imputed, metrics such as the mean, median and correlation may be affected.
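The variance-shrinking effect of mean imputation is easy to demonstrate. A small sketch with pandas on made-up numbers, comparing list-wise deletion against filling gaps with the column mean:

```python
import numpy as np
import pandas as pd

# Illustrative series with two missing observations
s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])

# List-wise deletion: simply drop the missing observations
dropped = s.dropna()

# Mean imputation: fill every gap with the same value, the column mean
imputed = s.fillna(s.mean())

# The mean is preserved, but the variance shrinks, because the
# imputed observations add no spread around the mean
print(dropped.var(), imputed.var())
```

Both approaches leave the mean at 11.5 here, but the imputed series has a smaller variance than the deleted one, which is exactly the distortion of downstream metrics the paragraph above warns about.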
Imputation with predictive techniques assumes that the observations are not missing at random and that the variables chosen to impute them have some relationship with the missing variable; otherwise it can yield imprecise estimates. Various methods are used to impute such missing observations: regression techniques such as auto-regressive (AR), auto-regressive moving average (ARMA) or auto-regressive integrated moving average (ARIMA) models, machine learning methods like neural networks, genetic programming and support vector machines, and data mining methods. Of all these, predictive techniques are highly desirable, since they can lead to better insights and an overall increase in the performance of the products and services rendered with big data analytics.
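As a minimal illustration of predictive imputation, the sketch below fits a simple linear regression on a related, fully observed variable and predicts the missing values from it. This is far simpler than the ARMA/ARIMA or machine learning methods named above, and the data is made up, but it shows the core idea: exploit the relationship between variables rather than a flat average.

```python
import numpy as np

# x is fully observed; y depends roughly linearly on x (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, np.nan, 8.1, np.nan, 12.2])

observed = ~np.isnan(y)

# Fit a degree-1 polynomial (linear regression) on the observed pairs
slope, intercept = np.polyfit(x[observed], y[observed], 1)

# Predict the missing y values from the fitted line
y_filled = y.copy()
y_filled[~observed] = slope * x[~observed] + intercept
print(y_filled)
```

Unlike mean imputation, each filled value differs according to its predictor, so the variation in the data is better preserved, provided the assumed relationship between the variables actually holds.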
Thus the overall analysis can be flawed if missing observations are not properly handled. The reliability of the data in big data analytics is vital in order to achieve exceptional results. Industries and business units that have started to use big data in their products and services for customers must treat missing data attentively to achieve the best performance.