Big Data - How Big is it?

I lived in Texas for 14 years, so I consider myself (almost) a Texan. Big Data is simply called Data in Texas, since everything is assumed to be big there. All joking aside, you have probably heard the term Big Data over the past 5 years or so. Ever wondered why it is called Big Data? What is its significance? Why is it becoming popular now? Wonder no more. In this post I plan to explain the concept of Big Data. I will stay at the concept level, because it is important to understand the concept first; I will review the technology aspects of Big Data in subsequent posts.

Wikipedia defines Big Data as follows.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.

Big Data is not a new concept; it has evolved over a few decades. Traditional DBMSs (Database Management Systems) were built around structured data of fixed (however large) size. Let us use a company's payroll as an example. A database containing payroll data would have a record for each employee: Employee Name, Employee Serial Number, Employee Salary, Employee Social Security Number, Employee Address, and so on. As you can see, the data is organized into nicely structured, fixed-length records. This DBMS technology and these traditional data sets, also called structured data, served our needs well in the past. However, with newer technology and newer business requirements, a purely structured approach to storing and analyzing data is no longer fit for use. What happens when you need to accommodate both structured and unstructured data? Think of images, variable-length text, and videos as unstructured data. Big Data allows structured and unstructured data to be stored in the same place.
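To make the contrast concrete, here is a minimal sketch in Python. All the names (`PayrollRecord`, `lake`) are hypothetical illustrations, not any particular product's API: a fixed-schema payroll record and free-form, unstructured items live side by side in one store.

```python
from dataclasses import dataclass

@dataclass
class PayrollRecord:          # structured: fixed, named fields
    name: str
    serial_number: int
    salary: float
    address: str

lake = []                     # one store holding both kinds of data
lake.append(PayrollRecord("Jane Doe", 1001, 85000.0, "Austin, TX"))
lake.append({"type": "note", "text": "Free-form performance review text..."})
lake.append({"type": "image", "bytes": b"\x89PNG..."})  # opaque binary blob

# Analytics can still pick out the structured slice when needed:
structured = [x for x in lake if isinstance(x, PayrollRecord)]
print(len(lake), len(structured))  # 3 records total, 1 of them structured
```

The point of the sketch is simply that the store imposes no single schema: structured records keep their fields, while notes and images are kept as-is.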

Big Data was originally characterized by 3 Vs: Velocity, Volume, and Variety. Later, two additional dimensions were added: Digital Footprint and Machine Learning.

Velocity: For certain types of applications, both the creation and consumption of data have to happen in real time or near real time. During the holiday shopping season, online retailers need to run analytics in real time to act on shoppers' behavior. As shoppers navigate through the pages of a retailer's website, those visits need to turn into sales. Real-time or near-real-time analytics is therefore crucial to converting clicks into sales.

Volume: According to Harvard Business Review, as of 2012, approximately 2.5 exabytes (1 exabyte = 1 billion gigabytes) of data were created daily, a number that doubles approximately every 3 years. This volume of data is created by a variety of sources: smartphones, IoT devices, computers, digitization of records, and digitization of processes, to name a few. Walmart stores alone create 2.5 petabytes of data per hour from their sales transactions.
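Taking those two figures at face value (2.5 exabytes per day, doubling roughly every 3 years), a quick back-of-the-envelope projection looks like this. This is a sketch of the arithmetic, not a forecast:

```python
# Rough growth model: 2.5 exabytes/day as of 2012, doubling every 3 years.
DAILY_EB_2012 = 2.5     # exabytes created per day (HBR's 2012 estimate)
DOUBLING_YEARS = 3

def daily_volume_eb(years_after_2012):
    """Projected exabytes created per day, assuming steady doubling."""
    return DAILY_EB_2012 * 2 ** (years_after_2012 / DOUBLING_YEARS)

print(daily_volume_eb(0))   # 2.5
print(daily_volume_eb(6))   # 10.0  (two doublings in six years)
```

Even this crude exponential model shows why storage and processing tools keep falling behind: six years is enough to quadruple the daily volume.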

Variety: As we introduce different types of devices and integrate them with various systems, they generate a variety of data that did not exist just a few years ago. GPS signals from cellphones and connected cars are a type of data that was not pervasive a decade ago. Social networks such as Twitter and Facebook generate a tremendous variety of data.

Digital Footprint: Today's enterprises are actively creating and exposing their digital footprints to gain competitive advantage. Health records throughout the world are also becoming increasingly digital.

Machine Learning: Server farms throughout the world create unprecedented amounts of data through IT operations. Besides servers, many digital devices create logs. These logs are then analyzed to keep infrastructure and devices highly available.


The concept of a Data Lake was created to house various types of data sets: external, internal, structured, unstructured, and so on. The idea is to provide a mechanism to store data of various types so that analytics can be performed on it. The thinking behind Big Data and Data Lakes is not only about storing heterogeneous data types but, more importantly, about running analytics on them to find actionable insights. To create Data Lakes, enterprises need highly skilled Data Scientists, who are responsible for understanding the various data sources and combining them into effective and efficient Data Lakes. Across IBM we have numerous Data Scientists who provide expertise to both IBM and its clients. One of the reasons IBM acquired The Weather Company, besides other business reasons, was to bring that Data Science expertise into IBM.
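As a toy illustration of that idea (all record names and fields here are invented for the example, not an IBM or product API), a data lake can be pictured as one collection of tagged, heterogeneous records that a single analytic queries across:

```python
# A toy "data lake": heterogeneous records tagged with source and type.
lake = [
    {"source": "internal", "type": "structured",   "payload": {"sales": 120}},
    {"source": "external", "type": "unstructured", "payload": "tweet text..."},
    {"source": "internal", "type": "structured",   "payload": {"sales": 80}},
]

# One analytic run across the structured slice of the lake:
total_sales = sum(r["payload"]["sales"]
                  for r in lake
                  if r["type"] == "structured")
print(total_sales)  # 200
```

The value is not the storage itself but that external and internal, structured and unstructured data can all be reached by the same analytics.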

As new technology and concepts become the norm in the industry, one still has to think through the traditional aspects of IT. For example, as we create Data Lakes from various sources and enable them for various needs, we also need to guard against Data Spills: a security concern in which unauthorized users or hackers create a security exposure. Also, as we integrate data from external sources, we need to be mindful of the legal aspects of using data created by someone else.

Big Data is an emerging field, and it is creating exciting new business opportunities by housing various data types and running analytics on them to seek actionable insights.

The views expressed here are mine and not IBM's.
