Big Data

What is Big Data?

Big Data is a term used to describe collections of data that are huge in volume and growing exponentially with time. In short, such data is so large and complex that no traditional data management tool can store or process it efficiently; it typically exceeds terabytes in size. Because of the variety of data it encompasses, big data always brings a number of challenges relating to its volume and complexity. A recent survey estimates that 80% of the data created in the world is unstructured. The challenges fall broadly into two categories: storage and querying/analysis.

Why Big Data?

An influential 2013 report by the consulting firm McKinsey claimed that data science would be the number one catalyst for economic growth. McKinsey identified one of the new opportunities that contributed to the launch of the big data era: a growing torrent of data.

This refers to the idea that data arrives continuously and at a fast rate. Think about this: today you can buy a hard drive that could store all the music in the world for only $600. That is an amazing storage capability compared with any previous form of music storage.

Why store huge amounts of data?

Billions of users access the internet every day, so a huge amount of data has to be stored:

  • Businesses also collect huge amounts of data to improve their operations, for example by improving the customer experience or redefining their marketing strategy.
  • Streaming platforms such as Netflix, YouTube, and Amazon Prime Video use user data to recommend videos matched to each viewer's tastes.
  • Retailers such as Amazon, Flipkart, and Myntra use user data to recommend products according to each customer's preferences.

Why is Big Data Important?

  • Cost Savings
  • Time Reductions
  • Understanding Market Conditions
  • Controlling Online Reputation
  • Using Big Data Analytics to Boost Customer Acquisition and Retention
  • Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights
  • Big Data Analytics as a Driver of Innovation and Product Development

Google's solution to these storage and processing challenges: the Google File System (GFS), BigTable, and object-based storage.

  • Distributed multi-level map
  • Fault-tolerant, persistent
  • Scalability:
  1. Thousands of servers
  2. Terabytes of in-memory data
  3. Petabytes of disk-based data
  4. Millions of reads/writes per second, efficient scans

Infrastructure:

Building blocks:

  1. Google File System (GFS): Raw storage
  2. Scheduler: schedules jobs onto machines
  3. Lock service: Chubby, a distributed lock manager
  4. MapReduce: simplified large-scale data processing
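
As an illustration, the map and reduce phases can be sketched in a few lines of Python. This is a toy single-machine word count, not Google's implementation; the real system distributes exactly these two phases across thousands of machines:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: group the pairs by word and sum the counts."""
    counts = {}
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        counts[word] = sum(count for _, count in group)
    return counts

docs = ["Big Data is big", "data grows fast"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'fast': 1, 'grows': 1, 'is': 1}
```

Because the map step is independent per document and the reduce step is independent per word, both parallelize naturally across a cluster.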

How BigTable uses these building blocks:

  1. GFS: stores persistent data (in the SSTable file format)
  2. Scheduler: schedules the jobs involved in serving BigTable
  3. Lock service: master election, location bootstrapping
  4. MapReduce: often used to read/write BigTable data

BigTable also compresses stored data to reduce its size, and there are many opportunities for compression:

  1. Similar values in the same row/column at different timestamps
  2. Similar values in different columns
  3. Similar values across adjacent rows
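
To see why such similar values compress so well, here is a small Python sketch using zlib. The row values are made up for illustration, and zlib stands in for BigTable's actual compression schemes:

```python
import zlib

# Hypothetical BigTable-style rows: the same column at successive
# timestamps, where adjacent values differ only slightly.
rows = [f"com.example.www/index.html\tt={t}\tstatus=200\tbytes=51{t}"
        for t in range(100, 120)]
raw = "\n".join(rows).encode()

compressed = zlib.compress(raw)
print(len(raw), "->", len(compressed), "bytes")
```

With near-identical adjacent rows, the compressed form is a small fraction of the original size; BigTable's layout deliberately places similar values next to each other to exploit exactly this redundancy.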

How Amazon uses big data:

Amazon has thrived by adopting an “everything under one roof” model. However, when faced with such a huge range of options, customers can often feel overwhelmed. They effectively become data-rich, with tons of options, but insight-poor, with little idea about what would be the best purchasing decision for them.

To combat this, Amazon uses Big Data gathered from customers while they browse to build and fine-tune its recommendation engine. The more Amazon knows about you, the better it can predict what you want to buy. And, once the retailer knows what you might want, it can streamline the process of persuading you to buy it — for example, by recommending various products instead of making you search through the whole catalogue.
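
The core idea behind "customers who bought X also bought Y" can be sketched with simple co-occurrence counting. This is a hypothetical toy example; Amazon's actual item-to-item collaborative filtering is far more sophisticated:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (baskets of items bought together).
baskets = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "tripod"},
    {"laptop", "mouse"},
]

# Count how often each ordered pair of items shares a basket.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often bought together with `item`."""
    scores = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [other for other, _ in scores.most_common(k)]

print(recommend("camera"))
```

Even this crude signal surfaces sensible suggestions; at Amazon's scale the same counting is done over billions of events and combined with browsing and rating data.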

Amazon gathers data on every one of its customers while they use the site. As well as what you buy, the company monitors what you look at, your shipping address (Amazon can take a surprisingly good guess at your income level based on where you live), and whether you leave reviews/feedback.

Big data can be described by the following characteristics:

Volume: Storing this much data requires a lot of storage. Imagine that the largest storage device in the world holds 10 GB but your data is 20 GB: how will you fit it? According to popular storage vendors such as Dell EMC and IBM, building a huge amount of storage is not a big deal. But if we keep the data in one or a few very large storage systems, we face two more issues: cost and velocity. And if one of those huge storage systems gets corrupted, that is a major disaster for the company. These are only a few of the key challenges under big data; don't think they are the only ones.

Velocity: RAM is super fast, but hard disks and SSDs are comparatively much slower. So why not simply store all data in RAM? The problem lies in RAM's nature: it is ephemeral storage, so whatever a program holds in RAM vanishes as soon as the program closes. That means we cannot store data permanently in RAM, and we need solutions that can read and write persistent data very fast.

Variety: The type and nature of the data. Earlier technologies such as RDBMSs were capable of handling structured data efficiently and effectively. However, the shift from structured to semi-structured or unstructured data challenged the existing tools and technologies. Big data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured data (variety) generated at high speed (velocity) and huge in size (volume). Later, these tools and technologies were also applied to structured data, chiefly for storage; processing structured data remained optional, done either with big data tools or with traditional RDBMSs. This helps in analyzing data and making effective use of the hidden insights in data collected via social media, log files, sensors, etc. Big data draws on text, images, audio, and video, and it completes missing pieces through data fusion.
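
For example, semi-structured log records whose fields vary from record to record are awkward for a fixed relational schema but routine for schema-on-read processing. A small Python sketch with hypothetical JSON records:

```python
import json

# Hypothetical semi-structured log lines: each record carries only the
# fields relevant to its event, which a fixed RDBMS schema handles poorly.
logs = [
    '{"user": "a", "action": "click", "page": "/home"}',
    '{"user": "b", "action": "purchase", "item": "book", "price": 12.5}',
    '{"user": "a", "action": "click"}',
]

records = [json.loads(line) for line in logs]

# Aggregate without requiring every record to share the same columns.
clicks = sum(1 for r in records if r.get("action") == "click")
revenue = sum(r.get("price", 0) for r in records)
print(clicks, revenue)
```

The `.get(..., default)` pattern is what makes variety manageable here: missing fields are tolerated instead of breaking a rigid schema.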

Veracity: An extension of the big data definition that refers to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of analysis.

Variability : It refers to data whose value or other characteristics are shifting in relation to the context in which they are being generated.

Fields of Use of Big Data:

  • Banking and Securities.
  • Communication, Media and Entertainment.
  • Healthcare Providers.
  • Education.
  • Manufacturing and Natural Resources.
  • Government.
  • Insurance.
  • Retail and Wholesale Trade.
  • Transportation.
  • Energy and Utilities.

Big Data Companies:

Microsoft is a US-based software and programming company, founded in 1975 and headquartered in Washington. According to the Forbes list, it has a market capitalization of $507.5 billion and $85.27 billion in sales, and it currently employs around 114,000 people across the globe.

Microsoft's Big Data strategy is wide and growing fast. It includes a partnership with Hortonworks, a big data startup; this partnership provides the HDInsight tool for analyzing structured and unstructured data on the Hortonworks Data Platform (HDP).

Recently, Microsoft acquired Revolution Analytics, a big data analytics platform built around the "R" programming language and used for building big data apps that do not require the skills of a data scientist.

Facebook revealed some big, big stats on big data to a few reporters at its HQ today, including that its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It’s pulling in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half hour. Plus it gave the first details on its new “Project Prism”.

Every day, we feed Facebook's data beast with mounds of information. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates are posted. That is a lot of data.

At first, this information may not seem to mean very much. But with data like this, Facebook knows who our friends are, what we look like, where we are, what we are doing, our likes, our dislikes, and so much more. Some researchers even say Facebook has enough data to know us better than our therapists!

Google was founded in 1998 and is headquartered in California. It had a $101.8 billion market capitalization and $80.5 billion in sales as of May 2017. Around 61,000 employees currently work for Google across the globe.

Google provides integrated, end-to-end big data solutions based on innovation at Google, helping organizations capture, process, analyze, and transfer data on a single platform. Google keeps expanding its big data analytics offerings; BigQuery is its cloud-based analytics platform that analyzes huge data sets quickly.

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. In September 2007, the average MapReduce job ran across approximately 400 machines, and the jobs consumed approximately 11,000 machine-years in a single month.

International Business Machines (IBM) is an American company headquartered in New York. IBM is listed at number 43 on the Forbes list, with a market capitalization of $162.4 billion as of May 2017. The company operates in 170 countries and is among the largest employers in the industry, with around 414,400 employees.

IBM has sales of around $79.9 billion and a profit of $11.9 billion. As of 2017, IBM had generated the most patents of any business for 24 consecutive years.

IBM is the biggest vendor of Big Data-related products and services. IBM Big Data solutions provide features for storing, managing, and analyzing data.

Oracle offers fully integrated cloud applications and platform services, with more than 420,000 customers and 136,000 employees across 145 countries. It has a market capitalization of $182.2 billion and sales of $37.4 billion, as per the Forbes list.

Oracle is one of the biggest players in the big data area and is also well known for its flagship database. Oracle leverages the benefits of big data in the cloud and helps organizations define their data strategy and approach, including big data and cloud technology.

It provides a business solution that leverages Big Data Analytics, applications, and infrastructure to provide insight for logistics, fraud, etc. Oracle also provides Industry solutions which ensure that your organization takes advantage of Big Data opportunities.

Netflix: The entertainment streaming service has a wealth of data and analytics providing insight into the viewing habits of millions of international consumers.

Netflix uses this data to commission original programming content that appeals globally as well as purchasing the rights to films and series boxsets that they know will perform well with certain audiences.

For example, Adam Sandler has proven unpopular in the US and UK markets in recent years but Netflix green-lighted four new films with the actor in 2015, armed with the knowledge that his previous work had been successful in Latin America.

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Importance of Hadoop :

  • Ability to store and process huge amounts of any kind of data, quickly.

With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

  • Computing power.

Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

  • Fault tolerance.

Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

  • Flexibility.

Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

  • Low cost.

The open-source framework is free and uses commodity hardware to store large quantities of data.

  • Scalability.

You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
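
The fault-tolerance point rests on replication: Hadoop's file system keeps multiple copies of every data block on different nodes (three by default), so losing a node loses no data. A toy Python sketch of the idea, not Hadoop's actual block-placement policy:

```python
import random

# Toy HDFS-style replication: each block is copied to REPLICATION
# distinct nodes, so a single node failure never loses data.
REPLICATION = 3
nodes = {f"node{i}": set() for i in range(5)}

def store_block(block_id):
    """Place REPLICATION copies of a block on distinct nodes."""
    for name in random.sample(sorted(nodes), REPLICATION):
        nodes[name].add(block_id)

def readable(block_id):
    """A block is readable while at least one replica survives."""
    return any(block_id in blocks for blocks in nodes.values())

for b in range(10):
    store_block(b)

del nodes["node0"]  # simulate a node failure
print(all(readable(b) for b in range(10)))  # still True: replicas survive
```

Because the three replicas always live on distinct nodes, any single failure leaves at least two copies; the real system also re-replicates under-replicated blocks in the background.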

Conclusion:

The convergence of these trends means that we have the capabilities required to analyze astonishing data sets quickly and cost-effectively for the first time in history. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and profitability.

That's all. Thank you!






