How the Problem of Big Data Is Managed by Big Industry

Big Data is one of the biggest challenges in today's data world.


What is Big Data

Data that is very large in size is called Big Data. Normally we work with data of MB size (Word documents, Excel sheets) or at most GB (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data.
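To get a feel for that scale, a quick back-of-the-envelope calculation helps. The file sizes below are rough illustrative assumptions, not measurements:

```python
# Rough sense of scale: how many typical files fit in one petabyte.
PB = 10**15            # 1 petabyte in bytes (decimal units)
word_doc = 2 * 10**6   # assumed ~2 MB Word document
hd_movie = 4 * 10**9   # assumed ~4 GB HD movie

print(PB // word_doc)  # 500_000_000 documents per petabyte
print(PB // hd_movie)  # 250_000 movies per petabyte
```

In other words, a single petabyte holds hundreds of millions of ordinary office files, which is why Big Data needs its own storage techniques.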

Sources of Big Data

This data comes from many sources, such as:

  • Social networking sites: Facebook, Google and LinkedIn generate huge amounts of data every day, as they have billions of users worldwide.
  • E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
  • Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
  • Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of millions of users.
  • Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

The 3 V's of Big Data

  1. Velocity: This is essentially the input/output (I/O) problem: the speed at which data can be written to and read from disk. The larger the data, the more this speed matters.
  2. Variety: Nowadays data is not stored only in rows and columns; it is both structured and unstructured. Log files and CCTV footage are unstructured data, while data that can be saved in tables, like a bank's transaction records, is structured data. Here we consider the nature and type of the data. Earlier we used RDBMSs, which could handle only structured data efficiently and effectively.
  3. Volume: The amount of data we deal with is very large, on the scale of petabytes.
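The velocity point can be made concrete with a rough calculation. Assuming a single commodity disk reads about 100 MB/s (an assumed figure), just scanning a petabyte serially takes months, while spreading the work across many disks brings it down to hours:

```python
# Velocity as an I/O problem: how long does it take just to read 1 PB?
# The 100 MB/s single-disk throughput is an assumed commodity figure.
PB = 10**15
disk_bytes_per_s = 100 * 10**6       # one disk: ~100 MB/s

one_disk_days = PB / disk_bytes_per_s / 86400
hundred_disks_hours = PB / (100 * disk_bytes_per_s) / 3600
print(round(one_disk_days, 1))       # about 115.7 days on a single disk
print(round(hundred_disks_hours, 1)) # about 27.8 hours across 100 disks in parallel
```

This is exactly why the solution discussed below spreads data over many machines instead of one big disk.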

Issues

Huge amounts of unstructured data need to be stored, processed and analyzed.

We all use various social media applications like Instagram, Facebook, Twitter, Netflix and many more.

Have you ever wondered where the images and videos we upload every day are stored? The amount is enormous: something around petabytes of data is uploaded every day. So how do these big companies manage such a huge amount of data?

FACEBOOK : Facebook generates 4 petabytes of data per day (a petabyte is a million gigabytes) and 2.5 billion pieces of content, and its Like button has been pressed 1.13 trillion times. 100 million hours of video are watched on Facebook every day. Every 60 seconds there are 317,000 status updates, 400 new users, 147,000 photos uploaded and 54,000 links shared. Facebook gets over 8 billion video views per day on average.

INSTAGRAM : The Instagram Explore page is viewed by 200 million accounts daily. More than 50 billion photos have been uploaded to Instagram so far. 95 million photos and videos are shared on Instagram per day. 300 million users use the Stories feature daily.

NETFLIX : Netflix has well over 100 million subscribers, and its users stream 97,222 hours of video every minute.

And there are many more such examples. How do these companies manage such huge amounts of data? This is where the concept of BIG DATA comes into play.

So, to solve the issues known as the 3 V's (Velocity, Variety and Volume), we use a concept called DISTRIBUTED STORAGE.

DISTRIBUTED STORAGE

Distributed storage is an infrastructure that splits a huge amount of data into small blocks and stores those blocks on independent physical servers, possibly spread across more than one data center.

This addresses the issues above: since we split the data into small blocks, the volume stored on each server is reduced, and since we store the blocks in parallel, we save time and ease the input/output bottleneck, i.e. velocity. The more independent servers there are, the less time is required to store the data. The stored data is also durable.
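A minimal sketch of this splitting-and-spreading idea is shown below. The tiny block size and the server names are illustrative assumptions; HDFS, for comparison, defaults to 128 MB blocks:

```python
# Minimal sketch of distributed storage: split a byte stream into
# fixed-size blocks and assign them round-robin to independent servers.
BLOCK_SIZE = 4  # deliberately tiny so the example is readable
servers = ["server-1", "server-2", "server-3"]

def split_into_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Cut data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def distribute(blocks, servers):
    """Assign block i to server i mod len(servers)."""
    placement = {name: [] for name in servers}
    for i, block in enumerate(blocks):
        placement[servers[i % len(servers)]].append(block)
    return placement

blocks = split_into_blocks(b"example payload!")
placement = distribute(blocks, servers)
print(placement)  # each server holds only part of the data
```

No single server holds the whole file, yet reading all servers in parallel recovers it, which is the core of the volume and velocity argument above.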

The topology we use is the MASTER/SLAVE MODEL. Master/slave is a model of asymmetric communication or control in which one device or process (the master) controls one or more other devices or processes (the slaves) and serves as their communication hub.

This whole group of machines is known as a cluster, so the setup is called a DISTRIBUTED STORAGE CLUSTER. To implement this concept we need software, and the software we use for distributed storage is HADOOP.

What is Hadoop

Hadoop is an open-source framework from Apache used to store, process and analyze data that is very huge in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It uses a master/slave architecture, consisting of a single NameNode performing the role of master and multiple DataNodes performing the role of slaves.

Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS is developed in Java, so any machine that supports Java can easily run the NameNode and DataNode software.
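As a rough sketch of how this master/slave split works, the toy model below keeps only metadata on the NameNode (which DataNodes hold which blocks) while the DataNodes hold the actual bytes. The class names, method names and file path are illustrative, not the real Hadoop APIs:

```python
# Toy model of the HDFS master/slave read path.
class NameNode:
    """Master: stores only metadata, never file contents."""
    def __init__(self):
        self.metadata = {}  # filename -> ordered list of (block_id, datanode)

    def register_file(self, filename, placements):
        self.metadata[filename] = placements

    def locate(self, filename):
        return self.metadata[filename]

class DataNode:
    """Slave: stores the actual block bytes."""
    def __init__(self):
        self.blocks = {}    # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

# Write path: DataNodes keep the data, the NameNode records where it lives.
nn, d1, d2 = NameNode(), DataNode(), DataNode()
d1.store("blk_1", b"Hello, ")
d2.store("blk_2", b"HDFS!")
nn.register_file("/user/demo.txt", [("blk_1", d1), ("blk_2", d2)])

# Read path: ask the NameNode for locations, then fetch from each DataNode.
content = b"".join(dn.read(bid) for bid, dn in nn.locate("/user/demo.txt"))
print(content)  # b'Hello, HDFS!'
```

The key design point this illustrates is that the master only answers "where is the data?", so it stays lightweight even when the DataNodes hold petabytes.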

Advantages of Hadoop

  • Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
  • Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
  • Cost-effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
  • Resilient to failure: HDFS replicates data over the network, so if one node goes down or some other network failure happens, Hadoop uses another copy of the data. Normally data is replicated three times, but the replication factor is configurable.
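The replication idea behind that resilience can be sketched as follows. The node names and the simple round-robin placement are illustrative assumptions, not HDFS's actual rack-aware placement policy:

```python
# Sketch of HDFS-style replication: each block is copied to
# `replication` distinct nodes, so losing one node loses no data.
import itertools

def place_replicas(block_ids, nodes, replication=3):
    """Place each block on `replication` consecutive nodes of the ring."""
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block in block_ids:
        start = next(ring)
        placement[block] = [nodes[(start + k) % len(nodes)]
                            for k in range(replication)]
    return placement

nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = place_replicas(["blk_1", "blk_2"], nodes)
print(placement)

# Even if node-1 fails, every block still has surviving replicas.
survivors = {b: [n for n in ns if n != "node-1"]
             for b, ns in placement.items()}
assert all(survivors.values())
```

With a replication factor of 3, any single-node failure leaves at least two live copies of every block, which is why the factor is configurable rather than fixed.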

Thank you!



Article by Manisha Soni