How big is big data?
Big data analysis and handling

Hello everyone! This article is about how top MNCs and social media platforms like Facebook, Instagram, Netflix, and many more manage such huge amounts of user data. The data is so large that none of the traditional data management tools can store or process it efficiently. Let's look at some of the technologies used to manage big data.

What is Big Data?

Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered.

Examples of Big Data:

The New York Stock Exchange generates about one terabyte of new trade data per day.


Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.


A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
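To put that in perspective, here is a quick back-of-envelope calculation in Python. The per-flight figure comes from the statement above; the number of daily flights is an assumed, illustrative value, not a measured statistic.

```python
# Rough back-of-envelope estimate of daily aviation data (illustrative assumptions).
tb_per_flight = 10        # ~10+ TB per engine per 30 minutes of flight, as cited above
flights_per_day = 25_000  # assumed number of flights per day (hypothetical figure)

total_tb = tb_per_flight * flights_per_day
print(f"~{total_tb:,} TB/day  (~{total_tb / 1000:,.0f} PB/day)")
# -> ~250,000 TB/day, i.e. many petabytes, in line with the estimate above
```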


Characteristics of Big Data

The main characteristics are commonly referred to as the four Vs – Volume, Velocity, Variety, and Veracity. In the business world, these are the high-level dimensions that data analysts, scientists, and engineers use to break everything down.



VOLUME

We’re talking about datasets that stretch into the petabytes and exabytes here. These huge volumes require powerful processing technologies – much more powerful than a regular laptop or desktop processor. 

The world’s most popular social media platform, Facebook, now has more than 2.2 billion active users, many of whom spend hours each day posting updates, commenting on images, liking posts, clicking on ads, playing games, and doing a zillion other things that generate data that can be analyzed. This is high-volume big data in no uncertain terms.

VELOCITY

Huge volumes of data are pouring in from a variety of different sources, and they are doing so at great speed. The high velocity of data means that there will be more data available on any given day than the day before – but it also means that the velocity of data analysis needs to be just as high. Data professionals today don’t gather data over time and then carry out a single analysis at the end of the week, month, or quarter. Facebook messages, Twitter posts, credit card swipes, and e-commerce sales transactions are all examples of high-velocity data.
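To make the velocity point concrete, here is a minimal sketch in plain Python of processing events as they arrive and updating a running aggregate, instead of collecting everything and analyzing it once at the end of the period. The event stream is simulated with random, made-up transactions.

```python
import random
import time

def event_stream():
    """Simulated high-velocity stream: yields one made-up transaction per iteration."""
    while True:
        yield {"user": random.randint(1, 1000), "amount": random.uniform(1, 500)}

# Rolling aggregation: update totals as each event arrives, instead of
# collecting everything first and analyzing once at the end of the week.
count, total = 0, 0.0
start = time.time()
for event in event_stream():
    count += 1
    total += event["amount"]
    if count % 10_000 == 0:
        rate = count / (time.time() - start)
        print(f"{count} events, avg amount {total / count:.2f}, ~{rate:.0f} events/sec")
    if count >= 50_000:   # stop the demo after 50k simulated events
        break
```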

VERACITY

Veracity refers to the quality, accuracy and trustworthiness of data that’s collected. High veracity data is the truly valuable stuff that contributes in a meaningful way to overall results. And it needs to be high quality. If you’re analyzing Twitter data, for instance, it’s imperative that the data is extracted directly from the Twitter site itself (using the native API if possible) rather than from some third-party system which might not be trustworthy. Low veracity or bad data is estimated to cost US companies over $3.1 trillion a year due to the fact that bad decisions are made on the basis of it, as well as the amount of money spent scrubbing, cleansing and rehabilitating it.
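As a small illustration of a veracity check, the sketch below validates incoming records before analysis and drops the ones that cannot be trusted. The field names and validation rules are hypothetical, purely for illustration.

```python
# Minimal data-quality (veracity) check on incoming records.
# Field names and rules here are hypothetical, for illustration only.
records = [
    {"id": 1, "user": "alice", "followers": 120},
    {"id": 2, "user": "",      "followers": 45},   # missing user -> low veracity
    {"id": 3, "user": "bob",   "followers": -7},   # implausible value -> low veracity
    {"id": 4, "user": "carol", "followers": 980},
]

def is_trustworthy(rec):
    """Keep only records with a non-empty user and a plausible follower count."""
    return bool(rec.get("user")) and isinstance(rec.get("followers"), int) and rec["followers"] >= 0

clean = [r for r in records if is_trustworthy(r)]
print(f"kept {len(clean)} of {len(records)} records")   # kept 2 of 4 records
```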

VARIABILITY 

This refers to the inconsistency that data can show at times, which hampers the process of handling and managing it effectively.

What is Big Data analytics?

In sum, big data is data that is huge in size, collected from a variety of sources, pours in at high velocity, has high veracity, and contains big business value. Importantly, in order to extract this value, organizations must have the tools and technology investments in place to analyze the data and extract meaningful insights from it. Powerful data analytics makes processes and operations more efficient, and enables organizations to manage, discover, and utilize knowledge. So, get out there and start collecting data – but then make sure you invest in the technology and the people who can connect, analyze and extract value from it. Only this way will you realize the fifth V and keep your company competitive in the future.

Distributed storage and Hadoop

To handle the big data problem, Hadoop is used. Distributed storage – splitting the data into pieces and storing them across many networked systems – makes handling such a huge amount of data effective. Hadoop follows a master-slave topology in which one master node is connected to many slave (worker) nodes.
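The toy sketch below illustrates that idea of splitting a file into blocks and replicating each block across several worker nodes. It is only a conceptual illustration, not how Hadoop is implemented internally; the block size, node names, and replication factor are made-up values (real HDFS defaults to 128 MB blocks and a replication factor of 3).

```python
import itertools

BLOCK_SIZE = 4           # bytes per block in this toy example (HDFS default is 128 MB)
REPLICATION = 3          # each block is copied to this many nodes
data_nodes = ["node1", "node2", "node3", "node4"]   # the "slave"/worker machines

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (the last block may be smaller)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` different nodes, round-robin style."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for idx, _block in enumerate(blocks):
        placement[idx] = [next(node_cycle) for _ in range(replication)]
    return placement

file_bytes = b"hello big data world!"
blocks = split_into_blocks(file_bytes)
for idx, nodes in place_blocks(blocks, data_nodes).items():
    print(f"block {idx} ({blocks[idx]!r}) -> {nodes}")
```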


Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) allows applications to run across multiple servers. HDFS is highly fault tolerant, runs on low-cost hardware, and provides high-throughput access to data. Data in a Hadoop cluster is broken into smaller pieces called blocks and then distributed throughout the cluster. Blocks, and copies of blocks, are stored on other servers in the Hadoop cluster. That is, an individual file is stored as smaller blocks that are replicated across multiple servers in the cluster.
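On a real cluster you can observe this block layout yourself with Hadoop's command-line tools. Here is a minimal sketch, assuming Hadoop is installed on the machine; the file paths are hypothetical examples.

```python
import subprocess

# Copy a local file into HDFS, then ask the NameNode where its blocks and
# replicas are actually stored. The paths here are hypothetical examples.
subprocess.run(["hdfs", "dfs", "-put", "events.log", "/data/events.log"], check=True)
subprocess.run(["hdfs", "fsck", "/data/events.log", "-files", "-blocks", "-locations"], check=True)
```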

Each HDFS cluster has a number of DataNodes, with one DataNode for each node in the cluster. DataNodes manage the storage that is attached to the nodes on which they run. When a file is split into blocks, the blocks are stored in a set of DataNodes that are spread throughout the cluster. DataNodes are responsible for serving read and write requests from the clients on the file system, and also handle block creation, deletion, and replication.
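Client reads and writes go through this same machinery: the client first asks the master node which DataNodes hold (or should hold) the blocks, then streams data to or from those DataNodes. Below is a minimal sketch using the third-party hdfs (WebHDFS) Python package; the NameNode address, user name, and file path are hypothetical.

```python
from hdfs import InsecureClient   # third-party package: pip install hdfs

# Hypothetical NameNode address; 9870 is the default WebHDFS port in Hadoop 3.x.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write: the client obtains target DataNodes from the NameNode, then streams blocks to them.
with client.write("/data/hello.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("hello from the client\n")

# Read: the client fetches the block locations and reads directly from the DataNodes.
with client.read("/data/hello.txt", encoding="utf-8") as reader:
    print(reader.read())
```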


GOALS OF HDFS

  • Fault detection and recovery: HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
  • Hardware at data: A requested task can be done efficiently when the computation takes place near the data; see the sketch after this list.
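The last goal – moving computation to where the data already lives – is what MapReduce-style processing on Hadoop exploits. The toy word-count sketch below illustrates the idea: each node counts words in the blocks it stores locally, and only the small partial results travel over the network to be merged. The node names and block contents are made up.

```python
from collections import Counter

# Toy illustration of "move the computation to the data": each node counts
# words in the blocks it stores locally, and only the compact partial results
# cross the network to be merged. Node names and blocks are made up.
blocks_on_node = {
    "node1": ["big data is big", "hadoop stores blocks"],
    "node2": ["data moves less than code", "big clusters"],
}

def map_local(lines):
    """Runs on each worker node, close to its own blocks."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Only the small Counter objects travel over the network, not the raw text.
partial_results = [map_local(lines) for lines in blocks_on_node.values()]
total = sum(partial_results, Counter())
print(total.most_common(3))
```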

FINAL THOUGHTS!!

So, having gone through this information about big data and how it is managed by the Hadoop Distributed File System (HDFS), we can say that it is managed very effectively and efficiently, although handling big data does require a huge investment.

Thank you!!

