INTRODUCTION TO BIGDATA

Rithwik Reddy

Published Sep 17, 2020

BIGDATA

Big data refers to the huge amounts of data that a company needs to store on its servers for future reference.

Social media corporations such as Facebook need to store all their user data, such as images and videos, on their servers so that users and their friends can access it.

STORAGE

There are two types of storage hardware: commodity storage and enterprise storage.

1. Commodity storage

Commodity storage is hardware that is relatively inexpensive, widely available, and more or less interchangeable with other hardware of its type. To be interchangeable, commodity hardware is usually broadly compatible and can function on a plug-and-play basis with other commodity hardware products. In this context, a commodity item is a low-end but functional product without distinctive features. A commodity computer, for example, is a standard-issue PC that has no outstanding features and is widely available for purchase.

2. Enterprise storage

Enterprise storage is a centralized repository for business information that provides common data management, protection, and data sharing functions through connections to computer systems. It is typically deployed as a single unit with a very large storage capacity.

CHALLENGES FACED IN BIGDATA

One of the challenges in big data is the rate at which data arrives: a corporation such as Facebook takes in about 600 TB of data per day. Storing it would require massive enterprise storage devices, which are very expensive and have drawbacks such as limited volume and slow I/O speed.

To overcome the problems of volume and I/O speed, we can use distributed storage.
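To get a feel for the scale, here is a small back-of-the-envelope calculation in Python. It only uses the 600 TB per day figure quoted above; the one-year horizon, the decimal definition of a terabyte, and the function name are assumptions made for illustration.

```python
TB = 10**12  # assumption: decimal terabyte, in bytes
SECONDS_PER_DAY = 86_400


def daily_ingest_summary(daily_tb=600, days=365):
    """Return (total petabytes stored, sustained write rate in GB/s)."""
    total_bytes = daily_tb * TB * days
    total_pb = total_bytes / 10**15
    # Average rate needed just to keep up with the incoming data.
    rate_gb_per_s = daily_tb * TB / SECONDS_PER_DAY / 10**9
    return total_pb, rate_gb_per_s


total_pb, rate = daily_ingest_summary()
print(f"{total_pb:.0f} PB per year, {rate:.2f} GB/s sustained")
# prints "219 PB per year, 6.94 GB/s sustained"
```

Under these assumptions, a single storage device would have to absorb almost 7 GB every second without pause, which is why both volume and I/O speed become a problem.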

DISTRIBUTED STORAGE

Instead of buying enterprise storage, which is expensive and has limited volume and slow I/O speed, we can build a cluster with a master-slave topology out of cheap commodity storage.

ADVANTAGES OF DISTRIBUTED STORAGE

Setting up distributed storage is much cheaper than buying enterprise storage, because the cluster is built from commodity storage.

I/O speed depends on the number of nodes in the cluster: the more nodes there are, the faster the I/O, because reads and writes are spread across them.

If we run out of storage, we can add another slave node to the cluster to increase the volume.
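The last two advantages can be sketched with a deliberately idealized model. The per-node figures used below (4 TB of disk, 100 MB/s of throughput) are made-up assumptions, and the model ignores network limits, replication, and coordination overhead, so treat the numbers as an upper bound rather than a prediction.

```python
import math


def cluster_capacity_tb(nodes, disk_tb_per_node):
    """Total raw capacity: every node adds its own disk to the pool."""
    return nodes * disk_tb_per_node


def ideal_write_seconds(file_gb, nodes, mb_per_s_per_node):
    """Time to write a file if it is spread evenly over all nodes.

    Idealized: assumes perfect parallelism and no network bottleneck.
    """
    aggregate_mb_per_s = nodes * mb_per_s_per_node
    return file_gb * 1000 / aggregate_mb_per_s


def nodes_needed(target_tb, disk_tb_per_node):
    """Smallest number of nodes whose combined disks reach target_tb."""
    return math.ceil(target_tb / disk_tb_per_node)


print(ideal_write_seconds(100, nodes=1, mb_per_s_per_node=100))   # 1000.0
print(ideal_write_seconds(100, nodes=10, mb_per_s_per_node=100))  # 100.0
print(nodes_needed(600, disk_tb_per_node=4))                      # 150
```

In this model, going from 1 node to 10 cuts the write time for a 100 GB file from 1000 seconds to 100, and running out of space is solved by raising the node count.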

SOFTWARE TO IMPLEMENT DISTRIBUTED STORAGE

Several software packages can be used to implement distributed storage, such as Hadoop, GlusterFS, and Ceph. Hadoop is one of the most widely used among them.

HADOOP

Hadoop is free, open-source software from the Apache Software Foundation that implements distributed storage using a master-slave topology. In Hadoop, the master node is called the NameNode (NN) and the slave nodes are called DataNodes (DN). The DataNodes contribute their storage to the cluster, and the NameNode manages it as a single pool and keeps track of where data is stored. Data is transferred between the nodes using the HDFS protocol. When a large file needs to be stored, it is split into blocks, and these blocks are stored on different DataNodes.
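The idea of splitting a file into blocks and spreading them over DataNodes can be illustrated with a toy model in plain Python. This is not Hadoop code and does not use any Hadoop API: the function names, the node names, the tiny 4-byte block size, and the round-robin placement are all assumptions chosen to keep the sketch short. Real HDFS uses much larger blocks, keeps several replicas of each block, and chooses DataNodes with its own placement policy.

```python
from collections import defaultdict


def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, data_nodes):
    """Assign blocks to nodes round-robin (hypothetical placement policy).

    Returns (block_map, storage):
      block_map: block index -> node name (the metadata a master would keep)
      storage:   node name -> {block index: block bytes} (what each node holds)
    """
    block_map = {}
    storage = defaultdict(dict)
    for index, block in enumerate(blocks):
        node = data_nodes[index % len(data_nodes)]
        block_map[index] = node
        storage[node][index] = block
    return block_map, dict(storage)


def read_file(block_map, storage):
    """Reassemble the file by fetching each block from the node that holds it."""
    return b"".join(storage[block_map[i]][i] for i in sorted(block_map))


data = b"abcdefghij"
blocks = split_into_blocks(data, block_size=4)
block_map, storage = place_blocks(blocks, ["dn1", "dn2"])
print(blocks)     # [b'abcd', b'efgh', b'ij']
print(block_map)  # {0: 'dn1', 1: 'dn2', 2: 'dn1'}
print(read_file(block_map, storage) == data)  # True
```

The block_map plays the role of the metadata held by the master node, while storage stands in for the disks of the individual data nodes.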
