INTRODUCTION TO BIG DATA
BIG DATA
Big data refers to the huge volumes of data that a company needs to store on its servers for future reference.
Social media companies such as Facebook need to store all of their users' data, such as images and videos, on their servers so that the users and their friends can access it.
STORAGE
There are two types of storage hardware: commodity storage and enterprise storage.
1. Commodity storage
Commodity storage is storage hardware that is relatively inexpensive, widely available and more or less interchangeable with other hardware of its type. To be interchangeable, commodity hardware is usually broadly compatible and can function on a plug-and-play basis with other commodity hardware products. In this context, a commodity item is a low-end but functional product without distinctive features. A commodity computer, for example, is a standard-issue PC that has no outstanding features and is widely available for purchase.
2. Enterprise storage
Enterprise storage is a centralized repository for business information that provides common data management, protection and data sharing functions through connections to computer systems. Enterprise storage is typically a single unit with a huge data storage capacity.
CHALLENGES FACED IN BIG DATA
One of the challenges in big data is the rate at which data arrives: a company such as Facebook has an incoming daily rate of about 600 TB. Storing this on enterprise storage requires massive devices, which are very expensive and have drawbacks such as limited volume and slow I/O speed.
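To get a feel for what about 600 TB per day means, the figure can be turned into a sustained write rate. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 10^12 bytes) and a hypothetical drive that sustains 150 MB/s, a number chosen purely for illustration.

```python
import math

TB = 10**12                      # decimal terabyte in bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gb_per_s(tb_per_day: float) -> float:
    """Average write rate in GB/s needed to keep up with a daily ingest volume."""
    return tb_per_day * TB / SECONDS_PER_DAY / 10**9


def drives_needed(tb_per_day: float, drive_mb_per_s: float) -> int:
    """Minimum number of drives writing in parallel to absorb the ingest rate."""
    rate_mb_per_s = tb_per_day * TB / SECONDS_PER_DAY / 10**6
    return math.ceil(rate_mb_per_s / drive_mb_per_s)


def yearly_volume_pb(tb_per_day: float, days: int = 365) -> float:
    """Data accumulated in a year, in petabytes (1 PB = 1000 TB)."""
    return tb_per_day * days / 1000


if __name__ == "__main__":
    print(round(sustained_rate_gb_per_s(600), 2))  # about 6.94 GB/s
    print(drives_needed(600, 150))                 # 47 drives at the assumed 150 MB/s
    print(yearly_volume_pb(600))                   # 219.0 PB per year
```

Even under these generous assumptions, no single device can absorb the load on its own, and the stored volume keeps growing by hundreds of petabytes a year.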
To overcome the problems of volume and I/O speed, we can use distributed storage.
DISTRIBUTED STORAGE
Instead of buying enterprise storage, which is expensive and has limited volume and slow I/O speed, we can build a cluster with a master-slave topology out of cheap commodity storage.
ADVANTAGES OF DISTRIBUTED STORAGE
Setting up distributed storage is much cheaper than buying enterprise storage because it is built from commodity storage.
I/O speed depends on the number of nodes in the cluster: the more nodes there are, the more reads and writes can run in parallel, so the faster the overall I/O.
If the cluster runs out of storage, we can add another slave node to increase the volume.
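The last two advantages can be illustrated with a small model. This is an idealized sketch, not a description of any real system: the `Node` and `Cluster` classes are hypothetical, and the model assumes data is spread evenly over the nodes, every node works in parallel, and the network is never the bottleneck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    capacity_tb: float       # storage this node contributes to the cluster
    throughput_mb_s: float   # I/O rate this node can sustain on its own


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        """Adding a slave node grows both capacity and parallel I/O."""
        self.nodes.append(node)

    def total_capacity_tb(self) -> float:
        return sum(n.capacity_tb for n in self.nodes)

    def aggregate_throughput_mb_s(self) -> float:
        # Idealized: all nodes transfer their share of the data at the same time.
        return sum(n.throughput_mb_s for n in self.nodes)

    def transfer_time_s(self, size_gb: float) -> float:
        """Time to read or write size_gb spread evenly over all nodes."""
        if not self.nodes:
            raise ValueError("cluster has no nodes")
        return size_gb * 1000 / self.aggregate_throughput_mb_s()


if __name__ == "__main__":
    cluster = Cluster()
    for _ in range(4):
        cluster.add_node(Node(capacity_tb=4.0, throughput_mb_s=100.0))
    print(cluster.total_capacity_tb(), cluster.transfer_time_s(40))  # 16.0 100.0

    cluster.add_node(Node(capacity_tb=4.0, throughput_mb_s=100.0))
    print(cluster.total_capacity_tb(), cluster.transfer_time_s(40))  # 20.0 80.0
```

In practice the gain is less than linear, since network bandwidth and the extra copies kept for fault tolerance also take their share, but the trend is the same: more nodes mean more volume and more parallel I/O.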
SOFTWARE TO IMPLEMENT DISTRIBUTED STORAGE
Several software packages can be used to implement distributed storage, such as Hadoop, GlusterFS and Ceph. Of these, Hadoop is one of the most widely used.
HADOOP
Hadoop is free, open-source software developed under the Apache Software Foundation that implements distributed storage using a master-slave topology. In Hadoop the master node is called the NameNode (NN) and the slave nodes are called DataNodes (DN). The DataNodes contribute their storage to the cluster, and the NameNode keeps track of which data is stored on which DataNode, so the cluster behaves like one large store. Data is transferred between the nodes through HDFS, the Hadoop Distributed File System. When a large file needs to be stored, it is split into blocks, and these blocks are stored on different DataNodes.
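The block splitting described above can be sketched in a few lines. This is a simplified illustration, not Hadoop's actual code: the function names are made up, the 128 MB block size is an assumption (it is a common HDFS default, but it is configurable), and the round-robin placement stands in for Hadoop's real placement policy, which also keeps several copies of each block.

```python
from itertools import cycle

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size; configurable in a real cluster


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(block_sizes: list, data_nodes: list) -> dict:
    """Assign each block index to a data node in round-robin order.

    The resulting mapping is the kind of metadata the name node keeps.
    """
    if not data_nodes:
        raise ValueError("at least one data node is required")
    nodes = cycle(data_nodes)
    return {index: next(nodes) for index in range(len(block_sizes))}


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])         # [128, 128, 44]
    print(place_blocks(blocks, ["dn1", "dn2"]))    # {0: 'dn1', 1: 'dn2', 2: 'dn1'}
```

Because the blocks of one file end up on different data nodes, they can be read back in parallel, which is where the I/O advantage of distributed storage comes from.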