Big Data and Hadoop
The world we live in is data oriented. Every day we deal with large quantities of data. In today's era of computers and technology the world has come closer: anyone can communicate with anyone overseas using social media. Social media has become part of our lives; we share our memories, happiness, sorrow and more through it. We share text, images, videos and many other kinds of data, and social media platforms store this data for a long time for each user. For example, say you uploaded a photo to Facebook 3 years ago. You may want to revisit your memories through that photo, so Facebook can't delete it. Now imagine the same situation for every user on Facebook. You can probably see how huge this data becomes, and Facebook keeps it on its servers practically forever!
The data that we upload to social media like Facebook, LinkedIn, Instagram, etc. is huge in quantity. According to one survey, we upload 500+ TB of data to Facebook every day. To understand how huge this is, take a glance at the following image.
This kind of huge data is known as Big Data. Social media platforms like Facebook have to store and process data at this scale.
Big Data is a problem!!!
Yes, you read that right. Big Data is a problem, not a technology (many newbies may think it is a technology). The first and most important problem that comes to mind is storage.
Consider an example. A hard disk with 1 TB of capacity costs somewhere around Rs. 4000, and social media platforms like Facebook store 500+ TB of data every day. To store it on such disks, one would have to spend 4000 * 500, i.e. approx. Rs. 20,00,000, every single day, which is hard to afford when the profit is comparatively small. So, how do we store such a huge amount of data?
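The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the price and the daily volume are the rough figures quoted above, the yearly total is simply the daily cost multiplied by 365, and the function name `daily_storage_cost` is made up for this illustration.

```python
# Back-of-the-envelope storage cost, using the rough figures from the text.
PRICE_PER_TB_RS = 4000   # approximate price of a 1 TB hard disk, in rupees
DAILY_UPLOAD_TB = 500    # approximate data uploaded per day, in TB


def daily_storage_cost(tb_per_day: int, price_per_tb: int) -> int:
    """Cost in rupees of buying enough 1 TB disks for one day's data."""
    return tb_per_day * price_per_tb


per_day = daily_storage_cost(DAILY_UPLOAD_TB, PRICE_PER_TB_RS)
per_year = per_day * 365  # the cost keeps adding up, since nothing is deleted

print(f"Per day:  Rs. {per_day:,}")
print(f"Per year: Rs. {per_year:,}")
```

Python groups digits in thousands, so the daily figure prints as 2,000,000, which is the same amount as Rs. 20,00,000 in Indian digit grouping.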
The second big issue is processing speed. The more data you have, the more time a computer takes to process it. The processing speed of a single computer today is definitely not enough to process such huge data all at once. So, how can we process the data in minimal time?
Because of problems such as these, handling and manipulating Big Data is far from easy.
Solution to the Big Data problem
If we revisit the Big Data problems at hand, they are all about insufficient storage and slow processing speed. To solve these issues, we can either use more resources for storage and processing on one system (not an efficient way, as it needs loads of money), or we can somehow reduce the quantity of data that any one system has to store and process.
Rather than storing and processing huge data on a single system, we can split the data into pieces and then store and process those pieces on more than one system. This concept is known as Distributed Computing.
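The split, process and combine idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how a real cluster works: worker threads on one machine stand in for separate systems, the "processing" is just a sum, and the names `split_into_chunks`, `process_chunk` and `distributed_sum` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into num_chunks pieces of nearly equal size."""
    size, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one leftover element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """The work one 'system' does on its own piece: here, just a sum."""
    return sum(chunk)


def distributed_sum(data, num_workers=4):
    """Split the data, process each piece on its own worker, combine the results."""
    chunks = split_into_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)


data = list(range(1, 1_000_001))
total = distributed_sum(data, num_workers=4)
print(total == sum(data))
```

Each worker only ever sees its own piece of the data, and the final answer is built from the partial results. In a real cluster the pieces would live on different machines, so both storage and processing are spread out.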
Together, these systems form a cluster, working as shown in the diagram. Social media platforms like Facebook use this concept to handle Big Data, using many clusters.
Facebook and other social media platforms apply this concept using a technology known as Hadoop. So, Hadoop is a technology that implements distributed computing.
Technical solution to Big Data: Hadoop
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop uses the concept of distributed computing.
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
- Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
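The fault tolerance point above (multiple copies of all data are stored automatically) can be illustrated with a toy model. This is a sketch under stated assumptions, not Hadoop's actual placement logic: blocks are assigned to nodes in simple round-robin order, three copies per block is assumed here, and the node names and the functions `place_blocks` and `readable_blocks` are hypothetical.

```python
REPLICATION = 3  # assumed number of copies per block for this sketch


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = [f"block{i}" for i in range(10)]
placement = place_blocks(blocks, nodes)

# With 3 copies per block, losing any 2 nodes leaves every block readable.
survivors = readable_blocks(placement, failed_nodes={"node2", "node4"})
print(len(survivors) == len(blocks))
```

Because every block lives on three different nodes, two simultaneous failures cannot remove all copies of any block. Adding more entries to `nodes` spreads the same blocks over more machines, which is the scalability point in miniature.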