BigData - a Problem or a Technology?
Hello everyone,
I am back with a new article. This one is all about BigData: I will explain what BigData is, and by the end you will also know whether it is a technology or a problem in the real industrial world.
What is BigData, and why is it called so?
We are living in 2020, and in today's world technology changes day by day: the technical market is expanding rapidly, new products arrive every day, giant companies keep launching products and adding new features to them, and we are the users of these products and technologies. But why am I telling you this?
Because when we surf the internet or visit any site, data is exchanged between that site and our browser. And if we upload a file, such as an image or a document, to that site, the file gets stored in a database or a data center.
Billions of people surf the internet and visit social media sites every day, and tonnes of data get uploaded daily. Giant companies like Facebook, Google, Amazon, and LinkedIn receive tonnes of data every day, and this data is not measured in gigabytes (GB); it runs into hundreds or thousands of terabytes (TB), or even petabytes (PB).
The image above is of one of Facebook's data centers, and you can see how much data they receive daily. This data is so huge that you can hardly imagine it, and it is technically known as BigData.
Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time.
To understand how and why this BigData is a problem in the real world, let us take the example of Google.
How much data does Google process?
We all know Google as the search engine that seems to know everything we ask it and has a solution for all our problems. But in today's scenario, Google is one of the leading tech giants, and besides the search engine it has other products too, like Android OS, Google Cloud, Google Play Services, etc. So have you ever thought about how much data Google receives daily and how they handle it?
Google processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
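A quick back-of-the-envelope check shows how the per-second figure above translates into the daily and yearly totals:

```python
# Back-of-the-envelope check of the search-volume figures above.
QUERIES_PER_SECOND = 40_000

per_day = QUERIES_PER_SECOND * 60 * 60 * 24   # seconds in a day
per_year = per_day * 365

print(f"per day:  {per_day:,}")    # about 3.46 billion
print(f"per year: {per_year:,}")   # about 1.26 trillion
```

So "over 3.5 billion per day" and "1.2 trillion per year" are consistent with roughly 40,000 queries every second.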
Google currently processes over 20 petabytes of data per day.
Yes, you read that right: 20 petabytes. This data is so huge in volume and so complex that it is known as BigData.
Now, to store any type of data, we need storage devices like hard disks, pen drives, etc., but these devices are commonplace and have limited storage capacity. So to store this BigData we need storage of very high capacity, and one example of such a storage system is Dell EMC, which can store thousands of TBs of data.
But the data Google receives is in petabytes, and it is very important for every company to store its data, because it is not just data; it is money. This data keeps their business running and growing: they analyse it to improve their products, to learn what their users want, to target advertisements, and so on. From this you can see why data is so important for business growth.
And since the data here is BigData, we cannot store all of it, because it is larger than the available storage capacity. This is one of the problems of BigData, and it is known as the problem of Volume, because the data is larger in size than the storage volume. Here, volume basically refers to:
Volume is the amount of data generated through websites.
And with BigData, we face the problem of volume.
Now, if we are facing a shortage of volume, why don't we just create storage devices with capacities in petabytes (PB) or exabytes (EB)? Because if we create storage devices of such huge capacity, we face one more problem: I/O speed.
If we increase the storage size, the data transfer speed, i.e. the I/O speed, becomes the bottleneck, and for growing a business we don't just need data; we need data on time. In addition to managing data, companies need that information to flow quickly, as close to real time as possible. This is known as the problem of Velocity in BigData.
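A rough calculation makes the velocity problem concrete. Assuming a typical hard-disk throughput of about 200 MB/s (an illustrative figure, not a quoted spec), reading 20 PB from a single device would take years, while spreading the same read across many disks working in parallel brings it down to hours:

```python
# Rough illustration of the velocity problem: reading 20 PB from one
# disk vs. in parallel across many. 200 MB/s is an assumed typical
# HDD throughput, used here only for the sake of the estimate.
PETABYTE = 10**15            # bytes
DISK_BPS = 200 * 10**6       # bytes/second, assumed

data = 20 * PETABYTE
one_disk_seconds = data / DISK_BPS
one_disk_days = one_disk_seconds / 86_400
print(f"one disk:    {one_disk_days:.0f} days")      # over 3 years

# Spread the same read across 1,000 disks working in parallel:
cluster_hours = one_disk_seconds / 1_000 / 3_600
print(f"1,000 disks: {cluster_hours:.1f} hours")
```

This is exactly why simply building bigger single devices does not help: only parallelism across many ordinary devices restores the speed.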
“Velocity can be more important than volume because it can give us a bigger competitive advantage. Sometimes it’s better to have limited data in real time than lots of data at a low speed.”
So now we can say that BigData is not a technology; it is a PROBLEM, or rather a STACK OF PROBLEMS, in this technical IT world.
So to manage BigData, we use the concept of Distributed Storage.
Distributed Storage
A distributed storage system is an infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premise distributed storage systems like Cloudian Hyperstore.
Distributed storage works on a master-slave topology: a master node keeps track of where the data lives, while the slave nodes actually store it.
And to implement distributed storage, we use software called Hadoop, with which we can easily create a cluster like the one shown above.
Hadoop
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
- Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
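The fault-tolerance point above rests on replication: every block is copied to several nodes, so losing one node loses no data. Here is a minimal sketch of that idea; the replication factor of 3 mirrors the common Hadoop default, but the code itself is an illustration, not the Hadoop implementation (real HDFS is also rack-aware when choosing replica targets).

```python
# A minimal sketch of replication-based fault tolerance.
# Assumed, simplified model: each node is just a set of block IDs.
REPLICATION = 3

nodes = {f"node{i}": set() for i in range(4)}

def place_block(block_id: str):
    # Put each replica on a different node; here simply the first
    # REPLICATION nodes in sorted order, for the sake of the demo.
    for name in sorted(nodes)[:REPLICATION]:
        nodes[name].add(block_id)

place_block("blk_001")
place_block("blk_002")

# Simulate a node failure:
failed = nodes.pop("node0")

# Every block the dead node held still survives on the other nodes:
surviving = set().union(*nodes.values())
print(failed <= surviving)  # True
```

In the real system, the master additionally notices the failure and re-replicates the affected blocks onto healthy nodes, restoring the replication factor automatically.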
THAT'S ALL !!!
I hope you are now able to understand what BigData is.
THANKS FOR READING !!!