A Practical Approach to Big Data
Big data is complicated. Fortunately, quite a few cloud services on the market make it easier, and well-chosen open source projects lower the barrier to entry further. This article lists some easy, low-cost options for getting started with big data.
Please note that this article does not cover real-time or stream processing. If you are looking for those solutions, you can save time by stopping here.
To choose good big data tools, you first need to answer two questions:
1. How big might your data be?
2. How fast do you want to query the data?
The first question determines what big data storage you need, and the second determines what execution engine or query engine you need. The following are some scenarios based on different answers to the two questions.
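Before walking through the scenarios, the mapping they describe can be sketched as a small Python function. This is only an illustration: the function name `suggest_stack` and the numeric thresholds (100 TB, one hour, one minute) are assumptions chosen to mirror "hundreds of TB", "hours" and "minutes", not hard limits.

```python
from typing import Optional


def suggest_stack(data_tb: float, tolerable_latency_s: float,
                  low_cost: bool = False) -> Optional[str]:
    """Map the two questions (data size, query latency) to a suggested stack.

    All thresholds are illustrative assumptions, not vendor limits.
    """
    # Latency measured in hours: batch processing is acceptable.
    if tolerable_latency_s >= 3600:
        return "Amazon S3 + Apache Spark"
    # Tens of TB: interactive or near-interactive querying.
    if data_tb < 100:
        # Minutes of latency are tolerable and budget is tight.
        if low_cost and tolerable_latency_s >= 60:
            return "Apache Cassandra + Presto + H2 Console"
        return "Amazon Redshift + Tableau"
    # Huge data with low latency is not covered by these scenarios.
    return None
```

For example, `suggest_stack(20, 5)` picks the Redshift option, while `suggest_stack(20, 120, low_cost=True)` picks the in-house Cassandra option.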
1. Huge data size (hundreds of TB), high tolerable query latency (hours)
This scenario is usually batch processing. One good solution is Amazon S3 + Apache Spark. You can run Spark jobs that query the data in S3, save the results to a CSV file, and use Excel to analyze or visualize them.
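A minimal PySpark sketch of such a batch job might look like the following. It rests on several assumptions that are not part of this article: the input is stored as Parquet with an `event_date` column, the bucket and prefix names are placeholders, and the cluster has the Hadoop S3 connector configured so that `s3a://` paths resolve (some managed Spark services use a different scheme).

```python
def s3_uri(bucket: str, key_prefix: str) -> str:
    """Build an s3a:// URI, the scheme used by Hadoop's S3A connector."""
    return f"s3a://{bucket}/{key_prefix.strip('/')}"


def run_batch_query(input_uri: str, output_uri: str) -> None:
    """Count events per day and write the small result set as CSV."""
    # Imported here so s3_uri() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()
    try:
        # Assumption: Parquet input with an `event_date` column.
        events = spark.read.parquet(input_uri)
        events.createOrReplaceTempView("events")
        daily = spark.sql(
            "SELECT event_date, COUNT(*) AS event_count "
            "FROM events GROUP BY event_date ORDER BY event_date"
        )
        # coalesce(1) produces a single CSV part file that opens easily in
        # Excel; this is only reasonable because the aggregate is small.
        (daily.coalesce(1)
              .write.mode("overwrite")
              .option("header", True)
              .csv(output_uri))
    finally:
        spark.stop()
```

A job would then call something like `run_batch_query(s3_uri("example-bucket", "events"), s3_uri("example-bucket", "reports/daily"))`, typically submitted with `spark-submit`.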
2. Moderately big data size (tens of TB), low query latency (seconds)
This scenario is usually interactive querying. One good solution is Amazon Redshift + Tableau. Redshift provides low query latency, and Tableau provides good data visualization. One caveat if you use this solution: plan your Redshift cluster capacity ahead of time, because scaling up a Redshift cluster is a little painful.
3. Moderately big data size (tens of TB), moderate query latency (minutes), low cost
This scenario applies when you do not want to spend much money on Amazon Redshift or Tableau. You will need a developer familiar with big data to set up an in-house cluster. One good solution is Apache Cassandra + Presto Query Engine + H2 Console (from H2 Database Engine).
Cassandra provides highly available big data storage and is easy to set up. Presto provides a distributed SQL query engine on top of Cassandra, with JDBC support. H2 Console is a simple, pleasant web UI for querying data via JDBC. Without any coding work, you can combine these three tools to provide an end-to-end big data story in your company.
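Wiring the three tools together mostly comes down to two pieces of configuration: a Presto catalog file that points at Cassandra, and a JDBC URL to paste into H2 Console. The Python sketch below only renders those two strings so their shape is easy to see. The host names, port and schema are placeholders, and the helper names are hypothetical. In H2 Console the driver class would be `com.facebook.presto.jdbc.PrestoDriver`, assuming the Presto JDBC driver jar is on H2's classpath.

```python
from typing import Sequence


def cassandra_catalog_properties(contact_points: Sequence[str]) -> str:
    """Render the contents of Presto's etc/catalog/cassandra.properties."""
    return (
        "connector.name=cassandra\n"
        f"cassandra.contact-points={','.join(contact_points)}\n"
    )


def presto_jdbc_url(host: str, port: int, catalog: str = "cassandra",
                    schema: str = "") -> str:
    """Build a jdbc:presto:// URL; the schema segment is optional."""
    url = f"jdbc:presto://{host}:{port}/{catalog}"
    return f"{url}/{schema}" if schema else url


# Placeholder hosts and schema, for illustration only.
print(cassandra_catalog_properties(["10.0.0.1", "10.0.0.2"]))
print(presto_jdbc_url("presto.example.internal", 8080, schema="analytics"))
```

With the catalog named `cassandra`, tables are addressed in SQL as `cassandra.<keyspace>.<table>`, since a Cassandra keyspace appears as a Presto schema.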
As you can see, the tools you choose will differ based on your specific big data needs. I hope this article gives you a quick idea of the options. Please feel free to contact me if you have any further questions or would like to provide feedback.
Acknowledgement: the header image is from FreeDigitalPhotos.net by franky242.