Big Data Technologies - How to leverage them?
With all the hoopla around Big Data buzzwords, how do you ensure that the technology stack you select for your organization is the right one? What should drive these decisions? Should you simply follow industry standards? Is there a silver bullet?
Choosing the right technology stack isn't just about industry standards. Standards can certainly offer insight into why a particular stack works for others, but before you zero in on anything, focus on the points below; they are the real drivers of the path you choose.
Understand your Data's Contours
Big data technologies, whether HADOOP or an MPP database, have distributed architectures and therefore work efficiently only if you manage to spread data as evenly as possible across all nodes. Let's focus on HADOOP: your options for processing data on a HADOOP platform are to use either map-reduce or in-memory technologies. This is where understanding your data's contours helps, since it will guide which technology to choose. For instance, using SPARK (an in-memory technology) for ETL on a huge, skewed data set, where you need to frequently join many data entities (specifically with outer joins), can end up taking more time than custom map-reduce or HIVE. Many developers have watched an entire SPARK job wait on one last task to finish: the task that is processing far more data than the rest and eventually fails with an "out of memory" exception. A quick way to spot this risk is to profile rows per join key, as sketched below.
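Here is a minimal sketch of that profiling step in PySpark; the path and join key are hypothetical, and the point is only to surface heavily loaded keys before committing to a shuffle-heavy join:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://my-bucket/orders/")  # hypothetical data set

# Rows per join key: if a handful of keys hold most of the rows, the shuffle
# partition handling those keys becomes the straggler task described above,
# and is the first candidate to run out of memory.
(orders.groupBy("customer_id")   # hypothetical join key
       .count()
       .orderBy(F.desc("count"))
       .show(20))
```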
Once you've thoroughly analyzed such a data set to find its optimal partition points, you can use technologies like HIVE, which can gracefully handle skewness and joins in such use cases. The key here is to read once and process for all desired outputs, for instance, aggregation at multiple data points. Based on my observation, HIVE performs better than SPARK here, though again for a very specific use case (high volume, skewed data set, numerous outer joins). No doubt, in-memory systems are faster than disk-based systems, but the nature of your data also plays a crucial role in deciding one way or the other. You could argue that memory-related errors can be countered by throwing in beefier machines or adding nodes, but that increases your cost too. If cost is not an issue, then by all means scale to your pinnacle (provided you don't overshoot your sweet spot), but in most organizations we need to keep a close eye on cost as we invest in big data infrastructure, whether on-premise or in the cloud.
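To illustrate the read-once idea, here is a hedged sketch in PySpark; all table, column and path names are hypothetical. HIVE expresses the same pattern natively through multi-table INSERT statements, and handles skew with settings such as hive.optimize.skewjoin.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source tables; the joins mirror the ETL pattern described above.
orders = spark.read.parquet("s3://my-bucket/orders/")
customers = spark.read.parquet("s3://my-bucket/customers/")
products = spark.read.parquet("s3://my-bucket/products/")

enriched = (orders
            .join(customers, "customer_id", "left_outer")
            .join(products, "product_id", "left_outer"))
enriched.cache()  # read and join once, reuse for every output below

# Aggregation at multiple data points from the same materialized set.
by_region = enriched.groupBy("region").agg(F.sum("amount").alias("sales"))
by_product = enriched.groupBy("product_id").agg(F.sum("amount").alias("sales"))

by_region.write.mode("overwrite").parquet("s3://my-bucket/agg/by_region/")
by_product.write.mode("overwrite").parquet("s3://my-bucket/agg/by_product/")
```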
Interactive querying of data, on the other hand, is a good use case for SPARK or PRESTO (based on my observation, PRESTO has the better optimization engine). Since you would not be churning huge amounts of data as in an ETL process, and would be querying only a subset of the data, in-memory technologies become much more attractive here. Do a PoC to determine which in-memory technology works well with your data and your query needs; a minimal timing harness for such a PoC is sketched below.
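This is one way such a PoC harness could look; the query, table name and engine callable are all hypothetical, and the same function can wrap any engine's query entry point:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def time_query(run, sql, repeats=3):
    """Run the same query a few times and report best/average wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.time()
        run(sql)
        timings.append(time.time() - start)
    return min(timings), sum(timings) / len(timings)

# Hypothetical interactive query over a registered table named "sales".
sql = "SELECT region, COUNT(*) FROM sales WHERE sale_date = '2018-01-01' GROUP BY region"
best, avg = time_query(lambda q: spark.sql(q).collect(), sql)
print("best=%.2fs avg=%.2fs" % (best, avg))
```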
Understand how the Business wants to consume the Data
Real time or batch? Raw, or cleansed and enriched with master data? What kinds of questions does the business want answered? Keeping these in mind, now focus on the data ingestion process. Streaming data using KAFKA or Kinesis (if you are on AWS) seems like an excellent approach to ingest real-time data, and you may be tempted to jump on it, but can your business really work with raw data? What if the data is useless to them without enrichment? Now think it over again: will a streaming platform alone solve your problem? The answer is no. You will have to build "ETL" to micro-batch the cleansed and enriched data into your system; a sketch of such a pipeline follows below. You could go for a "Lambda Architecture" here, provided the business feels it can make sense of the most current raw data until the batch process loads the enriched data; otherwise, don't even bother.
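As a hedged sketch of that micro-batch ETL, here is a Spark Structured Streaming job that reads raw events from KAFKA, enriches them with static master data, and lands the result in batches; the broker, topic, schema and paths are all hypothetical:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema of the raw KAFKA payload.
schema = StructType([StructField("customer_id", StringType()),
                     StructField("amount", DoubleType())])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
       .option("subscribe", "sales-events")               # hypothetical topic
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Static master data used to enrich the raw stream before it lands.
customers = spark.read.parquet("s3://my-bucket/master/customers/")
enriched = events.join(customers, "customer_id", "left_outer")

query = (enriched.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/enriched/sales/")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/sales/")
         .trigger(processingTime="5 minutes")  # the micro-batch cadence
         .start())
```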
A good use case for continuous ingestion of raw data that provides immediate value to the business or customer is a stock brokerage firm: all the business wants is the real-time market value of stocks, pulled directly from the stock exchange and reported as-is. An equally bad use case would be trying to make sense of sales or any other metric without any context (zero enrichment with master data).
Next, understand the way your users want to consume the data: do they want to slice and dice it, or do they track a predetermined set of metrics that changes only every few months? If they want to slice and dice, consider using an MPP relational database to build a star schema. No other big data platform is mature enough to give users that flexibility, and it is easier for visualization tools to latch on to relational databases than to the other options; a sample slicing-and-dicing query follows below. Use NoSQL or schema-on-read databases where you need to store predetermined metrics; most of these databases are key-value based and provide fast reads and writes of such data without predesigned schemas.
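To make the star-schema point concrete, here is a sketch of a slicing-and-dicing query over a hypothetical fact/dimension model; the SQL is run through SPARK's SQL interface purely for convenience and would work the same way on an MPP database:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical star schema: one fact table plus conformed dimensions. Users can
# slice along any dimensional attributes; only the GROUP BY needs to change.
slice_sql = """
    SELECT d.year, p.category, c.region, SUM(f.amount) AS sales
    FROM   fact_sales f
    JOIN   dim_date d     ON f.date_key = d.date_key
    JOIN   dim_product p  ON f.product_key = p.product_key
    JOIN   dim_customer c ON f.customer_key = c.customer_key
    GROUP BY d.year, p.category, c.region
"""
spark.sql(slice_sql).show()
```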
There is also a breed of "column family" NoSQL databases, such as Cassandra, and "document" NoSQL databases, such as MongoDB. You could argue that these can be used for slicing and dicing of data; however, their architecture demands that dimensional attributes be denormalized into the transaction entities in order to avoid joins (which are fundamental to slicing and dicing) and boost performance. That, in turn, leads to multiple flavors of the same data set: a maintenance nightmare overall and not a good design. I am not discounting these databases, as they have valid use cases elsewhere, such as a heavy transaction system (where the business accepts availability and partition tolerance over strict ACID guarantees, a separate discussion outside the scope of this blog :)), but not when it comes to slicing and dicing.
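To see why, consider what that denormalization looks like in practice; this is a hypothetical document sketch, not any particular product's API:

```python
# Dimensional attributes copied into every transaction document to avoid joins.
order_doc = {
    "order_id": 1001,
    "amount": 250.0,
    # Customer dimension attributes duplicated into the transaction entity:
    "customer_name": "Acme Corp",
    "customer_region": "EMEA",
    "customer_segment": "Enterprise",
}
# Any change to a customer attribute must now be rewritten into every order
# document that embeds it -- the maintenance burden described above.
```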
Organize your Data
This is the most important aspect and often the most neglected. Many organizations believe they can exploit a big data setup by just throwing data into it and letting the platform do the magic. It does not and will not work that way. Such experiments usually end in a distaste for these technologies and lead stakeholders to believe it is all just another "hype".
How do you determine the best organization for your data? The answer lies in the previous points. Once you've thoroughly analyzed the data and understood the business needs, you should be in a position to model the data (as discussed earlier: star schema or key-value) and determine the partition points that allow optimal processing. To accommodate joins, distribute the smaller data sets to all nodes; this applies to HADOOP as well as MPP databases. In essence, design your file structure in HDFS or S3 (S3 is rapidly being adopted as a pseudo-HDFS on the AWS cloud) so that the partition points are explicit, reducing skewness, collocating joins and avoiding hot spots, thereby boosting performance. Both techniques are sketched below. In the database world, you would come up with a physical data model that addresses the same concerns as on the HADOOP platform.
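Here is a hedged sketch of both techniques in PySpark, with all paths and column names hypothetical: partitioning the large data set on a chosen partition point, and broadcasting a small data set to every node for a collocated join:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://my-bucket/orders/")  # large, hypothetical

# 1) Bake the partition point into the file layout itself: S3/HDFS paths become
#    .../sale_date=2018-01-01/..., so engines can prune partitions at read time.
(orders.write
 .partitionBy("sale_date")
 .mode("overwrite")
 .parquet("s3://my-bucket/warehouse/orders/"))

# 2) Distribute the small data set to all nodes: a broadcast (map-side) join
#    collocates the join and avoids shuffling the large table.
regions = spark.read.parquet("s3://my-bucket/lookup/regions/")  # small, hypothetical
joined = orders.join(broadcast(regions), "region_id", "left_outer")
```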
Finally, consider data compression, since it reduces I/O, and columnar data formats, since they avoid reading an entire row into memory when only a few columns are queried. Store files in the ORC or PARQUET format; both support columnar storage of data and hence boost performance. When four columns are queried out of thirty in a multi-billion-record set, the amount of data actually read into memory (roughly the size of those four columns) is a fraction of what a row-oriented file format would have to read. Likewise, there are MPP columnar databases, such as Redshift, that support excellent compression encodings; use databases of this architecture to reduce the overall size of the data set, gain columnar storage and boost performance.
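As a sketch (paths and columns hypothetical), writing and then selectively reading a compressed, columnar data set in PySpark looks like this; ORC is shown, and PARQUET is analogous:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

enriched = spark.read.parquet("s3://my-bucket/enriched/sales/")  # hypothetical

# Write compressed, columnar ORC ("zlib" and "snappy" are common codecs).
(enriched.write
 .option("compression", "zlib")
 .mode("overwrite")
 .orc("s3://my-bucket/warehouse/sales_orc/"))

# Reading back only a few columns scans just those column streams on disk,
# not whole rows, which is the I/O saving described above.
subset = (spark.read.orc("s3://my-bucket/warehouse/sales_orc/")
          .select("customer_id", "region", "amount"))
```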
Key Takeaways
- Thoroughly analyze your data
- Determine efficient partition points in your data set
- Given your data's contours, determine whether map-reduce or in-memory technologies work better for your ETL loads
- Use in-memory technologies for interactive queries
- Understand how the business will consume the data, and decide accordingly between real-time and batch processing, and between raw and enriched data
- Slicing and dicing is better done in a relational world, so consider a relational database (for big data volumes, a relational MPP columnar database) over other big data platforms
- Organize your data well, in both HADOOP file and MPP database ecosystems; both are distributed systems and will react the same way to adverse or favorable conditions
- Consider data compression and columnar data formats to boost performance significantly
I hope I was able to provide some context on the usage of Big Data Technologies. This is an evolving field, and I am sure some of the observations here will change over time, for the benefit of all.
Oh, and before I conclude: there is no silver bullet! Understand the problem, and then select the right technology to solve it.