Big Data – A Cheat Sheet Approach
Business stakeholders and technologists alike are intrigued and intimidated by the term Big Data. When Credit Unions in particular look at it, the immediate concerns are: Do I have to spend millions to implement a Big Data initiative? What is the ROI, and where does it come from? These are great questions to ask, and the objective of this article is to present some facts, allay some fears, and above all provide a methodology for assessing the infrastructure and software tools needed to bring a Big Data implementation to fruition.
To me, Big Data is just a marketing term used in conjunction with analytics. The term and relevance of Big Data are intimately coupled to the size of the data an organization manages and the kinds of insights it would like to draw from that data; succinctly put, "What are the KPIs we are interested in?" Big Data for community banks and Credit Unions may mean something very different from Big Data for multinational banks such as Citi or Bank of America. It is the volume, variety, and velocity of data that dictate the complexity of the system, which in turn should drive the choice of technologies and tools for implementation. The foundational piece is to bring all the data, in whatever format, into one single source so that it can be used for drawing insights. While there are multiple technologies, I have concentrated specifically on HDFS/Hadoop, considering the ecosystem support available in terms of tools and human capital.

The high-level pictorial below works like a decision tree: the investment, and subsequently the benefits, of Big Data in a Credit Union depend on the answers at a few pivot points in that tree. This pictorial cheat sheet provides a reasonable estimate of what a Big Data implementation involves in terms of hardware and software infrastructure, based on the volume and variety of data. My intent is not to provide a solution but to guide any Credit Union toward a reasonable estimate of a likely implementation, without getting lost in the plethora of technology choices available in the marketplace today.
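To make the cheat-sheet arithmetic concrete, here is a minimal sizing sketch in Python. The replication factor of 3 is the HDFS default; the growth rate, working-space headroom, and usable disk per data node are illustrative assumptions that each organization should replace with its own numbers.

```python
import math

# Back-of-envelope HDFS cluster sizing, mirroring the cheat-sheet pivots.
HDFS_REPLICATION = 3           # HDFS default block replication factor
WORKING_SPACE_HEADROOM = 1.25  # extra room for temp/shuffle data (assumption)
USABLE_TB_PER_NODE = 8.0       # usable disk per data node (assumption)

def estimate_data_nodes(raw_data_tb, annual_growth_rate=0.2, horizon_years=3):
    """Estimate the data-node count from today's raw volume and projected growth."""
    projected_tb = raw_data_tb * (1 + annual_growth_rate) ** horizon_years
    total_tb = projected_tb * HDFS_REPLICATION * WORKING_SPACE_HEADROOM
    return max(math.ceil(total_tb / USABLE_TB_PER_NODE), 3)  # keep a minimal node count

if __name__ == "__main__":
    # e.g., a credit union holding 10 TB of raw data today
    print(estimate_data_nodes(10.0))  # -> 9 with the assumptions above
```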
As for the cost of infrastructure, we can bucket it into three major categories: hardware, software, and support. Today, branded hardware from vendors like Dell, Cisco, and HP is very inexpensive compared to a decade ago, and if the organization has the appetite to embrace white-label boxes, the savings can be even greater. HDFS/Hadoop itself is open source and adds no license cost. The only ongoing cost is support, and this again is very inexpensive compared to commercial software licenses. Add to this the cost of implementation: in the Hadoop world, skilled resources are available at affordable rates because of strong vendor support from the likes of Cloudera and Hortonworks. With a proper technology strategy, I can comfortably conclude that community banks and Credit Unions can implement HDFS-based solutions at a reasonable cost, where the benefits are bound to outweigh the cost and effort.
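As a worked illustration of these three buckets, the sketch below totals a hypothetical first-year cost. Every unit price in it is an assumption for illustration only, not a quote from any vendor.

```python
# Hypothetical first-year cost model for the three buckets:
# hardware, software, and support. All prices are illustrative
# assumptions, not vendor quotes.

def first_year_cost(node_count,
                    hardware_per_node=6_000.0,   # commodity server (assumed)
                    support_per_node=1_500.0,    # annual vendor support (assumed)
                    implementation=40_000.0):    # one-time services (assumed)
    hardware = node_count * hardware_per_node
    software = 0.0  # HDFS/Hadoop is open source: no license cost
    support = node_count * support_per_node
    return {"hardware": hardware, "software": software, "support": support,
            "implementation": implementation,
            "total": hardware + software + support + implementation}

if __name__ == "__main__":
    # using the 9-node estimate from the sizing sketch above
    for bucket, cost in first_year_cost(9).items():
        print(f"{bucket:>14}: ${cost:,.0f}")
```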
I hope I was able to provide a high-level guideline and allay the apprehensions of stakeholders who battle with these decisions in their everyday work. Trust me, I have been in such circumstances often, and I have learned to confront these issues and find solutions instead of merely contemplating them. Needless to say, it has paid rich dividends.
Great article. I agree with you on most pointers, but to my mind the relevance of Big Data (as a technology, not as branding, on which I am in line with you) is not just computing over vast volumes of data but, more importantly, the ability to disintegrate and integrate data from sources that in the traditional milieu would not talk to each other, and to make them work together. In that sense, Big Data becomes relevant first as a translation layer (for handling unrelated data) and then as a parallel computation layer (in MapReduce). Also, I assumed that when you mentioned bringing all data into one source, you meant all (relevant) data per the logical requirements of enterprise KPIs, because one of the key challenges of Big Data implementations is handling data noise, which makes them complex and bulky. When, in most cases, 60% of the data is noise, one needs to be prudent in articulating the needs and wants from the data before devising an ingestion or HDFS strategy. Hope this complements your thoughts. Again, great article, and keep pouring in. Regards.
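To make the commenter's noise-pruning point concrete, here is a minimal sketch that keeps only KPI-relevant fields before anything is staged for HDFS ingestion. The record format (JSON lines) and the field names are hypothetical assumptions for illustration.

```python
import json

# Keep only the fields the enterprise KPIs actually need (hypothetical names),
# dropping the "noise" before anything lands in HDFS.
KPI_FIELDS = {"member_id", "txn_amount", "txn_date", "channel"}

def filter_records(lines):
    """Yield slimmed-down records containing only KPI-relevant fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input counts as noise too
        slim = {k: v for k, v in record.items() if k in KPI_FIELDS}
        if slim:
            yield json.dumps(slim)

if __name__ == "__main__":
    raw = ['{"member_id": 1, "txn_amount": 42.5, "debug_blob": "..."}']
    for out in filter_records(raw):
        print(out)  # -> {"member_id": 1, "txn_amount": 42.5}
```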