Big Data 4 Big Corp?
I sometimes hear it said that Big Data is a fad and that it doesn’t work in big corporations. Some of the recent evidence seems to back this up: millions are being spent on Big Data projects, often with little to show for it. Big Data can seem remote from business impact. Perhaps Big Data is just for tech start-ups? But there is a host of large firms whose business model is founded on Big Data (Amazon, Google, etc.). If it works for them, why not other big corporations?
Why do some Big Data projects fail while others are successful? These are my thoughts, based on my experiences in a number of corporations and on recent conversations I have had with industry participants.
Big Corp IT suffers from a skills deficit in Big Data technology
Big Data technology changes all the time, with some applications and projects making major releases quarterly. Documentation is sparse, and knowledge tends to remain in the heads of project contributors and start-up pioneers.
There may be a few people in-house who can set up a cluster and install Hadoop, and maybe one or two will get it to work in a vanilla use case. But to configure it efficiently you need people with practical experience of the idiosyncrasies and edge cases of Hadoop (or whatever software you are using). We actually have a couple at Barclays, but that is unusual. The chances are that these skills are not in your organisation yet. Only the basics are written down, and the few courses that exist are often out of date. Training your staff can only get you so far. The best Big Data people tend to be passionate autodidacts; you can’t go on a training course to learn that.
If you attempt a Big Data initiative without the right skills you will fail. If you can’t find the right skills and aptitudes internally you will need to recruit from outside.
Infrastructure is easy, Data is hard
In my experience, Big Data raises more issues about data than about technology. Because it combines storage and computation technologies, it always presents challenges around data security, data privacy, audit and governance that conventional architectures, which separate storage from computation, do not face. This is in addition to the revising and documenting of schemas that tends to accompany new infrastructure, not to mention the questions posed by unstructured data.
But Big Corp IT teams tend to focus on the technology infrastructure first and deal with the data afterwards, so that when the infrastructure is delivered it sits idle until the data issues are sorted out. The data issues will take at least as long to resolve as procuring and implementing the infrastructure, so you need to run both processes in parallel.
There is no Magic
Only a few years ago, Big Data analytics required a team of Google geniuses who could write map/reduce jobs in Java. Now a plethora of vendors promise applications (often only half-built) to make everything magically easy. But these promises distort the expectations of business managers, and the systems seldom work as well as advertised.
Here’s why. To deal with the bigness of Big Data, you break a problem up into pieces, compute the answer to each piece in a different place and then bring them all back together again. Optimising this process turns out to be quite tricky and often relies upon an understanding of the particular characteristics of the problem itself; it is difficult to generalise. The cluster needs to be configured carefully to manage resource constraints and avoid unnecessary disk writes, and the code needs to be designed to minimise the impact of bottlenecks such as sorts and reduces. Getting this wrong can result in compute times that are orders of magnitude (hundreds of times) slower, and because of the large volume of data involved this can translate into run-times of days, making analysis impractical (you’d be better off sticking to conventional “small data” sampling analytics). For example, my team was able to build a query engine many hundreds of times faster than Hive because we understood the particular domain and were able to optimise accordingly.
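To make the split-compute-combine idea concrete, here is a minimal PySpark sketch (not the query engine mentioned above; the dataset and names are hypothetical, and it assumes a local Spark installation). It shows how the same aggregation can be written so that most of the work happens before the shuffle, the sort-and-reduce bottleneck described above.

```python
from operator import add
from pyspark.sql import SparkSession

# A toy illustration only: hypothetical data, local "cluster".
spark = SparkSession.builder.master("local[*]").appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical (key, value) pairs, e.g. (customer_id, transaction_amount).
pairs = sc.parallelize([("a", 1.0), ("b", 2.0), ("a", 3.0)] * 100_000)

# Naive: groupByKey ships every individual value across the network
# before anything is summed, so the shuffle becomes the bottleneck.
slow_totals = pairs.groupByKey().mapValues(sum)

# Better: reduceByKey pre-aggregates within each partition first, so only
# one partial sum per key per partition crosses the shuffle boundary.
fast_totals = pairs.reduceByKey(add)

print(fast_totals.collect())
spark.stop()
```

Both versions give the same answer, but on a real cluster the first moves the whole dataset across the network while the second moves only a handful of partial sums, and that is exactly the kind of detail that separates a practical run-time from one measured in days.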
“Magic bullet” applications are often benchmarked using simple tasks, or ones that are pre-optimised for a domain. If you use them on real-world problems, without properly trained people and on a sub-optimally configured cluster, they can run far slower than the benchmarks suggest. So it’s no wonder that Big Data can return disappointing results.
Don’t “do it right first time”
Perhaps the most fundamental reason why Big Data is seen to fail in Big Corp is that, although Big Data is a new and developing field, Big Corp IT can’t resist the old mantras of scale, enterprise solutions and future-proofing. But it is often difficult to envisage what the first successful Big Data use case will be, let alone how it will be used in five years’ time. It is all very well to spec a fully resilient, production-ready cluster, but if it doesn’t do what you need it will be an expensive folly. And it’s precisely this bit that is difficult to envisage up front.
Worse, in order to justify the investment, Big Corp IT often invents some spurious borderline-undeliverable project like an “enterprise data lake” which will underpin the whole investment. These take many months to deliver and often arrive undercooked and irrelevant. By the time the solution is delivered (if it ever is) the project is discredited and stakeholders have moved on.
Be Agile. If you don’t know exactly what you are going to use Big Data for, start small and move fast. This can happen in Big Corp just as well as in a start-up, but it requires Big Corp IT to question the sacred axioms of scale and reusability. The inefficiency of doing something slowly and wrong dwarfs the supposed savings from economies of scale and specialisation. Big Corps that do this well learn from start-ups, using small, semi-autonomous, multidisciplinary teams to build capability quickly, deliver new products and learn how Big Data can work in their organisation. Only then do they spend time and money implementing an enterprise technology solution.
In summary
- Get the right people (always)
- Focus on the data as much as the infrastructure
- Set realistic expectations
- Start small and keep it simple
Sounds easy when you put it like that.
There is often one more problem in older corporations, namely a defensive company culture that can derail Big Data efforts. Big Data, and advanced analytics in general, often threaten the established way of thinking and the status quo. Sharing data with new Big Data teams is perceived as compromising data owners’ job security and giving away power and influence. This is one of the reasons that data lake projects fail.
Nice article, Harry! The right people are key (attitude, work ethic, and knowledge). Sometimes people can look outside their companies to "plug that hole" on a temporary basis until things solidify and get going. Many times, large initial Big Data plans can be like battle plans: at first contact, they are thrown out the window.
Wonderful article. One other thing I would like to add is to extend the current BI/analytics data management and governance practices to Big Data implementations, which helps streamline both sets of data initiatives. It ensures that there is a definitive and trusted data practice overseeing Big Data programs and that they are governed efficiently.
Excellent article. I'm particularly impressed by your focus on the importance of data-related skills (instead of simply building infrastructure). It's my experience that a fixation on hardware has ruined many 'big data' initiatives.