The Choice Awakens: Optimization vs Open Source with Hadoop

Tony Baer recently wrote an article about MapR and its continued journey to optimize its infrastructure for batch and streaming workloads, so that a single cluster can support all of these workloads, perform well, and deliver a lower TCO. Cloudera has focused more on helping the open source community innovate, and on selling tools for optimization, monitoring, management, and data governance. Hortonworks has taken more of an OEM and partnering approach. While MapR does support the standard Hadoop APIs, making migration to and from MapR easier, it raises the question yet again: should I go open source or "non-open source", and why?

Just like Star Wars has many sequels, this isn't the first time we've had to make this choice. Operating systems (Linux), SQL databases (MySQL), web servers (Apache HTTP Server), app servers (JBoss), ESBs (Mulesoft), and data integration (Talend) are some of the examples. But there has been one big change. Hadoop was arguably the first big example where the innovation came from open source first: it grew out of some of Google's work, was turned into open source at Yahoo, and the momentum then shifted to Cloudera and Hortonworks. In the previous examples, the open source equivalent either helped commoditize established features and lower costs, or opened up features to different communities and buyers. So with Hadoop, if you wanted the latest, you started with open source.

But, like all the previous technologies, when customers demanded a lot of innovation and new capabilities, proprietary implementations thrived. Oracle's, Microsoft's, and IBM's database businesses are still huge. What Tony rightly points out is that there's a big need for performance, low cost of ownership, and a host of capabilities, including data governance, that vendors need to provide and that are not open source (today). While analytics use cases (including extract, transform, load, or ETL) were solid initial use cases for Hadoop, it is streaming use cases, including IoT, that promise to have a huge impact on companies today. Why? Because improving the customer experience, optimizing the supply chain, and even autonomous vehicles all depend on taking in huge amounts of real-time data and making good decisions really fast. And that requires a host of technologies and capabilities that not only process and act on data really quickly, but also separately look for new patterns on the side, and learn. I call it the hare brain and the tortoise brain (a framing I first heard from John Cleese years ago :-)).

For the hare brain, you need new forms of real-time streaming integration to feed the data, messaging underneath it, and event processing/streaming analytics on top of Spark or other technologies to make sense of the data and act fast. Your memory for acting fast requires new ways to store the relationships that help you make decisions, such as graph technology. For the tortoise brain, you need a great way to store the data, combined with analytics tools (like Spotfire, Tableau, and Qlik, or newer NoSQL tools), analytics languages like R, and newer machine learning and artificial intelligence technologies. This all needs to work in a way that allows you to feed the learning back into the hare brain.
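To make the feedback loop concrete, here is a minimal sketch of the hare-brain/tortoise-brain pattern. All names (`HareBrain`, `TortoiseBrain`, `run_pipeline`) are hypothetical, and a real deployment would use a streaming engine and a learning framework rather than a single in-process loop; the sketch only illustrates the shape of the idea: a fast path that acts on each event immediately, and a slow path that periodically re-learns a rule and pushes it back into the fast path.

```python
import statistics
from collections import deque

class HareBrain:
    """Fast path: applies the current rule to every event, immediately."""
    def __init__(self, threshold):
        self.threshold = threshold

    def decide(self, reading):
        # Act fast: flag readings that exceed the currently learned threshold.
        return "alert" if reading > self.threshold else "ok"

class TortoiseBrain:
    """Slow path: accumulates history and periodically re-learns the rule."""
    def __init__(self, window=100):
        self.history = deque(maxlen=window)

    def observe(self, reading):
        self.history.append(reading)

    def learn_threshold(self):
        # A deliberately simple "learning" step: mean plus two standard deviations.
        mean = statistics.mean(self.history)
        spread = statistics.pstdev(self.history)
        return mean + 2 * spread

def run_pipeline(events, batch_size=10):
    tortoise = TortoiseBrain()
    hare = HareBrain(threshold=float("inf"))  # no rule yet: everything passes
    decisions = []
    for i, reading in enumerate(events, start=1):
        decisions.append(hare.decide(reading))   # fast path acts on every event
        tortoise.observe(reading)                # slow path just watches
        if i % batch_size == 0:
            # Feed what the slow path learned back into the fast path.
            hare.threshold = tortoise.learn_threshold()
    return decisions
```

In production the two halves run on separate infrastructure at very different cadences, which is exactly why integrating them end to end is hard; the feedback assignment in the loop above is the step that whole product categories exist to deliver.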

So here's the big question that has led us to our choice: how do you add features AND make this all work together AND perform much better? The open source community, and all the vendors around it, do a great job of adding features. And each technology quickly adopts many of the latest features of the other technologies around it. But the Achilles' heel of the open source community is that each project focuses on its own technology, and its own technology alone. That's why MapR doesn't just package up open source technologies: it packages all these technologies in one installer, flattens out and optimizes the stack, and optimizes end-to-end performance, management, and TCO behind the Hadoop APIs.

Choice is good. Some companies will choose open source, or Cloudera, or Hortonworks through an OEM, or MapR. But there is so much more innovation needed to help improve the customer experience and make companies more real-time and intelligent, and so much more simplification needed for IT organizations, that MapR, and going beyond open source, will continue to thrive.



Great article, Rob Meyer. Real-time streaming into big data, and native transformation in Hadoop without having to use proprietary engines, is the key. #OracleGoldenGate is at the forefront of streaming ingestion, and #OracleDataIntegrator provides native transformation capability without having to write a single line of code, solving real customer problems. As the space is more fluid and dynamic than ever, it is important to choose technology-agnostic tools to derive more value from data. #OracleDataIntegration does that for you.
