How Open-Source Software Enabled Big Data for the Masses


This article was originally published in the October issue of CIO Review Magazine

 

The Power of Collective Intelligence 

          In organizational theory, there are two driving models for innovation: the private investment model, which allows inventors to claim financial benefits through intellectual property rights, and the collective action model, which holds that when market options aren't optimal, innovators will collaborate toward a public good without expectation of material return. Open-source software (OSS) carries both the benefits and the challenges of the latter. The strongest benefit of the collective action model is the harvesting of intelligence from a community to create a product that is insulated from the flaws or special interests of any one contributor. Contrast that with the rigid controls and narrow purpose of a commercially distributed proprietary product.

          Nowhere is this comparison more obvious than in the speculation we see from pundits attempting to forecast the server operating system of the future. Will it be Linux? Will it be Microsoft Windows? Most now realize the folly of those speculations, because it was never a zero-sum game. Both are wonderfully useful server operating systems, and both are as relevant today as they were five years ago. Public clouds are changing things a bit, but beyond the cost savings, the technology selected for a specific task will continue to be driven by backward compatibility with existing systems, and very often by the skills of the software development team. The exception is the software stack for data aggregation and processing, which, hands down, is dominated by the open-source community.

 “By 2016, at least 95% of IT organizations will leverage nontrivial elements of OSS technology within their mission-critical IT portfolios, including cases where they might not be aware of it — an increase from 75% in 2010.” (Gartner)

           With even the U.S. government creating policy for the use of open-source tools, there is no longer a question about OSS going mainstream. In fact, even among the biggest enterprises, there is a battle going on not only to demonstrate progress with open-source software, but also to be the patron of OSS projects fed back into the community. Companies like Facebook and Google are using publicly available source code to power their critical infrastructure and, at the same time, making products they build internally available through open-source licenses such as the GNU General Public License. Successful examples of this patronage are the Hadoop ecosystem, originally conceived at Yahoo, and the Cassandra NoSQL database, developed by Facebook. The most forward-looking companies are not content to just release source code into the wild; they are determined to continually out-innovate within the open-source community. Case in point: a tool my group recently started working with, Presto, originally conceived by Facebook. Presto is a near-real-time query engine created to outperform Apache Hive. With its distributed query processing, it eliminates much of the latency and disk I/O common to traditional MapReduce jobs. Presto is a great example of Facebook actively leading within, and earning the respect of, the OSS community.
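
To make the contrast with batch MapReduce concrete, below is a minimal sketch of issuing an interactive Presto query from Python. It assumes the open-source presto-python-client package; the coordinator host, user, and table name are hypothetical placeholders, not anything from a real deployment.

    # Minimal sketch: interactive SQL against a Presto cluster from Python.
    # Assumes `pip install presto-python-client`; the host and table below
    # are hypothetical placeholders.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto.example.com",  # hypothetical coordinator
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )

    cur = conn.cursor()
    # The same aggregation that would launch a batch MapReduce job under
    # Hive runs as a distributed, pipelined query under Presto.
    cur.execute(
        "SELECT page, count(*) AS views FROM web_logs "
        "GROUP BY page ORDER BY views DESC LIMIT 10"
    )
    for page, views in cur.fetchall():
        print(page, views)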

 

How Big Data Came to SMBs

          Mining insight from massive amounts of data has always been challenging, but it was never impossible. All you had to do was ETL your data into a specific schema and load it into an enterprise relational database. This model worked well, and IBM, EMC, and Oracle are masters of building sophisticated software to serve it. The problem with this approach came when there were massive amounts of data, in different formats, stored in different proprietary database products. For anyone other than the very well capitalized, transforming and connecting that data from disparate systems in any meaningful way was nearly impossible. Under this model, the technology architecture required a lot of moving parts and time to process. In addition, the database software licensing model was largely predicated on the number of cores running on the server, and sometimes on how the data would be used (internal vs. external). Scaling this model to multiple servers rapidly became not only expensive, but a logistical licensing nightmare. If you've served any amount of time as an enterprise IT leader managing compliance with an Oracle or Microsoft enterprise agreement, I bet there are times when you wished you employed a full-time license administrator.
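
For readers who never lived through it, the sketch below shows that classic extract-transform-load step in miniature, with Python's built-in SQLite standing in for the enterprise database; the file, table, and column names are hypothetical.

    # Minimal ETL sketch: extract a CSV, transform each row to a fixed
    # schema, load it into a relational table. SQLite stands in for a
    # commercial RDBMS; all names below are hypothetical.
    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(sale_date TEXT, region TEXT, amount REAL)"
    )

    with open("sales_export.csv", newline="") as f:  # hypothetical source
        for row in csv.DictReader(f):
            # Transform: coerce every record into the target schema up front.
            conn.execute(
                "INSERT INTO sales VALUES (?, ?, ?)",
                (row["date"], row["region"].upper(), float(row["amount"])),
            )
    conn.commit()

The rigidity is the point: every source had to be bent into one schema before a single question could be asked of it.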

          Consider the licensing challenge of managing to the number of cores on a server, to the number of data sources, or to how content is being shared, and map that against the need to connect historical data streams that exist in multiple places and in different formats, then present them to a potentially anonymous user base. What would that cost under an Oracle licensing model? The optimal big data architecture by its nature reaches deep into historical data and is distributed for scalability, performance, and fault tolerance. Even when done properly, is there a good way to track the myriad ways this information is being used? Do the math, and it becomes clear that without the flexibility brought by freely distributed open-source software, big data would still be outside the financial reach of most. It's because of this cost savings that the talk around big data has reached the fever pitch we see today. Most of that talk is trumpeted around Hadoop and horizontally scaled commodity hardware. Hadoop and x86 equipment have made for a match that worries even the smartest enterprise OEMs in an ROI comparison.
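
To see why "do the math" ends the argument so quickly, here is a back-of-the-envelope sketch in Python. Every figure in it is a hypothetical placeholder, not a quoted vendor price; substitute your own quotes and hardware costs.

    # Back-of-the-envelope comparison: per-core proprietary licensing vs.
    # horizontally scaled commodity Hadoop nodes. All figures below are
    # hypothetical placeholders, not quoted vendor prices.
    servers = 20          # nodes needed for the data volume
    cores_per_server = 16

    # Proprietary model: license cost scales with every core you add.
    license_per_core = 10_000      # hypothetical per-core license fee
    proprietary = servers * cores_per_server * license_per_core

    # OSS model: licenses are free; you pay for hardware and skilled labor.
    commodity_server = 6_000       # hypothetical x86 node cost
    admin_salary = 120_000         # hypothetical annual skilled-labor cost
    oss = servers * commodity_server + admin_salary

    print(f"Proprietary licensing: ${proprietary:,}")  # $3,200,000
    print(f"OSS + commodity:       ${oss:,}")          # $240,000

The specific numbers matter less than the shape of the curves: one model grows with every core you add, the other grows roughly with the hardware alone.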

 

More Than Another Arrow in Your Quiver

           Since the focus of this article is OSS, I would be remiss if I didn't advise you of other factors to consider before tossing out your Oracle/NetApp combination for Hadoop and x86. Always do your OSS analysis within the context of the other soft costs you will likely incur. Even with all this licensing flexibility, you should not assume that open-source projects will automatically translate to a lower TCO. While there are typically savings on licensing, there are often higher costs for skilled labor. When mapped against the cost of closed-source software, OSS tends to be heavier in operating cost: OSS staff will likely require a different, and often very nascent, set of skills than those required by privately packaged software. Packaged software often comes with more predictable support options, and even among the enterprise vendors offering support for OSS, I have always found these offerings fall short when compared to the support provided by Microsoft, IBM, EMC, or Oracle for their own products. With the need for specialized skills comes the need for regular training. Managing source code under a GPL requires a distinct legal review, separate from internal and proprietary software development efforts, particularly if you are going to distribute OSS software to third parties. It is also important to remember that among OSS projects, the maturity of the product and its documentation is often driven by the size of the developer community, so you must assess the feasibility of each OSS project on its own merits. Even with all this, the OSS ecosystem will continue to drive the future of data processing, and indeed is the very reason this big data buzz exists in the first place.

If you have questions about this article, or a general question about introducing Hadoop, Presto, Cassandra, or other open-source projects into your infrastructure, feel free to ping me here on LinkedIn.
