Halecium - A Next Generation of NiFi

Overview

Moving data from producer to consumer, or source to sink, has been one of the main obstacles to having data where you need it, when you need it, and how you need it since the beginning of computing. Whether that meant moving punch cards or large magnetic tape reels around, or transmitting Internet of Things (IoT) data across the internet, the issue is the same: how do you get the data from its source to somewhere it can be used? There have been many different methods of solving or getting around this issue.

From the early days of computing until now, one method of data movement has been to literally move the data. This could involve shipping large reels, hand-carrying flash drives, or sending a hard drive via a package delivery company. These methods obviously cause major lags in data availability, although they do offer certainty of delivery (unless your hard drive gets lost in the mail). Alternatively, large integrated Extract/Transform/Load (ETL) platforms were developed to connect directly to remote sources and extract data, transform the data to a usable format, and load the data into data sinks so that it was ready to be used later. While these systems were a huge step forward in the evolution of data movement, they are large, expensive, and complex beasts that require dedicated personnel to implement and maintain, and they have tended to be inflexible without major capital outlays. Their size also limits their ability to deploy to the edge of data networks, so they mostly require a centralized data infrastructure with dedicated hardware. Their complexity means that data again lags in availability (due to transfer and processing times), and delivery can be fatally interrupted by intermittent connectivity or constrained data pipes.

Next, solutions (such as Splunk) emerged that attempted to hybridize centralized data movement infrastructures to move some of the processing to the edge of the data network. While this was also an incremental improvement, cost was still an issue, the range of remote processing was limited, and this approach still required a centralized hub to manage the remote feeders and do the heavier processing.

The most recent evolutionary step in data mobility came in the form of Apache NiFi (NiFi). NiFi originated as a project named Niagara Files at the National Security Agency (NSA), where it was in production use for many years. The NSA open sourced NiFi under the Apache Foundation, and through community involvement and the dedicated team at the shepherding company Onyara (now part of Hortonworks), NiFi was brought to market and has proven to be a revolutionary tool for collecting data at the edge, lightly transforming it, and then moving it to where it needs to go, when it needs to be there. NiFi is not an ETL tool; it is more an EtL tool, handling only minimal transformation (mainly filtering, parsing, forming, etc.) in favor of quickly and securely moving the data.

The next evolutionary step is Halecium, which will be introduced below.

 

NiFi Advantages

NiFi is a data movement tool written in Java that allows data to be securely moved from point to point without loss due to connectivity and bandwidth issues. NiFi also allows prioritization of data, so that when connectivity and bandwidth are constrained, you can be sure to get the data you need most first; the rest will come when the connection allows. This, along with the ability to run on almost any machine with a JVM, allows NiFi to be pushed to the edge of the data network so that data can be handled and routed at the source rather than at some centralized location. The result is getting your data faster and exactly where you need it, increasing the value of time-sensitive data and reducing your data infrastructure costs.

As part of NiFi's security focus, NiFi has an innate site-to-site communication protocol that allows multiple NiFi instances to securely speak to each other as if they were the same instance. This allows for secure bidirectional data flows, clustering of NiFi instances, and the simplification of data flows by building one flow on another. NiFi also has fully integrated data provenance that allows cradle-to-grave tracking of the when, what, and where of data from the time NiFi receives it to the time NiFi releases it (more on this in the disadvantages).

NiFi also uses the concept of processors to define functions and allow users to build up functionality through a graphical interface. Each processor has one function, and a processor can be added to handle any E, light t, or L functionality needed (i.e., connect to a database, parse data, send data to a custom application, etc.). Because NiFi is open source, if the processor you need doesn't exist, you can just add one (again, more on this in the disadvantages). This allows NiFi to connect to many sources and move data to many sinks.

NiFi has many, many more advantages, and those who are interested are advised to visit the Apache NiFi or Hortonworks DataFlow (HDF) websites for more details; we will not go through a full list of NiFi functionality here. StreamSets (which is very similar in concept to NiFi) has also emerged; however, it is not yet mature, is proprietary (and potentially expensive), and will also not be covered here.

 

NiFi Disadvantages

NiFi is not a panacea to cure all the world's data ills, nor does it pretend to be; no single tool is. However, there are some specific disadvantages that Halecium looks to address.

NiFi is an open source Apache project and as such holds to the Apache ideal of community involvement and mostly community-driven decision making. While much of the roadmap planning and work is done by the core team, anyone can attempt to contribute to NiFi. As such, despite the best efforts of the core team to adhere to NiFi's principles and code quality standards, the quality of submissions varies widely. For example, some follow data provenance guidelines closely; some don't record provenance at all. Some maintain dependencies per best practice; some don't. Some test thoroughly; some don't. Because of this, functionality for your specific needs cannot be assumed (regardless of whether it is technically supported or not), and each use case requires extensive unit and integration testing to ensure functionality (the more complex or outside-the-box the case, the truer this is), beyond what would be expected of a production-level tool. Oftentimes even simple use cases can require code modifications, which add cost and complexity to implementation and require further testing. Additionally, being a community open source project, NiFi can change immensely from version to version (as can the UI), and the pace of change is rapid. Since much of NiFi is interwoven at a very low level, this can mean supporting multiple versions at once in order to try to meet all your needs.

Also, since NiFi is a community open source project, it is subject to code bloat at a much faster rate than could be expected of a more tightly controlled application. As a system that has evolved from its original purposes, some of the design considerations at the core of NiFi are now major limitations. For example, the interweaving of code at the base level of NiFi does not allow for full modularity of its different sections. Also, as there is no repository for custom code outside the core NiFi project, custom cases tend to make their way into the main NiFi code. These again lead to code bloat (and much duplication of work between implementations). As such, NiFi has a relatively heavy memory and processing footprint on the machines it runs on. While in most cases this is not an issue, it does cause problems when running NiFi as a background listener on laptops or on small edge devices. While NiFi can run on something as small as a Raspberry Pi, you're not going to be able to run much else on it.

To get around this issue, MiNiFi is being developed to reduce the footprint of NiFi. Currently, MiNiFi comes in two flavors: a Java version with a smaller footprint, and a C++ version to run on very small devices. All this means that to support your full needs, you may need to run NiFi (possibly multiple versions) with up to two major versions of MiNiFi with sub-versions, all possibly with custom work.

There are other technical limitations; however, these tend to run into matters of design and technical opinion and as such will not be covered here. As suggested above, if you are interested, please visit the websites, pull down NiFi and run it, and/or check the developer forums. NiFi is a great tool managed by very dedicated people, and everyone should judge for themselves whether it fits their needs.

 

The Case for Halecium

Halecium is born of a fundamental disagreement with the direction of NiFi and MiNiFi since NiFi 1.0. This relates not only to what the roadmap should be, but also to what the nature of NiFi itself should be. There is no right answer to this question, as everyone can have a different opinion; however, one of the advantages of open source is that if you don't like something, you can go do it differently yourself.

Much of the business case for Halecium is the same as the business case for NiFi. By moving data faster, more securely, to where you want it, and from close to the source, the value of the data is potentially increased by making it more temporally relevant, and the cost of infrastructure is decreased by moving much of the E and L processing closer to the source and reducing the L footprint needed (think temporary data sinks in data warehouse projects). Additionally, by allowing connections to nearly any source and any sink, the value extends to almost any platform. Here things begin to diverge, however. While NiFi is an open source project, much of the team behind it works for Hortonworks. As such, there is an understandable business interest in integrating NiFi more closely with Hortonworks' version of the Hadoop platform (such as with Ambari), in the interest of enhancing the value to Hadoop customers and solving needs particular to the Hadoop community. There is nothing wrong with this model, and as an open source project, if the community so allows then so be it. However, I believe that NiFi should be a fully generic EtL tool that favors no platform, so as to maximize the potential benefits across all enterprise applications. As I have communicated in other portions of my framework, generic flexibility is a core belief I hold in maximizing the value to all enterprises. As stated in the disadvantages section, there is no right or wrong answer here; it is just a matter of opinion and design preference.

Because Halecium is new and has the advantage of learning from the experience of NiFi (and my experiences as a NiFi contributor and architect), Halecium can be designed fresh from the ground up for the purposes that NiFi now serves, rather than its original purposes for the NSA. This leads to several major design changes:

·       All functional sections will be fully modular

·       There will be no internal web server

·       Derby is being replaced by PostgreSQL

·       Versioning will be done on a module level

·       C++ 11 will be the language of the core

·       NiFi and MiNiFi use cases will be supported by Halecium without there being separate versions

·       This will not be open source in the Apache sense, but more of an open source/source included hybrid

·       ETL is allowed (with T modules)

·       Processors may handle one full logical task rather than one function of a logical task

·       Unlike MiNiFi C++, non-standard libraries will not be used or accepted

·       Custom additions are accepted and sequestered in a separate module for community use

·       Configuration can be done remotely through connected Halecium nodes

·       Data provenance will extend beyond Halecium

 

Halecium Specific Differences at a High Level

As discussed in the above sections, Halecium is meant as a next generation of NiFi and not as a clone. No NiFi code will be reused in Halecium; in fact, it is in a different language entirely. That being said, many of the concepts will be similar, and where possible a best effort will be made to share terminology so as to make moving back and forth between the tools as easy as possible.

C++ 11

Why C++ 11 rather than Java? 

Part of the need for the split between NiFi and MiNiFi (especially MiNiFi C++) is the nature of the JVM itself and the minimum footprint required to run it. The nature of the JVM also impacts the ability of NiFi/MiNiFi to run on Windows machines. Switching to C++ solves this issue and allows for a wider array of hardware configurations. While C could also solve this issue, C is not an option for coding on a timeline.

Additionally, there is the specter of Oracle hanging over all of Java (and MySQL). Oracle's aggressive monetization of the open source IP it acquired from Sun means that using Java is becoming progressively more financially risky. While OpenJDK has attempted to answer this on Linux (although Oracle is party to that as well), there is not yet a viable Windows solution freely available from anyone but Oracle. As such, it's better to avoid Java altogether for applications requiring cross-platform support, so as not to introduce future liabilities to Oracle.

C++ 11 toolchains are standard on most Linux and Windows machines, and as such Halecium should be able to support nearly any deployment (post-build). Additionally, since nothing outside of standard C++ 11 will be used (i.e., no Boost), Halecium can be expected to be production-conformance friendly.
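To make the standard-only rule concrete, here is a minimal sketch of the kind of building block this implies: a thread-safe flow queue written entirely against the C++ 11 standard library (std::mutex, std::condition_variable), with no Boost dependency. The class name is illustrative, not part of any actual Halecium design.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Illustrative sketch: a thread-safe flow queue using only the C++ 11
// standard library, per the "no non-standard libraries" rule above.
class FlowQueue {
public:
    void push(std::string item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available, then removes and returns it.
    std::string pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        std::string item = std::move(items_.front());
        items_.pop();
        return item;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> items_;
};
```

Everything here compiles with any conforming C++ 11 toolchain, which is exactly the deployment property the paragraph above is after.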

Modern Java performance is mostly comparable to C++ (assuming developers with expertise), so the performance gain can be expected to be minimal except in cases of direct hardware interfacing (such as CUDA). There is a development-time tradeoff in using C++ instead of Java; however, this is more than accounted for by the issues listed above.

PostgreSQL

Why PostgreSQL instead of embedded Derby, MySQL, Cassandra, etc.?

MySQL has the same Oracle issues as Java. While I am a huge fan of Cassandra as the database design of the future, it has much too heavy a footprint for small deployments. Derby is a marginal production tool with a diminishing contributor base. H2 is an excellent embedded database, but it is Java-only and single-threaded. MongoDB does not have the performance necessary and may not have a future, with both Cassandra and PostgreSQL emulating its functionality.

PostgreSQL has undergone major improvements and has nice NoSQL functionality to go with its inherent RDBMS functionality. Additionally, PostgreSQL can be deployed in a lightweight fashion as part of the application deployment. While PostgreSQL does not match the performance of some of its competitors in all areas, it is expected to be sufficient for Halecium's needs (based on publicly available performance benchmark comparisons), and where it is not, it can be improved through involvement in PostgreSQL's open source community.

Due to the modularity of Halecium, if you prefer something else or nothing at all simply leave it out (but accept the functional consequences that will be explained at a later date).

Web Server

Why no embedded web server such as NiFi has with Jetty?

The internal NiFi Jetty web server certainly allows for ease of build and deployment. However, this internal web server arrangement limits the flexibility to deploy the web tier elsewhere, on a platform of an enterprise's choosing. By making the web server external to the core modules, Halecium can better conform to varying needs and standards at almost no complexity cost for the end user.

Modularity

Why change from how NiFi is structured?

NiFi is meant to be modular, allowing functionality to be added and removed as needed. While this is true to some extent, much of NiFi is interwoven at a low level due to both the original design and open source coding (hence the need for MiNiFi Java). This also means that versioning is at the top NiFi level and not the sub-module level, so to change the version you have to change your entire NiFi build (in most cases). This also leads to bloat in the size of NiFi, limiting its ability to squeeze onto smaller devices or into the background.

By going fully modular, Halecium allows you to have only the functionality you truly need. This means that Halecium will have a much smaller size and footprint, in addition to the gains from not needing a JVM. This will allow Halecium to scale from very small IoT devices all the way up to servers without needing different versions. Additionally, since modules are fully independent, versioning can be done at the module level. If a module change doesn't affect your modules, just ignore it. This should reduce version thrashing.
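Per-module versioning can be sketched as follows: each module reports its own version, and a registry resolves compatibility module-by-module, so upgrading one module never forces a whole-build version change. This is a hypothetical illustration; the names (ModuleInfo, ModuleRegistry) are invented for this sketch and are not a published Halecium API.

```cpp
#include <map>
#include <string>

// Hypothetical sketch of module-level versioning. Each module carries its
// own version, checked independently of every other module.
struct ModuleInfo {
    std::string name;
    int major;
    int minor;
};

class ModuleRegistry {
public:
    void install(const ModuleInfo& mod) { modules_[mod.name] = mod; }

    // True when the installed module matches the requested major version
    // and offers at least the requested minor version.
    bool satisfies(const std::string& name, int major, int minor) const {
        auto it = modules_.find(name);
        if (it == modules_.end()) return false;
        const ModuleInfo& m = it->second;
        return m.major == major && m.minor >= minor;
    }

private:
    std::map<std::string, ModuleInfo> modules_;
};
```

The point of the sketch is the granularity: a flow that only requires the provenance module at 1.x is unaffected when an unrelated module moves to a new major version.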

Apache Open Source vs open source

NiFi has embraced the Apache open source concepts with open arms. Halecium will not be open source in the same way. While the source for Halecium will be made available and contributions accepted, the roadmap and code will be tightly controlled. With Halecium, if the community disagrees, its input will be taken under advisement, but the roadmap may or may not be altered. With NiFi, the community is the driver (to a point).

For Halecium, this means you can expect a clear and consistent roadmap and higher quality standards at the cost of slower development and innovation. Some may consider this source included rather than open source, however community involvement will still be accepted unlike a pure source included project. As with other things here, this is a matter of preference rather than right or wrong and a focus on terminology is a waste of time and effort.

ETL

NiFi is consciously governed to not allow transformations in the traditional ETL sense. As stated above, it can be thought of as either an EL tool or an EtL tool. There are various reasons for this, not least of which are complexity and footprint.

Halecium will allow for ETL, but in the sense of EL and T. Being that Halecium is fully modular, a T module would be perfectly acceptable. This is not a focus however, and will be something added either as needs require it or the community wishes to add it. Primarily, Halecium is about the E and the L just as NiFi is.

Processor Scope

By standard, NiFi processors handle only one function. To perform a logical task, a flow normally must contain multiple processors. For example, extracting data from XML, changing the data, and then updating the XML may require three processors.

While Halecium supports this model, Halecium by standard also allows for logical groupings of functions within one processor. As such, Halecium has multiple classes of processors (i.e., function, task, etc.), whereas NiFi has just one (function). The goal is to reduce the complexity for users who may not know all the functions needed to accomplish one task, and to reduce unneeded clutter within the flow. The performance gain of task processors over strings of function processors should be minimal, but not insignificant on complicated flows with lots of data flowing through them.
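The two processor classes can be sketched like this: a function processor performs one step, while a task processor bundles an ordered list of steps behind a single flow element. The class names and interface are illustrative assumptions, not an actual Halecium API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of processor classes: one function per processor
// (the NiFi model) versus one logical task per processor.
class Processor {
public:
    virtual ~Processor() {}
    virtual std::string process(std::string input) = 0;
};

// A function processor wraps exactly one step.
class FunctionProcessor : public Processor {
public:
    explicit FunctionProcessor(std::function<std::string(std::string)> fn)
        : fn_(std::move(fn)) {}
    std::string process(std::string input) override {
        return fn_(std::move(input));
    }
private:
    std::function<std::string(std::string)> fn_;
};

// A task processor stands in for a string of function processors,
// applying its steps in order inside a single flow element.
class TaskProcessor : public Processor {
public:
    void add_step(std::function<std::string(std::string)> fn) {
        steps_.push_back(std::move(fn));
    }
    std::string process(std::string input) override {
        for (auto& step : steps_) input = step(std::move(input));
        return input;
    }
private:
    std::vector<std::function<std::string(std::string)>> steps_;
};
```

A user who doesn't know that a task decomposes into three functions simply drops in one task processor; the flow graph stays uncluttered either way.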

Custom Repos

Long on the NiFi roadmap, but recently mostly abandoned, are public custom repos. The idea of the custom repo is to allow the sharing of processors within the community so as to reduce duplicated work. As NiFi has not had this, custom code either ends up in NiFi itself (resulting in bloat) or is lost to the ether and reinvented each time it's needed.

Halecium will include a module that will allow for custom processors to be included that may be of use to the community. Additionally, flows for specific tasks may be included to also reduce rework and to allow for iterative improvement.

Remote Configuration

Halecium will contain secure site-to-site capabilities similar to NiFi's. However, Halecium will go one step further in allowing the configurations of attached nodes to be set through other connected nodes. In practice, this means that if you have thousands of Halecium nodes embedded on IoT sensors, all of their processor and internal configurations can be changed through one connected central node. This aims to simplify configuration without the need to visit each flow or to integrate with a platform-specific configuration manager (i.e., NiFi with Ambari).
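The fan-out pattern described above can be sketched as a central node broadcasting a key/value setting to every connected node. The transport, class names, and key names here are all invented for illustration; a real implementation would ride on the secure site-to-site channel.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of remote configuration: a central node pushes one
// setting to every connected edge node, so thousands of embedded
// instances can be reconfigured from one place.
class Node {
public:
    void apply(const std::string& key, const std::string& value) {
        config_[key] = value;
    }
    std::string get(const std::string& key) const {
        auto it = config_.find(key);
        return it == config_.end() ? "" : it->second;
    }
private:
    std::map<std::string, std::string> config_;
};

class CentralNode {
public:
    void connect(Node* node) { peers_.push_back(node); }
    // Propagate one setting to every connected node.
    void broadcast(const std::string& key, const std::string& value) {
        for (Node* n : peers_) n->apply(key, value);
    }
private:
    std::vector<Node*> peers_;
};
```

The administrator touches the central node once; every sensor node picks up the change without anyone visiting its individual flow.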

Data Provenance

NiFi’s data provenance ability extends only to NiFi’s borders. Halecium’s data provenance extends beyond its borders to the source or the destination.

So long as they conform to Halecium's expected format, Halecium will accept data provenance entries and store them along with those from Halecium's own actions. Halecium can then pass this data provenance on to the destination of your choice, so that all data provenance can be aggregated in one place (such as a Data Provenance/Data Governance engine).
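The accept-if-conformant idea can be sketched as a log that validates incoming entries against an expected shape and stores them alongside Halecium's own, ready to be forwarded downstream as one trail. The field set here is an assumed placeholder; the actual expected format is not specified in this article.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of border-crossing provenance: entries from
// upstream systems are accepted when they conform to the expected
// format (assumed here to be three required fields) and are stored
// alongside entries from Halecium's own actions.
struct ProvenanceEntry {
    std::string source;   // which system produced the entry
    std::string event;    // e.g. "RECEIVE", "TRANSFORM", "SEND"
    std::string data_id;  // identifier of the data the entry describes
};

class ProvenanceLog {
public:
    // Accept an entry only when it carries every required field.
    bool accept(const ProvenanceEntry& e) {
        if (e.source.empty() || e.event.empty() || e.data_id.empty())
            return false;
        entries_.push_back(e);
        return true;
    }
    // The aggregated trail, ready for a downstream governance engine.
    const std::vector<ProvenanceEntry>& trail() const { return entries_; }
private:
    std::vector<ProvenanceEntry> entries_;
};
```

An upstream ETL system's entries and Halecium's own then arrive at the governance engine as a single, ordered history for each piece of data.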

Others

Other differences of both a technical and aesthetic nature will be present between NiFi and Halecium, however those will be discussed at a later date during or after initial implementation.

 

Conclusion

While NiFi was a major step forward in the evolution of data movement tools, Halecium will be the next step. The modularity and performance characteristics of Halecium will allow deployments on a much greater range of devices than NiFi, without the need for separate MiNiFis. Additionally, since Halecium has a much lighter footprint, it can be deployed on laptops, servers, desktops, etc. in order to retrieve relevant log data and facilitate data governance operations.

Halecium maintains the security, ability, and flexibility of NiFi but incorporates lessons learned that will increase business value and make implementation and maintenance easier.


For questions or comments, please contact me at daniel.cave@s2cinteractive.com
