Hazelcast Jet: High Performance Stream Processing with DAG
Riaz Mohammed

Hazelcast Jet® is a high-performance distributed stream processing engine from Hazelcast. Jet can be used to develop stream processing applications as well as batch processing applications. Jet uses a directed acyclic graph (DAG) at its core: jobs modelled as DAGs are processed across multiple nodes in a distributed fashion.

What is a DAG?

A graph is a set of vertices connected by edges. In a directed graph, each edge goes only one way, so data flow is unidirectional. A directed acyclic graph (DAG) is a directed graph that also has no cycles: there is no way to start at any vertex and follow a consistently directed sequence of edges that loops back to that same vertex. Let's look at some visual representations, in which a vertex is drawn as a circle and the edges as lines.

fig:a) Example DAG

As seen above, each edge in a DAG is directed from an earlier vertex to a later vertex in a sequence. Such a sequence is known as a topological ordering of the graph. Each vertex takes an inbound event, does some computation and pushes the result to its outbound edge to be processed by the next vertex.
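To make the idea of a topological ordering concrete, here is a minimal sketch in plain Java using Kahn's algorithm. It is not Jet code: the class name and the vertex names a, b, c and d are made up for illustration. It also shows the "acyclic" part, because a graph with a loop has no such ordering and the method rejects it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny adjacency-map graph, not Jet's DAG class.
final class TopologicalOrder {

    /** Returns the vertices in a topological order; throws if the graph has a cycle. */
    static List<String> sort(Map<String, List<String>> edges) {
        // Count the inbound edges of every vertex.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) inDegree.put(v, 0);
        for (List<String> targets : edges.values())
            for (String t : targets) inDegree.merge(t, 1, Integer::sum);

        // Vertices with no inbound edges can be placed first.
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((v, d) -> { if (d == 0) ready.add(v); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            // "Remove" v's outbound edges; a target becomes ready once all its inputs are placed.
            for (String t : edges.getOrDefault(v, List.of()))
                if (inDegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        // Vertices on a cycle never reach in-degree zero, so they are never placed.
        if (order.size() != inDegree.size())
            throw new IllegalArgumentException("graph contains a cycle, so it is not a DAG");
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dag = new LinkedHashMap<>();
        dag.put("a", List.of("b", "c"));
        dag.put("b", List.of("d"));
        dag.put("c", List.of("d"));
        dag.put("d", List.of());
        System.out.println(sort(dag)); // prints [a, b, c, d]
    }
}
```

Every edge in the example points from a vertex that appears earlier in the printed list to one that appears later, which is exactly the property described above.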

fig:b)

The above figure shows a general process flow represented as a DAG. It flows in one direction only, left to right, and it is not possible to loop back to an earlier node. Functions such as welcome, age-check and gender-check are performed in order, and there is no way to go back to the “welcome” stage once it has completed and the process has moved on to age-check. This means that each stage processes its incoming events independently, without interacting with any other stage.

Why is this important for Hazelcast Jet?

Jet models a data processing computation as a DAG, with every vertex a computation step and every edge representing data flow. As we have already seen, a vertex performs its computation on incoming data independently of any other vertex, which makes it easy to parallelize and distribute the computation across Jet nodes.
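A rough way to see why independent stages parallelize well is to model each stage as a pure function and chain them. The sketch below is plain Java, not Jet: the stage names are borrowed from the process flow described earlier, and what each stage does (appending its own name to the event) is purely a made-up placeholder. Because no stage touches shared state, running the events through a parallel stream gives the same result as running them one by one.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: stages as stateless functions, chained in one direction.
final class IndependentStages {
    // Hypothetical stages; each one sees only its own input event.
    static final Function<String, String> WELCOME = e -> e + ">welcome";
    static final Function<String, String> AGE_CHECK = e -> e + ">age-check";
    static final Function<String, String> GENDER_CHECK = e -> e + ">gender-check";

    // The "edges": welcome -> age-check -> gender-check, with no way back.
    static final Function<String, String> FLOW =
            WELCOME.andThen(AGE_CHECK).andThen(GENDER_CHECK);

    /** Pushes every event through the whole flow, sequentially or in parallel. */
    static List<String> run(List<String> events, boolean parallel) {
        return (parallel ? events.parallelStream() : events.stream())
                .map(FLOW)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> events = List.of("visitor-1", "visitor-2", "visitor-3");
        // Both runs produce the same list, in the same order.
        System.out.println(run(events, false));
        System.out.println(run(events, true));
    }
}
```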

In Jet, both the edge and the vertex are distributed entities. A vertex is a ‘processor’ which performs a computation on incoming data. An edge between two vertices is implemented with many data connections, both within a member and between members. To distribute the processing, Jet has to direct each data item to the relevant vertex to perform the computation, and it does so with a method called data partitioning. Jet uses a consistent algorithm to compute a partition key for each data item, so that all related items map to the same key. This allows Jet to direct related items to a particular processor or vertex with little computational overhead and modest memory requirements.
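The idea behind data partitioning can be sketched in a few lines of plain Java. This is only an illustration and an assumption on my part: the article does not show Jet's actual partitioning algorithm, so the sketch uses `hashCode()` modulo a partition count as a stand-in, and the class and method names are hypothetical. The point is that the mapping is deterministic, so equal keys always land in the same partition.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for partitioning, not Jet's real partitioner.
final class PartitionSketch {

    /** Maps a key to a partition index in [0, partitionCount). Same key, same index. */
    static int partitionFor(Object key, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Routes each item to the bucket of its partition, using the item itself as the key. */
    static List<List<String>> route(List<String> items, int partitionCount) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) partitions.add(new ArrayList<>());
        for (String item : items)
            partitions.get(partitionFor(item, partitionCount)).add(item);
        return partitions;
    }

    public static void main(String[] args) {
        List<String> words = List.of("jet", "dag", "jet", "edge", "dag", "jet");
        // All occurrences of the same word end up in the same bucket.
        System.out.println(route(words, 3));
    }
}
```

In a word count, for example, this is what lets a single processor see every occurrence of a given word and therefore produce its final count on its own.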

Let’s look at how Hazelcast Jet can be used for a distributed word count by representing the computation as a DAG. If there are multiple Jet instances, the task of counting the occurrences of each word needs to be distributed across all members. The steps for the word count are:

  1. Read the lines of text
  2. Tokenize each line into separate words
  3. Accumulate/group each word
  4. Combine the counts of each word across all Jet instances

The diagram below represents this process.
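Independently of the diagram, the four steps can be sketched in plain Java to show the accumulate-then-combine shape of the computation. This is not Jet code and nothing here is distributed: each "member" is simply a list of lines, and the class and method names are hypothetical. Splitting on non-word characters and lower-casing are also my own assumptions about how to tokenize.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-JVM imitation of the two counting stages.
final class TwoStageWordCount {

    /** Steps 1 to 3: one member reads its lines, tokenizes them and counts locally. */
    static Map<String, Long> accumulate(List<String> lines) {
        Map<String, Long> local = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                // split can yield an empty first token when a line starts with punctuation.
                if (!word.isEmpty()) local.merge(word, 1L, Long::sum);
            }
        }
        return local;
    }

    /** Step 4: merge the partial counts of all members into the final totals. */
    static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>(); // sorted only to make the output readable
        for (Map<String, Long> partial : partials)
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> member1 = accumulate(List.of("the quick fox", "the dog"));
        Map<String, Long> member2 = accumulate(List.of("The fox"));
        System.out.println(combine(List.of(member1, member2)));
    }
}
```

The benefit of counting locally first is that only one partial count per distinct word, rather than every single word, has to travel to the combining step.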

The code below creates the DAG for the word count example using the DAG Core API of Jet, which is very descriptive.

However, the preferred programming model for Hazelcast Jet is the ‘Pipeline API’, which provides a simpler way of representing computations that Jet automatically converts to a DAG.

A pipeline is a network of interconnected stages. The stages form a directed acyclic graph (DAG), and the connections between them indicate the path along which the data flows, with data processing occurring at each stage. Below is how the same word count example could be developed using the Pipeline API.

It looks simpler and more concise. The above pipeline shows how the different stages, such as reading text from a map, splitting it into words, grouping the words and then aggregating them, can be defined and chained in a way that is easy to develop and understand. The Pipeline API is similar to the Java Stream API, which makes the learning curve for getting started with Jet a lot shorter.
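For comparison, here is a word count written with the standard Java Stream API (`java.util.stream`). To be clear, this is not the Jet Pipeline API and it runs in a single JVM; the class name and the tokenizing rule are my own assumptions. It is included only to show the chained, stage-by-stage style that the Pipeline API resembles.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: java.util.stream, not Hazelcast Jet.
final class StreamWordCount {

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()                                   // read the lines
                .flatMap(line -> Arrays.stream(                 // split each line into words
                        line.toLowerCase(Locale.ROOT).split("\\W+")))
                .filter(word -> !word.isEmpty())                // drop empty tokens
                .collect(Collectors.groupingBy(                 // group by the word itself
                        Function.identity(),
                        Collectors.counting()));                // and count each group
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("the quick fox", "the dog", "The fox")));
    }
}
```

Each call in the chain corresponds to one of the stages named above: read, split, group and aggregate.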

Hazelcast Jet is built on top of Hazelcast IMDG®, the leading open source in-memory data grid with tens of thousands of installed clusters. Hazelcast Jet processing jobs take full advantage of the distributed in-memory data structures provided by Hazelcast IMDG. Hazelcast Jet is a 3rd generation big data processing engine. It is an application-embeddable, distributed computing platform for fast processing of big data sets. A broad selection of useful demo applications built using Hazelcast Jet is available for download and use at https://jet.hazelcast.org/demos/
