Improve Streamsets pipelines performance with loose coupling architecture - Part 2

This is the second part of the blog post I published earlier. At the end of the first part, I discussed the disadvantages of a design based on orchestration and a 1:1 ratio between jobs and Documents.

To address these disadvantages, let's go even further with decoupling and stage the intermediate data. The first pipeline would stage Document data in a shared location, from where the second pipeline could read it in parallel streams.

[Image: decoupled pipelines with staged intermediate data]

For example, let's stage Documents in a Kafka topic (this option may be especially handy when staging CDC data). Parallelism for Items extraction would be achieved by partitioning the topic. When writing into the Kafka topic, the partitioning key could be derived either from some field(s) of a Document or from their hash; the latter can be computed, for example, with a Groovy evaluator:

[Image: Groovy evaluator computing the hash partitioning key]
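As a rough sketch of the logic that the Groovy evaluator implements (shown here in Python rather than Groovy; the `customerId` field and the partition count are hypothetical, substitute the field(s) that give an even spread in your data):

```python
import hashlib

NUM_PARTITIONS = 8  # must match the Kafka topic's partition count

def partition_key(document: dict) -> str:
    """Derive a stable partitioning key from a Document's fields.

    Hashing keeps the key compact and spreads Documents evenly,
    while still routing related Documents to the same partition.
    """
    raw = str(document["customerId"]).encode("utf-8")
    digest = hashlib.md5(raw).hexdigest()
    # Reduce the hash to a partition number in [0, NUM_PARTITIONS)
    return str(int(digest, 16) % NUM_PARTITIONS)
```

The same Document always produces the same key, so ordering per Document is preserved within its partition.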

That hash value would be used in the Kafka destination:

[Image: Kafka destination configured with the hash as the partition key]

For Items extraction, the second pipeline should start with a Kafka origin reading the topic, followed by a REST call to extract Items. Make sure the Max Batch Size in the Kafka origin is not set too high; otherwise the risk of OOM errors would strike back.

In the job definition for Items extraction, we should set the number of parallel pipeline instances; they will be spread across all available SDCs, with each instance reading one partition:

[Image: job definition with the number of pipeline instances]
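Kafka's consumer-group coordinator handles the spread automatically; purely to illustrate the intended outcome (one partition per instance when the counts match), a round-robin assignment can be sketched as:

```python
def assign_partitions(num_partitions: int, num_instances: int) -> dict:
    """Round-robin assignment of topic partitions to pipeline instances.

    This only models the behaviour; in reality the Kafka group
    coordinator assigns partitions to the consumers at runtime.
    """
    assignment = {i: [] for i in range(num_instances)}
    for p in range(num_partitions):
        assignment[p % num_instances].append(p)
    return assignment
```

With 8 partitions and 8 instances every instance gets exactly one partition; with fewer instances each picks up several, and extra instances beyond the partition count would sit idle.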

Here, if the Items extraction pipeline fails, it can easily be retried, because topic offsets are persisted by Kafka.

Kafka is low-hanging fruit here: it naturally fits the requirements of this scenario. To practice this design, I recommend the StreamSets course on loosely coupled architecture. But what if you only have file storage, say, in Azure Data Lake?

The Documents pipeline would be almost the same: create a partitioning key and use it in the target directory's path:

[Image: target directory path built from the partitioning key]
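A minimal sketch of how the target path could be built from the same hash-based key (the `partition=<n>` directory layout and the `customerId` field are assumptions for illustration):

```python
import hashlib

def target_directory(base: str, document: dict, num_partitions: int) -> str:
    """Build the staging path for a Document in the file store.

    Reuses the hash-based partitioning idea from the Kafka variant,
    so the writing and reading sides agree on the layout.
    """
    digest = hashlib.md5(str(document["customerId"]).encode("utf-8")).hexdigest()
    partition = int(digest, 16) % num_partitions
    return f"{base}/partition={partition}"
```

In the actual pipeline this expression would live in the destination's directory-path template rather than in code.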

When extracting Items, the difference from the Kafka use case is that instead of a single job running multiple pipeline instances, we should run multiple jobs, each addressing a specific partition.

The pipeline for Items extraction would accept a partition number as a parameter and read from the corresponding partition directory in Azure Data Lake Storage:

[Image: Items pipeline reading from a parameterised partition directory]
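The reading side can be sketched the same way, assuming the hypothetical `partition=<n>` layout from above and JSON Documents files:

```python
from pathlib import Path

def list_partition_files(base: str, partition: int) -> list:
    """Files the Items pipeline would pick up for its partition.

    The directory layout mirrors the writing side, so each job
    only ever sees the files of its own partition.
    """
    part_dir = Path(base) / f"partition={partition}"
    return sorted(str(p) for p in part_dir.glob("*.json"))
```

In the real pipeline the origin's directory path would simply contain the runtime parameter, e.g. something like `.../partition=${PARTITION}`.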

To create and start parallel jobs from Job Templates, we need to generate the list of partitions and then use the Start Jobs stage. For demo purposes I've simply hardcoded the partitions list, but it could be generated more "dynamically":

[Image: generating the list of partitions]
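Generating the list instead of hardcoding it is straightforward; as a sketch, one parameter map per Job Template instance (`PARTITION` is a hypothetical runtime-parameter name, use whatever your Items pipeline declares):

```python
def partition_parameters(num_partitions: int) -> list:
    """One runtime-parameter map per job to start from the Job Template.

    Deriving the list from a single partition-count setting keeps
    the writing side and the number of reader jobs in sync.
    """
    return [{"PARTITION": str(p)} for p in range(num_partitions)]
```

Changing the partition count then only requires updating one value instead of editing a hardcoded list.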

These values are then passed as partition parameters when starting Jobs from Job Templates:

[Image: Start Jobs stage launching Jobs from Job Templates]

Note that the generated Jobs are started in the background and are not deleted upon completion; in fact, we don't want them to complete. These jobs constantly poll the source directory and pick up Documents files as soon as they appear. However, it is also possible to run them in Batch mode.

In both these examples, the Documents and Items pipelines are truly decoupled: they work independently, and each can be restarted after a failover without data loss. Control Hub's administrative overhead for job management is minimised. Lastly, the only data persisted in the log is a partition number, which can hardly be called sensitive.

These examples are not meant to be exact blueprints for your pipelines; the final design is up to the data engineer. Taking into account your business requirements, as well as the source and target systems' logic and performance implications, you should be able to use these patterns to build robust, fast and scalable data pipelines, working in either batch or streaming mode.
