Improve StreamSets pipeline performance with a loosely coupled architecture - Part 2
This is the second part of the blog post I published earlier. At the end of the first part, I discussed the disadvantages of a design with orchestration and a 1:1 ratio between jobs and Documents.
To address these disadvantages, let’s go even further with decoupling and stage the intermediate data. The first pipeline would stage Document data in a shared location, from which the second pipeline could read it in parallel streams.
For example, let’s stage Documents in a Kafka topic (this option may be especially handy when staging CDC data). Parallelism for Items extraction would be achieved by partitioning the topic. When writing into the Kafka topic, the partition key could be built either from some field(s) of a Document or from their hash – the latter can be computed, for example, with a Groovy evaluator:
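As an illustration, here is a minimal sketch of such a Groovy Evaluator script; the field name documentId and the partition count of 8 are assumptions made for this example:

```groovy
// Minimal sketch of a Groovy Evaluator script: derive a partition key for each
// Document and store it in a record header attribute for the Kafka destination.
// The field name 'documentId' and the partition count of 8 are assumptions.
records.each { record ->
  try {
    def docId = record.value['documentId']
    // Stable hash of the business key, folded into the number of topic partitions
    def partitionKey = (Math.abs(docId.hashCode()) % 8).toString()
    record.attributes['partitionKey'] = partitionKey
    output.write(record)
  } catch (e) {
    error.write(record, e.toString())
  }
}
```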
That hash value would be used in the Kafka destination:
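Assuming the key was stored in a partitionKey header attribute as in the sketch above, the Kafka destination could be configured to partition by expression – roughly along these lines (property names may differ slightly between SDC versions):

```
Topic: documents
Partition Strategy: Expression
Partition Expression: ${record:attribute('partitionKey')}
```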
For Items extraction, the second pipeline should start with a Kafka origin reading the topic, followed by a REST call to extract the Items. Make sure the Max Batch Size in the Kafka origin is not too high; otherwise, the risk of OOM errors would strike back.
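For illustration, a possible configuration of that Kafka origin – the topic and consumer group names are just placeholders, and the batch size is deliberately kept modest so that only a limited number of Documents (and their Items) are held in memory at a time:

```
Topic: documents
Consumer Group: items-extraction
Max Batch Size (records): 100
```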
In the job definition for Items extraction, we should specify the number of pipeline instances to run in parallel; they will be spread across all available SDCs, each instance reading one partition:
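As a sketch, with an 8-partition topic the job settings would boil down to something like this:

```
Number of Instances: 8   (one pipeline instance per topic partition)
```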
Here, if the Items extraction pipeline fails, it can easily be retried, because the consumer offsets are persisted by Kafka.
Kafka is a kind of low-hanging fruit here: it naturally fits the requirements of this scenario. To practice this design, I recommend the StreamSets course on loosely coupled architecture. But what if you only have file storage – say, in Azure Data Lake?
The Documents pipeline would be almost the same – create a partition key and use it in the target directory path:
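A sketch of the relevant destination setting, assuming the partition key has again been stored in a partitionKey header attribute and that the destination exposes a directory template (as the ADLS Gen2 destination does):

```
Directory Template: /staging/documents/partition=${record:attribute('partitionKey')}
```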
When extracting Items, the difference from the Kafka use case is that instead of a single job running multiple pipeline instances, here we should run multiple jobs, each addressing a specific partition.
The pipeline for Items extraction would accept a partition number parameter and read from that partition’s directory in Azure Data Lake Storage:
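For example, with an assumed runtime parameter named PARTITION_ID, the origin’s directory could look like this (the exact property name depends on the origin you use):

```
Files Directory: /staging/documents/partition=${PARTITION_ID}
```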
To create and start parallel jobs from Job Templates, we need to generate the list of partitions and then use the Start Jobs stage. For demo purposes I’ve just hardcoded the partition list, but it could be generated more "dynamically":
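One way to generate the list in a pipeline is a small Groovy Evaluator that fans a trigger record out into one record per partition – a minimal sketch, assuming the partition count is known up front:

```groovy
// Sketch: turn one trigger record into one record per partition.
// The partition count (8) is hardcoded here, like in the demo, but it could be
// read from a pipeline parameter or looked up in the source system instead.
def numPartitions = 8
records.each { record ->
  (0..<numPartitions).each { p ->
    def partitionRecord = sdcFunctions.createRecord('partition-' + p)
    partitionRecord.value = [partition: p.toString()]
    output.write(partitionRecord)
  }
}
```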
These values are then used as partition parameters when starting Jobs from Job Templates:
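A rough sketch of the Start Jobs stage settings in that orchestration pipeline; the job template ID is a placeholder, PARTITION_ID matches the runtime parameter assumed in the Items pipeline above, and the exact layout of these properties varies between versions:

```
Job Template: enabled
Job ID: <ID of the Items extraction job template>
Runtime Parameters: {"PARTITION_ID": "${record:value('/partition')}"}
Run in Background: enabled
```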
Note that the generated jobs are started in the background and are not deleted upon completion; in fact, we don’t want them to complete at all. These jobs constantly poll the source directory and pick up Document files as soon as they appear. That said, it is also possible to run those jobs in batch mode.
In both of these examples, the Documents and Items pipelines are truly decoupled: they work independently, and each can be restarted after a failover without data loss. Control Hub’s administrative overhead for job management is minimised. Lastly, the only data persisted in the log is a partition number, which can hardly be called sensitive.
These examples are not meant to be exact blueprints for your pipelines; the final design is up to the data engineer. Taking into account your business requirements, as well as the logic and performance implications of your source and target systems, you should be able to use these patterns to build robust, fast and scalable data pipelines, working in either batch or streaming mode.