Improve StreamSets pipeline performance with a loosely coupled architecture - Part 2
This is the second part of the blog post I published earlier. At the end of the first part, I discussed the disadvantages of a design with orchestration and a 1:1 ratio between jobs and Documents.
To address these disadvantages, let’s go even further with decoupling and stage the intermediate data. The first pipeline would stage Document data in a shared location, from which the second pipeline could read it in parallel streams.
For example, let’s stage Documents in a Kafka topic (this option may be especially handy when staging CDC data). Parallelism for Items extraction would be achieved by partitioning the topic. When writing into the Kafka topic, the partition key could be built either from some field(s) of a Document or from their hash – the latter can be computed, for example, with a Groovy evaluator:
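As an illustration, here is a minimal sketch of such a Groovy Evaluator script; the field name documentId and the partition count of 8 are assumptions made for this example:

```groovy
// Minimal sketch of a Groovy Evaluator script: derive a partition key for each
// Document and store it in a record header attribute for the Kafka destination.
// The field name 'documentId' and the partition count of 8 are assumptions.
records.each { record ->
  try {
    def docId = record.value['documentId']
    // Stable hash of the business key, folded into the number of topic partitions
    def partitionKey = (Math.abs(docId.hashCode()) % 8).toString()
    record.attributes['partitionKey'] = partitionKey
    output.write(record)
  } catch (e) {
    error.write(record, e.toString())
  }
}
```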
That hash value would be used in the Kafka destination:
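Assuming the key was stored in a partitionKey header attribute as in the sketch above, the Kafka destination could be configured to partition by expression – roughly along these lines (property names may differ slightly between SDC versions):

```
Topic: documents
Partition Strategy: Expression
Partition Expression: ${record:attribute('partitionKey')}
```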
For Items extraction, the second pipeline should start with a Kafka origin reading the topic, followed by a REST call to extract the Items. Make sure the Max Batch Size in the Kafka origin is not too high; otherwise, the risk of OOM errors would strike back.
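For illustration, a possible configuration of that Kafka origin – the topic and consumer group names are just placeholders, and the batch size is deliberately kept modest so that only a limited number of Documents (and their Items) are held in memory at a time:

```
Topic: documents
Consumer Group: items-extraction
Max Batch Size (records): 100
```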
In the job definition for Items extraction, we should specify the number of pipeline instances to run in parallel; they will be spread across all available SDCs, each instance reading one partition:
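As a sketch, with an 8-partition topic the job settings would boil down to something like this:

```
Number of Instances: 8   (one pipeline instance per topic partition)
```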
Here, if the Items extraction pipeline fails, it can easily be retried, because the consumer offsets are persisted by Kafka.
Kafka is a kind of low-hanging fruit here: it naturally fits the requirements of this scenario. To practice this design, I recommend the StreamSets course on loosely coupled architecture. But what if you only have file storage – say, in Azure Data Lake?
The Documents pipeline would be almost the same – create a partition key and use it in the target directory path:
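A sketch of the relevant destination setting, assuming the partition key has again been stored in a partitionKey header attribute and that the destination exposes a directory template (as the ADLS Gen2 destination does):

```
Directory Template: /staging/documents/partition=${record:attribute('partitionKey')}
```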
When extracting Items, the difference from the Kafka use case is that instead of a single job running multiple pipeline instances, here we should run multiple jobs, each addressing a specific partition.
The pipeline for Items extraction would accept a partition number parameter and read from that partition’s directory in Azure Data Lake Storage:
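For example, with an assumed runtime parameter named PARTITION_ID, the origin’s directory could look like this (the exact property name depends on the origin you use):

```
Files Directory: /staging/documents/partition=${PARTITION_ID}
```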
To create and start parallel jobs from Job Templates, we need to generate the list of partitions and then use the Start Jobs stage. For demo purposes I’ve just hardcoded the partition list, but it could be generated more "dynamically":
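One way to generate the list in a pipeline is a small Groovy Evaluator that fans a trigger record out into one record per partition – a minimal sketch, assuming the partition count is known up front:

```groovy
// Sketch: turn one trigger record into one record per partition.
// The partition count (8) is hardcoded here, like in the demo, but it could be
// read from a pipeline parameter or looked up in the source system instead.
def numPartitions = 8
records.each { record ->
  (0..<numPartitions).each { p ->
    def partitionRecord = sdcFunctions.createRecord('partition-' + p)
    partitionRecord.value = [partition: p.toString()]
    output.write(partitionRecord)
  }
}
```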
These values are then used as partition parameters when starting Jobs from Job Templates:
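A rough sketch of the Start Jobs stage settings in that orchestration pipeline; the job template ID is a placeholder, PARTITION_ID matches the runtime parameter assumed in the Items pipeline above, and the exact layout of these properties varies between versions:

```
Job Template: enabled
Job ID: <ID of the Items extraction job template>
Runtime Parameters: {"PARTITION_ID": "${record:value('/partition')}"}
Run in Background: enabled
```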
Note that the generated jobs are started in the background and are not deleted upon completion; in fact, we don’t want them to complete at all. These jobs constantly poll the source directory and pick up Document files as soon as they appear. That said, it is also possible to run those jobs in batch mode.
In both of these examples, the Documents and Items pipelines are truly decoupled: they work independently, and each can be restarted after a failover without data loss. Control Hub’s administrative overhead for job management is minimised. Lastly, the only data persisted in the log is a partition number, which can hardly be called sensitive.
These examples are not meant to be exact blueprints for your pipelines; the final design is up to the data engineer. Taking into account your business requirements, as well as the logic and performance implications of your source and target systems, you should be able to use these patterns to build robust, fast and scalable data pipelines, working in either batch or streaming mode.