3 Tips for Stream Processing Data

Every new adventure leads to new realizations. I've spent the last several months working on stream processing with technologies including Spark, Spark Streaming, Kafka, Kafka Streams, and Kafka Connect. Below are three important things I wish I had known before I started.

A Critical Question

First and foremost, you must answer the question: "Does my decision making improve?" The complexity around stream processing is real, so make sure you have solid requirements that explain how your business will make better decisions. A proof of concept (POC) can help uncover any unknown technology concerns.

Cost

Vendors like Confluent offer a robust, stable version of Kafka. However, you will pay for it. Then again, if you attempt to implement a Kafka cluster on your own, you will pay for it in blood, sweat, and tears.

When using a vendor, consider whether they charge for read/write operations. This matters because stream processing should be a microservice-based approach; the more you break your processing into separate applications, the more read and write operations you incur. In effect, every read and write is a copy of your data.
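To make the cost effect concrete, here is a minimal back-of-the-envelope sketch. The numbers and the one-read-plus-one-write-per-stage assumption are illustrative, not vendor pricing:

```python
# Rough illustration (assumed numbers): each microservice stage in a
# streaming pipeline reads every record once and writes it back once,
# so the same payload is copied roughly once per stage.

def bytes_copied(records_per_day: int, avg_record_bytes: int, stages: int) -> int:
    """Approximate bytes read + written per day across all pipeline stages."""
    # One read and one write per record at every stage.
    return records_per_day * avg_record_bytes * stages * 2

# 10M records/day at 1 KB each: one stage incurs ~20 GB/day of I/O,
# while splitting the same work into five microservices incurs ~100 GB/day.
single_stage = bytes_copied(10_000_000, 1_000, stages=1)
five_stages = bytes_copied(10_000_000, 1_000, stages=5)
```

The point is that decomposing one processor into five doesn't just add operational overhead; on a per-read/per-write billing model it multiplies your data-copy bill roughly linearly with the number of hops.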

To reduce costs you'll have to consider:

  • What data should be included in stream processing
  • How to reduce the size of the data
  • How to reduce the read/write of the data
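The first two points above often come down to trimming each record before it ever hits the stream. A minimal sketch, with illustrative field names (the event shape and the whitelist are assumptions, not from any particular system):

```python
import json

# Hypothetical raw event; field names are illustrative assumptions.
raw_event = {
    "order_id": "o-123",
    "customer_id": "c-456",
    "total_cents": 4250,
    "full_item_list": ["sku-1", "sku-2", "sku-3"],  # large payload the stream doesn't need
    "debug_trace": "stack frames, timings, etc.",   # ditto
}

# Keep only the fields downstream consumers actually make decisions on.
STREAM_FIELDS = ("order_id", "customer_id", "total_cents")

def trim(event: dict) -> bytes:
    """Serialize only the whitelisted fields, shrinking every record on the wire."""
    return json.dumps({k: event[k] for k in STREAM_FIELDS}).encode("utf-8")

slim = trim(raw_event)
```

Whitelisting fields at the producer (rather than filtering at each consumer) means the savings apply to every read and write downstream, which compounds across a multi-stage pipeline.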

Another aspect of cost is attempting to run an Apache Kafka cluster on your own. While this is doable, it's also very complex. Here you pay the cost in terms of engineers who can operationalize and maintain the cluster.

Confluent Kafka clusters just run without issue. AWS Kafka clusters, not so much. And if you run them yourself, well...remember the comment about blood, sweat, and tears...it's a real thing :)

Some Things Really Are Batch

You can tell you have a batch process when you have to reprocess a ton of data because something has changed. Another sign is needing an intermediary store to hold all the data you'll have to reprocess. That intermediary datastore usually emerges because, as mentioned above, some event requires you to reprocess everything. If you notice this happening, just embrace the batch process and move on.

So Now What

The long and short of it is that, no matter what technologies you use, these three things seem to hold true.

I hope these little tidbits help you out! Happy streaming!