3 Tips for Stream Processing Data

Every new adventure leads to new realizations. I've spent the last several months working on stream processing with technologies including Spark, Spark Streaming, Kafka, Kafka Streams, and Kafka Connect. Below are three important things I wish I had known before I started.

A Critical Question

First and foremost, you must answer the question: "Does my decision making improve?" The complexity around stream processing is real, so make sure you have solid requirements that explain how your business will make better decisions. A proof of concept (POC) can help uncover any unknown technology concerns.

Cost

Vendors like Confluent offer a robust, stable version of Kafka. However, you will pay for it. Then again, if you attempt to implement a Kafka cluster on your own, you will pay for it in blood, sweat, and tears.

When using a vendor, consider whether they charge for read/write operations. This matters because stream processing should be a microservice-based approach; the more you break your processing into separate applications, the more read and write operations you incur. In effect, every read and write is a copy of your data.
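To make the cost effect concrete, here is a minimal back-of-the-envelope sketch. The numbers and the one-read-plus-one-write-per-stage assumption are illustrative, not vendor pricing:

```python
# Rough illustration (assumed numbers): each microservice stage in a
# streaming pipeline reads every record once and writes it back once,
# so the same payload is copied roughly once per stage.

def bytes_copied(records_per_day: int, avg_record_bytes: int, stages: int) -> int:
    """Approximate bytes read + written per day across all pipeline stages."""
    # One read and one write per record at every stage.
    return records_per_day * avg_record_bytes * stages * 2

# 10M records/day at 1 KB each: one stage incurs ~20 GB/day of I/O,
# while splitting the same work into five microservices incurs ~100 GB/day.
single_stage = bytes_copied(10_000_000, 1_000, stages=1)
five_stages = bytes_copied(10_000_000, 1_000, stages=5)
```

The point is that decomposing one processor into five doesn't just add operational overhead; on a per-read/per-write billing model it multiplies your data-copy bill roughly linearly with the number of hops.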

To reduce costs you'll have to consider:

  • What data should be included in stream processing
  • How to reduce the size of the data
  • How to reduce the read/write of the data
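The first two points above often come down to trimming each record before it ever hits the stream. A minimal sketch, with illustrative field names (the event shape and the whitelist are assumptions, not from any particular system):

```python
import json

# Hypothetical raw event; field names are illustrative assumptions.
raw_event = {
    "order_id": "o-123",
    "customer_id": "c-456",
    "total_cents": 4250,
    "full_item_list": ["sku-1", "sku-2", "sku-3"],  # large payload the stream doesn't need
    "debug_trace": "stack frames, timings, etc.",   # ditto
}

# Keep only the fields downstream consumers actually make decisions on.
STREAM_FIELDS = ("order_id", "customer_id", "total_cents")

def trim(event: dict) -> bytes:
    """Serialize only the whitelisted fields, shrinking every record on the wire."""
    return json.dumps({k: event[k] for k in STREAM_FIELDS}).encode("utf-8")

slim = trim(raw_event)
```

Whitelisting fields at the producer (rather than filtering at each consumer) means the savings apply to every read and write downstream, which compounds across a multi-stage pipeline.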

Another aspect of cost is attempting to run an Apache Kafka cluster on your own. While this is doable, it's also very complex. Here you pay the cost in terms of engineers who can operationalize and maintain the cluster.

Confluent Kafka clusters just run without issue. AWS Kafka clusters, not so much. And if you run them yourself, well...remember the comment about blood, sweat, and tears...it's a real thing :)

Some Things Really Are Batch

You can tell you have a batch process when you have to reprocess a ton of data because something has changed. Another sign is needing an intermediary store to hold all the data you'll have to reprocess. That intermediary datastore usually emerges because, as mentioned above, some event requires you to reprocess everything. If you notice this happening, just embrace the batch process and move on.

So Now What

The long and short of it is that, no matter what technologies you use, these three things seem to hold true.

I hope these little tidbits help you out! Happy streaming!