Improving Kafka Performance - A Thought
Kafka is heavily used at several companies today. So, I sat around wondering how I would approach it if I were tasked with reducing our cost of ownership for Kafka. Here is what I came up with:
Step 1: Understand the situation:
Let's start by asking some basic questions to understand our context. For example:
1) How much is our Kafka ownership cost, and how fast is it growing?
2) Is that really our biggest priority right now?
3) What does cost mean? For example:
3.1) Maintenance cost - installs, reinstalls, recovering lost data, upgrades, access control, hardware/OS maintenance
3.2) Actual hardware cost
3.3) Engineering usage cost - everyone having to learn the Kafka API, learn and tune configs, etc.
4) Have we done a basic check that our use cases for Kafka are reasonable and that Kafka is suitable for them? Are we misusing Kafka?
Step 2: List out options based on our understanding of the situation:
For the sake of argument, let's assume that #1 and #2 show that reducing Kafka cost is a priority. So, let's discuss our options for #3.1, #3.2, and #3.3.
If #3.1 were the driving factor of cost, we could either set up a centralized team that provides managed Kafka clusters to the entire company or consider using a cloud-hosted Kafka provider.
If #3.2 were the driving factor of cost, then we could do some analysis to determine whether 1) we can reduce the hardware cost itself or 2) we need to change our usage or rewrite some code to improve performance. Let's dive into how we might try to reduce the cost of the hardware itself. There are a few checks we could do:
1) Check whether we are using the right hardware. Are CPUs, disks, and network cards proportionally provisioned and of the right kinds (i.e., are all resources near fully utilized)? If not, rearrange things to improve cost efficiency (see the back-of-envelope sketch after this list).
2) Can we use cheaper, costlier, or differently configured components (e.g., mechanical HDDs instead of SSDs for mostly sequential workloads, or SSDs if there are too many consumers reading randomly)?
3) Can we negotiate better pricing with our vendor, given the scale of our hardware purchases for Kafka?
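To make check #1 concrete, here is a back-of-envelope sketch with deliberately made-up numbers. Suppose each broker has a 10 Gbit/s NIC but disks that sustain about 200 MB/s of sequential writes, and every produced byte is written to disk:

    network ceiling: 10 Gbit/s ~ 1.2 GB/s
    disk ceiling:    ~0.2 GB/s sequential writes
    -> disks saturate at roughly 1/6th of network capacity

On such a machine the NIC can never be fully used, so we are paying for network capacity we cannot consume; adding disks (or buying cheaper NICs) would rebalance the box. A real calculation would also account for replication traffic and consumer reads served from the page cache.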
For the sake of argument, let's assume we find that cost reduction at the hardware level alone is not enough. So, let's consider what we can do at the software layer to reduce cost. Some things we could try include:
1) Tune Kafka application-level settings (broker and client configs; see the sketch after this list)
2) Tune OS-level settings to improve Kafka performance (TCP buffers, I/O buffers, file descriptor limits for the Kafka process, disk caching config changes, etc.)
3) Co-locate consumers with Kafka brokers (e.g., if network cards are the bottleneck because too much data is being transferred)
4) Remove functionality from Kafka to improve performance (e.g., recent Kafka allows us to delete messages, which probably costs RAM; if we don't need that, we could strip it out)
5) Rewrite portions of Kafka's code with better libraries (e.g., a client that splits data using erasure coding to reduce disk consumption if that matters, a better tree implementation, or a different caching strategy)
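To make option #1 concrete, here is a minimal client-side tuning sketch in Java. The broker address, topic name, and specific values are hypothetical placeholders; the right numbers depend entirely on the workload, so treat these as knobs to experiment with, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Batch more records per request: fewer requests mean less CPU and network overhead.
            props.put("batch.size", "65536"); // bytes per partition batch (default is 16384)
            props.put("linger.ms", "10");     // wait up to 10 ms for a batch to fill
            // Compress batches: trades producer CPU for network and disk savings.
            props.put("compression.type", "lz4");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value")); // hypothetical topic
            }
        }
    }

On the consumer side, fetch.min.bytes and fetch.max.wait.ms play a similar batching role, and the broker has its own set of knobs (e.g., num.io.threads and the log flush settings).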
For the sake of further argument, let's assume that software-level cost reduction is not sufficient. At this point, we should perhaps examine our top use cases to see if Kafka is the right solution for them. If not, then perhaps we could design and implement a solution for our use cases from scratch.
Step 3: Hypothesize, validate, deploy, repeat:
Now that we have a high-level idea of the various things we could do to reduce costs, we need to prioritize among these approaches. For a selected hypothesis, we do back-of-envelope calculations for basic validation, make the change, measure performance before and after the change, and then decide whether it's worth pushing to production. Once we find changes that are worth deploying, we work with stakeholders to prioritize their deployment. A crude measurement sketch follows below.
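As an illustration of the measure step, here is a rough throughput check in Java. The broker address, topic, record count, and record size are made-up placeholders, and a loop like this ignores warm-up and variance; Kafka's bundled kafka-producer-perf-test.sh tool does this job more rigorously:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ThroughputCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            // Re-run this with each candidate config change and compare the numbers.

            int numRecords = 1_000_000;
            byte[] payload = new byte[1024]; // 1 KB records

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                long start = System.nanoTime();
                for (int i = 0; i < numRecords; i++) {
                    producer.send(new ProducerRecord<>("perf-test", payload)); // hypothetical topic
                }
                producer.flush(); // block until all queued sends complete
                double seconds = (System.nanoTime() - start) / 1e9;
                double mbPerSec = numRecords * (payload.length / (1024.0 * 1024.0)) / seconds;
                System.out.printf("%.0f records/s, %.1f MB/s%n", numRecords / seconds, mbPerSec);
            }
        }
    }

Broker-side metrics (request rates, disk and network utilization) should be watched during the same run, since a client-side win that merely shifts load onto the brokers isn't a win.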
An interesting question that we didn't dive into here is what our top use cases are and whether Kafka is really a good fit for them. I think there is a lot of interesting meat to that question, and perhaps I will share my thoughts on it in a separate post.
Thanks
Umesh