Improving Kafka Performance - A thought

Kafka is heavily used at many companies today. So, I sat down and wondered how I would approach it if I were tasked with reducing our cost of ownership for Kafka. Here is what I came up with:

Step 1: Understand the situation:

Let's start by asking some basic questions to understand our context. For example:

1) How much is our Kafka ownership cost, and how fast is it growing?

2) Is that really our biggest priority right now?

3) What does cost mean? For example:

 3.1) Maintenance cost - installs, reinstalls, lost data, upgrades, access control, hardware/OS maintenance

 3.2) Actual hardware cost

 3.3) Engineer usage cost - everyone having to learn the Kafka API, learn and tune configs, etc.

4) Have we done a basic check that Kafka is suitable for our use cases, and that our use cases for Kafka are reasonable? Are we misusing Kafka?
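Questions 1 and 3 can be made concrete with a small tally of the cost categories. The sketch below is only an illustration: every figure is a made-up placeholder, and the category names are just labels I picked for 3.1 to 3.3.

```python
# Hypothetical annual cost figures (USD) for the categories in 3.1 to 3.3.
# All numbers are placeholders for illustration, not real data.
last_year = {"maintenance": 200_000, "hardware": 500_000, "engineer_time": 100_000}
this_year = {"maintenance": 220_000, "hardware": 700_000, "engineer_time": 120_000}


def total_cost(costs):
    """Sum all cost categories for one year."""
    return sum(costs.values())


def growth_rate(previous, current):
    """Year-over-year growth as a fraction (0.25 means 25%)."""
    return (current - previous) / previous


def dominant_category(costs):
    """Return the category with the largest cost."""
    return max(costs, key=costs.get)


prev_total = total_cost(last_year)
curr_total = total_cost(this_year)
print("total this year:", curr_total)
print("growth:", round(growth_rate(prev_total, curr_total), 3))
print("driving factor:", dominant_category(this_year))
```

With these placeholder numbers, hardware would be the driving factor, which is the case that the rest of this post explores in the most detail.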


Step 2: List out options based on our understanding of the situation:

For the sake of argument, let's assume that #1 and #2 show that reducing Kafka cost is a priority. So, let's discuss what the options are for #3.1, #3.2 and #3.3.

If #3.1 were the driving factor of cost, we could either set up a centralized team that provides managed Kafka clusters to the entire company, or consider using a cloud-hosted Kafka provider.

If #3.2 were the driving factor of cost, then we could do some analysis to determine whether 1) we can reduce the hardware cost itself or 2) we need to change our usage or rewrite some code to improve performance. Let's dive into how we might try to reduce the cost of the hardware itself. There are a few checks we could do:

1) Check whether we are using the right hardware. Are CPUs, disks and network cards proportioned to each other and of the right kinds (i.e. are all resources near full utilization)? If not, rearrange things to improve cost efficiency.

2) Can we use cheaper, costlier or differently configured components (e.g. mechanical HDDs instead of SSDs, or SSDs if there are too many consumers reading randomly)?

3) Can we negotiate better pricing with the vendor, given the scale of our hardware purchases for Kafka?
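The first check above can be sketched as a quick comparison of peak utilization across resources. This is a minimal sketch under assumptions: the utilization numbers are invented, and the 50% threshold for "over-provisioned" is an arbitrary choice that you would adjust for your own headroom needs.

```python
# Hypothetical peak utilization per broker resource, as a fraction of capacity.
# These numbers are invented; real ones would come from your monitoring system.
peak_utilization = {
    "cpu": 0.25,
    "disk_io": 0.40,
    "disk_space": 0.85,
    "network": 0.30,
}


def bottleneck(utilization):
    """Return the resource closest to its capacity."""
    return max(utilization, key=utilization.get)


def overprovisioned(utilization, threshold=0.5):
    """Resources whose peak use is below the threshold (candidates to downsize)."""
    return sorted(name for name, used in utilization.items() if used < threshold)


print("bottleneck:", bottleneck(peak_utilization))
print("over-provisioned:", overprovisioned(peak_utilization))
```

In this made-up case, disk space is the constraint while CPU, disk IO and network sit mostly idle, which would suggest machines with more storage and less compute.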

For the sake of argument, let's assume we find that cost reduction at the hardware level alone is not enough. So, let's consider what we can do at the software layer to reduce cost. Some of the things we could try include:

1) Tune Kafka application-level settings (Kafka server, Kafka client)

2) Tune OS-level settings to improve Kafka performance (TCP buffers, IO buffers, Kafka limits, disk caching config changes, etc.)

3) Co-locate consumers with Kafka servers (e.g. if network cards are the bottleneck because too much data is being transferred)

4) Remove functionality from Kafka to improve performance (e.g. recent Kafka versions allow us to delete messages, which probably uses RAM)

5) Rewrite portions of the Kafka code with better libraries (e.g. a client that splits data using erasure coding to reduce disk consumption if that matters, a better tree implementation, or a better caching strategy)
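As an illustration of the first item in this list, here is a sketch of a handful of producer settings that are often tuned for throughput. The keys are standard Kafka producer config names, but the values are assumptions chosen for the example, not recommendations, and the `to_properties` helper is a hypothetical convenience for writing them out.

```python
# Producer settings that are commonly tuned for throughput.
# The keys are standard Kafka producer config names; the values are
# illustrative assumptions and should be validated by measurement.
producer_overrides = {
    "compression.type": "lz4",  # compress batches to cut network and disk use
    "batch.size": 65536,        # maximum bytes per partition batch
    "linger.ms": 20,            # wait up to 20 ms for a batch to fill
    "acks": "all",              # durability setting; also affects latency
}


def to_properties(config):
    """Render a config dict in .properties format, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


print(to_properties(producer_overrides))
```

Larger batches and compression usually trade a little latency for fewer, smaller requests, so whether they reduce cost depends on the workload.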

For the sake of further argument, let's assume that software cost reduction is not sufficient. At this point, we should perhaps review our top use cases to see if Kafka is the right solution for them. If not, then perhaps we could design and implement a solution for our use cases from scratch.

Step 3: Hypothesize, validate, deploy, repeat:

Now that we have a high-level idea of the various things we could do to reduce hardware costs, we need to prioritize among the approaches. For each selected hypothesis, we do back-of-envelope calculations as a basic validation, make the change, measure performance before and after the change, and then decide whether it's worth pushing to production. Once we find changes that are worth deploying, we work with stakeholders to prioritize their deployment.
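A back-of-envelope calculation for one such hypothesis ("enabling compression would halve our disk footprint") might look like the sketch below. The ingest rate, retention, replication factor and the 0.5 compression ratio are all assumed numbers; the ratio in particular is exactly the kind of guess that the before/after measurement is supposed to confirm or reject.

```python
def disk_needed_tb(ingest_mb_per_s, retention_days, replication_factor,
                   compression_ratio=1.0):
    """Estimate the disk used by retained data across the cluster, in decimal TB.

    compression_ratio is compressed size / original size (1.0 = no compression).
    Ignores indexes, headroom and uneven partition distribution.
    """
    seconds = retention_days * 24 * 60 * 60
    stored_mb = ingest_mb_per_s * seconds * replication_factor * compression_ratio
    return stored_mb / 1_000_000


# Assumed workload: 100 MB/s ingest, 7 days retention, replication factor 3.
baseline = disk_needed_tb(100, 7, 3)
with_compression = disk_needed_tb(100, 7, 3, compression_ratio=0.5)
savings = 1 - with_compression / baseline

print(f"baseline: {baseline:.2f} TB")
print(f"with compression: {with_compression:.2f} TB")
print(f"estimated savings: {savings:.0%}")
```

If the estimated savings look worthwhile on paper, the next step is to try the change on a test cluster and compare the measured numbers against this estimate.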


An interesting question that we didn't dive into here is what our top use cases are and whether Kafka is really a good fit for them. I think there is a lot of interesting meat to that question, and perhaps I will share my thoughts on it in a separate post.

Thanks

Umesh
