Improving Kafka Performance - A Thought
Kafka is heavily used at several companies today. So, I sat around wondering how I would approach it if I were tasked with reducing our cost of ownership for Kafka. Here is what I came up with:
Step 1: Understand the situation:
Let's start by asking some basic questions to understand our context. For example:
1) How much is our Kafka ownership cost, and how fast is it growing?
2) Is that really our biggest priority right now?
3) What does cost mean? For example:
3.1) Maintenance cost - installs, reinstalls, recovering lost data, upgrades, access control, hardware/OS maintenance
3.2) Actual hardware cost
3.3) Engineering usage cost - everyone having to learn the Kafka API, learn and tune configs, etc.
4) Have we done a basic check that our use cases for Kafka are reasonable and that Kafka is suitable for them? Are we misusing Kafka?
Step 2: List out options based on our understanding of the situation:
For the sake of argument, let's assume that #1 and #2 show that reducing Kafka cost is a priority. So, let's discuss our options for #3.1, #3.2, and #3.3.
If #3.1 were the driving factor of cost, we could either set up a centralized team that provides managed Kafka clusters to the entire company or consider using a cloud-hosted Kafka provider.
If #3.2 were the driving factor of cost, then we could do some analysis to determine whether 1) we can reduce the hardware cost itself or 2) we need to change our usage or rewrite some code to improve performance. Let's dive into how we might try to reduce the cost of the hardware itself. There are a few checks we could do:
1) Check whether we are using the right hardware. Are CPUs, disks, and network cards proportionally provisioned and of the right kinds (i.e., are all resources near fully utilized)? If not, rearrange things to improve cost efficiency (see the back-of-envelope sketch after this list).
2) Can we use cheaper, costlier, or differently configured components (e.g., mechanical HDDs instead of SSDs for mostly sequential workloads, or SSDs if there are too many consumers reading randomly)?
3) Can we negotiate better pricing with our vendor, given the scale of our hardware purchases for Kafka?
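To make check #1 concrete, here is a back-of-envelope sketch with deliberately made-up numbers. Suppose each broker has a 10 Gbit/s NIC but disks that sustain about 200 MB/s of sequential writes, and every produced byte is written to disk:

    network ceiling: 10 Gbit/s ~ 1.2 GB/s
    disk ceiling:    ~0.2 GB/s sequential writes
    -> disks saturate at roughly 1/6th of network capacity

On such a machine the NIC can never be fully used, so we are paying for network capacity we cannot consume; adding disks (or buying cheaper NICs) would rebalance the box. A real calculation would also account for replication traffic and consumer reads served from the page cache.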
For the sake of argument, let's assume we find that cost reduction at the hardware level alone is not enough. So, let's consider what we can do at the software layer to reduce cost. Some things we could try include:
1) Tune Kafka application-level settings (broker and client configs; see the sketch after this list)
2) Tune OS-level settings to improve Kafka performance (TCP buffers, I/O buffers, file descriptor limits for the Kafka process, disk caching config changes, etc.)
3) Co-locate consumers with Kafka brokers (e.g., if network cards are the bottleneck because too much data is being transferred)
4) Remove functionality from Kafka to improve performance (e.g., recent Kafka allows us to delete messages, which probably costs RAM; if we don't need that, we could strip it out)
5) Rewrite portions of Kafka's code with better libraries (e.g., a client that splits data using erasure coding to reduce disk consumption if that matters, a better tree implementation, or a different caching strategy)
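To make option #1 concrete, here is a minimal client-side tuning sketch in Java. The broker address, topic name, and specific values are hypothetical placeholders; the right numbers depend entirely on the workload, so treat these as knobs to experiment with, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Batch more records per request: fewer requests mean less CPU and network overhead.
            props.put("batch.size", "65536"); // bytes per partition batch (default is 16384)
            props.put("linger.ms", "10");     // wait up to 10 ms for a batch to fill
            // Compress batches: trades producer CPU for network and disk savings.
            props.put("compression.type", "lz4");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value")); // hypothetical topic
            }
        }
    }

On the consumer side, fetch.min.bytes and fetch.max.wait.ms play a similar batching role, and the broker has its own set of knobs (e.g., num.io.threads and the log flush settings).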
For the sake of further argument, let's assume that software-level cost reduction is not sufficient. At this point, we should perhaps examine our top use cases to see if Kafka is the right solution for them. If not, then perhaps we could design and implement a solution for our use cases from scratch.
Step 3: Hypothesize, validate, deploy, repeat:
Now that we have a high-level idea of the various things we could do to reduce costs, we need to prioritize among these approaches. For a selected hypothesis, we do back-of-envelope calculations for basic validation, make the change, measure performance before and after the change, and then decide whether it's worth pushing to production. Once we find changes that are worth deploying, we work with stakeholders to prioritize their deployment. A crude measurement sketch follows below.
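As an illustration of the measure step, here is a rough throughput check in Java. The broker address, topic, record count, and record size are made-up placeholders, and a loop like this ignores warm-up and variance; Kafka's bundled kafka-producer-perf-test.sh tool does this job more rigorously:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ThroughputCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            // Re-run this with each candidate config change and compare the numbers.

            int numRecords = 1_000_000;
            byte[] payload = new byte[1024]; // 1 KB records

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                long start = System.nanoTime();
                for (int i = 0; i < numRecords; i++) {
                    producer.send(new ProducerRecord<>("perf-test", payload)); // hypothetical topic
                }
                producer.flush(); // block until all queued sends complete
                double seconds = (System.nanoTime() - start) / 1e9;
                double mbPerSec = numRecords * (payload.length / (1024.0 * 1024.0)) / seconds;
                System.out.printf("%.0f records/s, %.1f MB/s%n", numRecords / seconds, mbPerSec);
            }
        }
    }

Broker-side metrics (request rates, disk and network utilization) should be watched during the same run, since a client-side win that merely shifts load onto the brokers isn't a win.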
An interesting question that we didn't dive into here is what our top use cases are and whether Kafka is really a good fit for them. I think there is a lot of interesting meat to that question, and perhaps I will share my thoughts on it in a separate post.
Thanks
Umesh