Understanding Message Queues
The Problem: Why Do Message Queues Exist?
Imagine you're building a photo-sharing app like Instagram. A user uploads a photo, and your server needs to do several things with it, resizing it into multiple sizes being just one of them. Each of these operations takes a couple of seconds.
The Naive (Synchronous) Approach
In the simplest architecture, the client uploads the photo, and the server does all of that work synchronously before returning a response. This works, kind of, but has three serious problems: the user stares at a loading spinner until every step finishes, the request ties up a server thread (and its memory) for the whole duration, and a failure in any single step fails the entire upload.
The Solution: A Message Queue
Instead of processing the photo immediately, the server writes a small message describing the work (e.g., "resize photo #123") onto a queue, then immediately returns a success response to the client.
On the other end of the queue, a pool of worker servers (consumers) pulls messages one at a time and processes them.
How this solves each problem: the user gets a response in milliseconds instead of seconds, the web server is immediately free to handle the next request, and if a worker crashes mid-task, the message can be handed to another worker (more on that below).
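Here is a minimal sketch of that flow, using Python's standard-library `queue` and `threading` as a stand-in for a real broker (the names `handle_upload` and `resize_photo` are illustrative):

```python
import queue
import threading

# Stand-in for a real broker (Kafka, RabbitMQ, SQS, ...).
photo_queue = queue.Queue()

def resize_photo(photo_id: int) -> None:
    print(f"resizing photo {photo_id}...")   # stand-in for seconds of real work

def handle_upload(photo_id: int) -> str:
    """Producer: runs inside the web server's request handler."""
    photo_queue.put({"photo_id": photo_id})  # enqueue a small message
    return "202 Accepted"                    # respond immediately; work happens later

def worker() -> None:
    """Consumer: one of a pool of worker threads/servers."""
    while True:
        message = photo_queue.get()          # blocks until a message is available
        resize_photo(message["photo_id"])    # the slow work happens here
        photo_queue.task_done()

# Start a pool of workers; producers can now enqueue freely.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

print(handle_upload(123))  # returns instantly, before the resize runs
photo_queue.join()         # demo only: wait for the backlog to drain
```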
What Exactly Is a Message Queue?
A message queue is a buffer that sits between a producer and a consumer.
How it works: the producer pushes a message onto the queue and moves on, the queue holds the message until someone is ready for it, and a consumer eventually pulls the message off and processes it at its own pace.
Decoupling
The producer and consumer don't know about each other. This means you can: scale each side independently, deploy or restart one side without touching the other, and even rewrite one side entirely, as long as the message format stays the same.
THE KITCHEN ANALOGY: Think of it like a restaurant kitchen: The waiter puts the order on the ticket rail and immediately goes to serve other tables — they don't stand there waiting for the cook. The ticket rail decouples the front of house from the back of house, exactly like a message queue decouples producers from consumers.
How It Works Under the Hood
Acknowledgements (ACKs)
The problem: A worker pulls a message and starts processing. Halfway through, it crashes. If the queue deleted the message the moment the worker grabbed it, that photo is lost forever.
The solution: When a consumer pulls a message, the queue doesn't delete it right away. The consumer must explicitly send an acknowledgement (ACK) back to the queue saying: "I'm done with this one. You can delete it now."
If a consumer crashes before sending the ACK, the queue assumes it wasn't processed and redelivers it to another consumer. Nothing is lost.
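A sketch of the resulting consumer loop, assuming a hypothetical `queue_client` with `receive`/`ack`/`nack` methods (real APIs differ: SQS calls the ACK `delete_message`, RabbitMQ calls it `basic_ack`):

```python
def consume_loop(queue_client, process):
    """Generic consumer loop; queue_client is a hypothetical broker client."""
    while True:
        message = queue_client.receive()  # message stays in the queue, just hidden
        try:
            process(message.body)         # do the actual work
            queue_client.ack(message)     # only NOW is the message deleted
        except Exception:
            # No ACK was sent, so the broker will redeliver this message to
            # this or another consumer. Nothing is lost.
            queue_client.nack(message)    # optionally ask for faster redelivery
```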
Preventing Duplicate Processing
While Worker A is processing a message (hasn't ACK'd yet), the message is technically still in the queue. What stops Worker B from grabbing it too?
Different systems handle this differently: SQS hides an in-flight message from other consumers for a visibility timeout, RabbitMQ delivers each message to only one consumer and tracks it as unacknowledged until the ACK arrives, and Kafka sidesteps the issue by assigning each partition to exactly one consumer within a consumer group.
The concept is always the same: every queuing system needs a way to ensure a message is only being actively processed by one consumer at a time.
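To make the rule concrete, here is a toy in-memory queue that mimics SQS-style visibility timeouts; it is purely illustrative, not how any real broker is implemented:

```python
import time
import uuid

class ToyQueue:
    """In-memory queue with an SQS-style visibility timeout (illustrative only)."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # msg_id -> [body, invisible_until]

    def send(self, body) -> None:
        self.messages[str(uuid.uuid4())] = [body, 0.0]

    def receive(self):
        now = time.monotonic()
        for msg_id, entry in self.messages.items():
            body, invisible_until = entry
            if invisible_until <= now:  # visible: nobody else is working on it
                entry[1] = now + self.visibility_timeout  # hide from other consumers
                return msg_id, body
        return None  # every message is currently being worked on (or queue is empty)

    def ack(self, msg_id) -> None:
        self.messages.pop(msg_id, None)  # done for good: delete permanently

# If a consumer receives a message and crashes before calling ack(), the
# timeout expires and receive() hands the same message to another consumer.
```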
Delivery Guarantees
Even with ACKs, there's a tricky edge case: What if a worker processes a message successfully but crashes right before sending the ACK? The queue thinks it was never processed, so it redelivers it — and the same work happens twice.
For photo resizing, this is harmless. But for "Charge someone $50", a duplicate means they get charged $100.
There are three delivery guarantees:
1. At-Least-Once Delivery (Most Common)
Every message is delivered at least one time, but it might be delivered more than once.
Implication: Consumers must be idempotent — processing the same message twice produces the exact same result.
IDEMPOTENT (safe):
"Set user 123's profile photo to photo_5"
→ Running twice: photo is still photo_5.
NOT IDEMPOTENT (dangerous):
"Increment user 123's post count by 1"
→ Running twice: count goes up by 2.
FIX: Rephrase as idempotent operation:
"Set user 123's post count to 54"
→ Running twice: count is still 54.
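The same distinction in code, assuming a DB-API-style cursor (`cur`) and a Postgres-style `users` table (both illustrative):

```python
# NOT idempotent: every redelivery adds 1 again.
def on_new_post_bad(cur, user_id):
    cur.execute(
        "UPDATE users SET post_count = post_count + 1 WHERE id = %s",
        (user_id,),
    )

# Idempotent: the message carries the absolute value, so a redelivery
# sets the count to the same number it already has.
def on_new_post_good(cur, user_id, new_count):
    cur.execute(
        "UPDATE users SET post_count = %s WHERE id = %s",
        (new_count, user_id),
    )
```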
2. At-Most-Once Delivery (Fire and Forget)
The message is deleted immediately when a consumer takes it. If something goes wrong, at most one consumer processed it — at worst, nobody did.
Use case: Analytics events or metrics where losing a few data points is acceptable.
3. Exactly-Once Delivery (The Holy Grail)
Every message is processed exactly one time. True exactly-once is extremely hard to achieve in distributed systems. Kafka supports a form of it for specific patterns within its own ecosystem, but it comes with real trade-offs and limitations.
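In practice, most systems approximate exactly-once by combining at-least-once delivery with idempotent consumers, commonly via a deduplication table keyed by message ID. A sketch, assuming a psycopg2-style connection and a `processed_messages` table (both illustrative):

```python
def process_once(conn, message):
    """At-least-once delivery + dedup table = effectively exactly-once processing."""
    with conn:  # one transaction: the dedup check and the side effects commit together
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO processed_messages (message_id) VALUES (%s) "
            "ON CONFLICT (message_id) DO NOTHING",
            (message.id,),
        )
        if cur.rowcount == 0:
            return  # duplicate delivery: this ID was already processed, skip it
        apply_side_effects(cur, message.body)  # hypothetical: e.g., charge the $50 once
```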
When to Use a Message Queue?
Four signals to look for: the work takes longer than a user should have to wait for a response, traffic arrives in bursts that consumers can't absorb in real time, you want two services decoupled so they can scale and fail independently, and you need durability and retries when individual steps fail.
Warning: Don't Queue Synchronous Workloads
If you have strong latency requirements (e.g., sub-500ms response times), adding a queue nearly guarantees you'll break that constraint. Queues add complexity around getting results back to the client, and inherently introduce delay.
Scenarios & Workarounds
1. Increased Throughput: Scaling — Partitions & Consumer Groups
A single queue can only handle so much. To scale, you partition — split the queue into multiple independent sub-queues.
Choosing the Partition Key
The partition key determines which message goes to which partition (much like a shard key determines which shard a row lives on). It matters for two reasons:
Ordering: Messages with the same partition key always go to the same partition, and within a partition, order is guaranteed. Strict ordering really matters in some scenarios (e.g., processing a user's bank transactions in sequence).
Even Distribution: You want partition keys that spread work evenly. For example, in a ride-sharing app like Pathao:
BAD key: city
Partition "Dhaka" → slammed (hot partition 🔥)
Partition "Rajshahi" → idle
GOOD key: ride_id
Evenly distributed across all partitions
Trade-off: The key that gives you ordering might not be the key that gives you the best distribution. Choosing the right partition key is worth careful thought around both factors.
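Under the hood, choosing a partition is usually just hashing the key; a simplified sketch of what Kafka-style producers do (Kafka's default partitioner actually uses murmur2, but the principle is identical):

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """Messages with the same key always land on the same partition,
    which is exactly what preserves per-key ordering."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Same key, same partition, every time:
assert partition_for("ride_42") == partition_for("ride_42")

# A low-cardinality key like a city name funnels most traffic into a few
# partitions; a high-cardinality key like ride_id spreads it evenly.
print(partition_for("Dhaka"), partition_for("ride_42"), partition_for("ride_43"))
```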
2. Back Pressure: When Producers Outpace Consumers
If producers create messages faster than consumers can process them, the queue grows indefinitely. A queue doesn't solve a capacity problem — it just delays it.
Producers: 300 msg/sec
Consumers: 200 msg/sec
─────────────────────────
Queue growth: +100 msg/sec → eventually runs out of memory
Three ways to handle it: scale out by adding more consumers, apply back pressure by slowing or rejecting producers at the source, or shed load by dropping (or expiring) the lowest-value messages.
A queue is a buffer, not a solution to insufficient capacity.
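One concrete way to apply back pressure is to make the queue bounded, so producers must block, retry, or shed load instead of growing the backlog forever. A stdlib sketch:

```python
import queue

# A bounded queue: once it's full, producers must block, retry, or drop.
work_queue = queue.Queue(maxsize=10_000)

def produce(message) -> bool:
    try:
        work_queue.put_nowait(message)  # succeeds only if there's room
        return True
    except queue.Full:
        # Back pressure: reject upstream (e.g., return HTTP 429) or shed load,
        # rather than letting the backlog grow without bound.
        return False
```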
3. Failed Messages & Dead Letter Queues (DLQ)
What if a message always fails? (e.g., a corrupted photo that will never process successfully.) This is called a poison message — it crashes the consumer every time and can never recover.
Without guardrails, it retries forever, blocking everything behind it.
Solution: Max Retry Count + Dead Letter Queue. Track how many times each message has been attempted; once it exceeds a maximum (say, 3), move it to a separate "dead letter" queue where a human can inspect and fix it, instead of letting it block everything behind it.
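A sketch of the pattern, again with a hypothetical client; the `attempts` counter is an assumption (SQS exposes it as `ApproximateReceiveCount`; in RabbitMQ or Kafka setups you would typically carry it in a message header):

```python
MAX_RETRIES = 3

def consume_with_dlq(queue_client, dlq_client, process):
    """queue_client/dlq_client are hypothetical; `attempts` is a delivery count."""
    while True:
        message = queue_client.receive()
        try:
            process(message.body)
            queue_client.ack(message)
        except Exception:
            if message.attempts >= MAX_RETRIES:
                # Poison message: park it in the DLQ for a human to inspect,
                # instead of letting it block everything behind it.
                dlq_client.send(message.body)
                queue_client.ack(message)   # remove it from the main queue
            else:
                queue_client.nack(message)  # redeliver and try again
```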
Mentioning DLQs proactively shows seniority and understanding of real failure scenarios.
4. Durability & Fault Tolerance: What If the Queue Goes Down?
Modern message queues like Kafka persist messages to disk and replicate them across multiple brokers (servers).
Replay scenario: Consumers go down for an hour. The Kafka queue backs up — no big deal. When consumers come back, they process the backlog. Even more powerful: if a consumer had a bug and processed things incorrectly, you can deploy a new consumer and tell it to reprocess from an hour ago, even though those messages were already consumed.
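With the kafka-python client, rewinding a consumer by an hour looks roughly like this (a sketch; the topic, group, and broker address are illustrative):

```python
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="photo-workers",
    enable_auto_commit=False,
)
partition = TopicPartition("photo-uploads", 0)
consumer.assign([partition])

# Find the offset of the first message from one hour ago, then seek to it.
one_hour_ago_ms = int((time.time() - 3600) * 1000)
offsets = consumer.offsets_for_times({partition: one_hour_ago_ms})
consumer.seek(partition, offsets[partition].offset)  # production code: check for None

# The consumer now re-reads the last hour of messages, even though they were
# already consumed once. Kafka can do this because it persists messages to
# disk instead of deleting them on consumption.
for record in consumer:
    print(record.offset, record.value)
```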
Common Message Queue Technologies
You don't need to know all of these, but you should be comfortable talking about at least one. If you don't have a default, choose Kafka.
Summary & Key Takeaways
A message queue is a buffer that decouples producers from consumers, so each side can scale, deploy, and fail independently. ACKs plus redelivery give you at-least-once delivery, which means your consumers must be idempotent. Partition with a key that balances ordering against even load distribution, watch for back pressure (a queue buffers bursts, it doesn't create capacity), and use max retry counts with dead letter queues so poison messages can't block everything behind them.
If you enjoyed this, check out my other guides.