Saga Design Pattern in Microservices (With Code & Interview Questions)

Imagine you’re booking a holiday package online. You pick a flight, reserve a hotel, and rent a car — all in one go.

Each of these is handled by a different service (Flight Service, Hotel Service, Car Rental Service), with its own database.

Now, what happens if:

  • Your flight gets booked successfully.
  • The hotel reservation fails.
  • And the car rental never even gets called.

Do you still want the flight booked while everything else failed? Of course not. You’d expect the whole transaction to roll back gracefully.

But here’s the problem: in microservices, there’s no single database transaction that magically undoes everything. Each service is independent.

That’s where the Saga Design Pattern steps in.

Think of it like a chain of promises with backup plans. Each step in the process has a “happy path” (book the hotel) and a “sorry, let’s undo it” path (cancel the flight if the hotel fails).

The Saga makes sure the entire journey either succeeds as a whole or gracefully compensates when things fall apart.

What Is the Saga Pattern:

A Saga is a sequence of local transactions. Each transaction updates a service’s database and then triggers the next step via an event or a command.

If one transaction fails, instead of rolling back everything, the Saga executes a series of compensating transactions to undo the previous work.

Simple Example: Travel Booking

  • Step 1: Reserve a flight.
  • Step 2: Book a hotel.
  • Step 3: Rent a car.

If Step 2 fails (hotel booking), you don’t roll back the flight in the database directly. Instead, you compensate by cancelling the flight reservation.

import java.util.*;

// Represent one step in the Saga
class SagaStep {
    Runnable action;
    Runnable compensation;

    SagaStep(Runnable action, Runnable compensation) {
        this.action = action;
        this.compensation = compensation;
    }
}

// Saga orchestrator
class Saga {
    List<SagaStep> steps = new ArrayList<>();

    void addStep(SagaStep step) {
        steps.add(step);
    }

    void execute() {
        // Completed steps, most recent first, so they can be undone in reverse order
        Deque<SagaStep> completed = new ArrayDeque<>();
        try {
            for (SagaStep step : steps) {
                step.action.run();
                completed.push(step);
            }
            System.out.println("Saga completed successfully!");
        } catch (Exception e) {
            System.out.println("Saga failed: " + e.getMessage());
            System.out.println("Running compensations...");

            while (!completed.isEmpty()) {
                completed.pop().compensation.run();
            }
        }
    }
}

// Example usage
public class TravelBookingSaga {
    public static void main(String[] args) {
        Saga saga = new Saga();

        // Step 1: Reserve Flight
        saga.addStep(new SagaStep(
            () -> {
                System.out.println("Flight reserved");
            },
            () -> {
                System.out.println("Cancel flight reservation");
            }
        ));

        // Step 2: Book Hotel
        saga.addStep(new SagaStep(
            () -> {
                // Simulate a failure before the booking completes
                throw new RuntimeException("Hotel booking failed!");
            },
            () -> {
                System.out.println("Cancel hotel booking");
            }
        ));

        // Step 3: Rent Car
        saga.addStep(new SagaStep(
            () -> {
                System.out.println("Car rented");
            },
            () -> {
                System.out.println("Cancel car rental");
            }
        ));

        saga.execute();
    }
}

Sample Output (when hotel booking fails):

Flight reserved
Saga failed: Hotel booking failed!
Running compensations...
Cancel flight reservation

The hotel step never completed, so only the flight reservation is compensated, and the car rental is never attempted.

Why Do We Need Sagas:

In monolithic systems, transactions are straightforward:

  • A single database.
  • ACID transactions.
  • Rollback happens automatically.

But microservices don’t play by those rules:

  • Each service owns its database.
  • No global transaction manager.
  • ACID gives way to BASE (Basically Available, Soft State, Eventually Consistent).

Without something like Saga, you’re stuck with two bad choices:

  1. Distributed 2PC (Two-Phase Commit): Strong consistency, but slow and complex. Participants hold locks until the coordinator decides, and the coordinator is a single point of failure.
  2. Hope and Pray: Fire calls sequentially and hope nothing breaks. Spoiler: it breaks.

Saga is the middle ground — it gives you eventual consistency with reliable compensation mechanisms.

Saga Execution Styles:

There are two main ways to coordinate Sagas:

1. Choreography (Event-Driven):

Think of choreography like a flash mob dance:

  • Each dancer (service) knows their steps.
  • They just react when the music changes or when someone else performs a move.
  • There’s no “director” on stage telling them what to do.

How it works:

  • Each service publishes an event after completing its part.
  • Other services subscribed to that event react accordingly.
  • The flow of the saga is driven entirely by these events.

Example: Order Processing

// Order Service
eventBus.publish("OrderCreated", orderId);

// Payment Service: publishes either success or failure, never both
eventBus.on("OrderCreated", (orderId) -> {
    System.out.println("Processing payment for " + orderId);
    if (processPayment(orderId)) {
        eventBus.publish("PaymentProcessed", orderId);
    } else {
        eventBus.publish("PaymentFailed", orderId);
    }
});

// Inventory Service
eventBus.on("PaymentProcessed", (orderId) -> {
    System.out.println("Reserving stock for " + orderId);
    eventBus.publish("StockReserved", orderId);
});

// Shipping Service
eventBus.on("StockReserved", (orderId) -> {
    System.out.println("Shipping order " + orderId);
    eventBus.publish("OrderShipped", orderId);
});

// Order Service: compensation when payment fails
eventBus.on("PaymentFailed", (orderId) -> {
    System.out.println("Cancelling order " + orderId);
});

In real systems, this event bus could be Kafka, RabbitMQ, or any messaging system.
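
The snippets above assume an eventBus object with publish and on methods. As a minimal sketch, assuming a synchronous in-memory bus (the InMemoryEventBus and ChoreographyDemo names are illustrative and do not come from any real library), the happy path could be wired up like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory event bus; a real system would use a message broker.
class InMemoryEventBus {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();

    // Register a handler for an event type.
    void on(String eventType, Consumer<String> handler) {
        handlers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    // Deliver the payload synchronously to every handler registered for the event type.
    void publish(String eventType, String payload) {
        List<Consumer<String>> registered = handlers.getOrDefault(eventType, List.of());
        // Iterate over a copy so handlers may register further handlers safely.
        for (Consumer<String> handler : new ArrayList<>(registered)) {
            handler.accept(payload);
        }
    }
}

class ChoreographyDemo {
    // Wires the happy path and returns the event types in the order they were delivered.
    static List<String> run(String orderId) {
        InMemoryEventBus eventBus = new InMemoryEventBus();
        List<String> trace = new ArrayList<>();
        for (String type : List.of("OrderCreated", "PaymentProcessed", "StockReserved")) {
            eventBus.on(type, id -> trace.add(type));
        }
        // Each "service" reacts to one event and publishes the next.
        eventBus.on("OrderCreated", id -> eventBus.publish("PaymentProcessed", id));
        eventBus.on("PaymentProcessed", id -> eventBus.publish("StockReserved", id));
        eventBus.publish("OrderCreated", orderId);
        return trace;
    }

    public static void main(String[] args) {
        // prints [OrderCreated, PaymentProcessed, StockReserved]
        System.out.println(run("order-42"));
    }
}
```

No class here knows the whole flow; the sequence only emerges from the subscriptions, which is exactly why tracing gets harder as services are added.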

Pros:

  • No central dependency → easy to get started.
  • Very natural if your system is already event-driven.

Cons:

  • As more services get added, things get messy (“event spaghetti”).
  • Harder to trace the overall flow → debugging becomes painful.

Best For: Small workflows, where steps are simple and services are loosely coupled.

2. Orchestration (Centralized):

Orchestration is more like a conductor leading an orchestra:

  • The conductor (orchestrator service) decides when each instrument (service) plays.
  • Every musician (service) just follows instructions.

How it works:

  • A central Saga orchestrator coordinates the entire flow.
  • It tells each service what to do, step by step.
  • Services only need to execute their assigned tasks.

Example: Order Processing

import java.util.ArrayDeque;
import java.util.Deque;

class OrderOrchestrator {
    PaymentService paymentService;
    InventoryService inventoryService;
    ShippingService shippingService;

    public void processOrder(String orderId) {
        // Undo actions for the steps completed so far (last in, first out)
        Deque<Runnable> compensations = new ArrayDeque<>();
        try {
            System.out.println("Paying for " + orderId);
            paymentService.pay(orderId);
            compensations.push(() -> paymentService.refund(orderId));

            System.out.println("Reserving stock for " + orderId);
            inventoryService.reserve(orderId);
            compensations.push(() -> inventoryService.release(orderId));

            System.out.println("Shipping " + orderId);
            shippingService.ship(orderId);

            System.out.println("Order completed successfully!");
        } catch (Exception e) {
            System.out.println("Failure: " + e.getMessage());
            System.out.println("Running compensations...");

            // Compensate only the steps that actually completed, in reverse order
            while (!compensations.isEmpty()) {
                compensations.pop().run();
            }
        }
    }
}

Pros:

  • Much easier to manage complex workflows with multiple conditions.
  • Central visibility: you can see exactly where in the flow things are.

Cons:

  • Orchestrator becomes a single point of control.
  • If overused, it can make the system feel less “micro” and more monolithic.

Best For: Large, complex workflows that need tight coordination.

Key Design Considerations

When designing Sagas, think about:

1. Idempotency: Services should handle duplicate messages safely.

// Idempotency example (safe retry)
@PostMapping("/reserve")
public ResponseEntity<String> reserve(@RequestBody Request req) {
    // A repeated request for the same order is acknowledged without reserving again
    if (alreadyReserved(req.orderId)) {
        return ResponseEntity.ok("Already processed");
    }
    // normal reservation logic goes here
    return ResponseEntity.ok("Reserved");
}

2. Compensation Logic: Define how to undo actions. Sometimes compensation isn’t possible — in those cases, design for manual intervention.

3. Failure Handling: What happens if compensation itself fails?

4. Monitoring & Visibility: Sagas can fail silently without good logging. You’ll need observability tools.

5. Timeouts & Retries: What if a service is slow but not dead? Balance between retries and compensation.
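
One way to balance retries against compensation is to retry a step a bounded number of times with growing delays, and only then let the failure reach the saga. The sketch below is illustrative: RetryPolicy, callWithRetry, and the delay values are assumptions, not part of any framework.

```java
import java.util.function.Supplier;

class RetryPolicy {
    // Runs the step up to maxAttempts times, doubling the delay after each failure.
    // Rethrows the last failure so the caller can start compensation.
    static <T> T callWithRetry(Supplier<T> step, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return step.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
        throw lastFailure;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted during backoff", e);
        }
    }
}

class RetryDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated step: fails twice, then succeeds.
        String result = RetryPolicy.callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("inventory service busy");
            }
            return "reserved";
        }, 5, 10);
        // prints "reserved after 3 attempts"
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Because a retried step may run more than once, this only works safely when the step itself is idempotent.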

Real-World Example: Order Processing System

Let’s say you’re building an e-commerce checkout system.

  1. Order Service: Create order.
  2. Payment Service: Process payment.
  3. Inventory Service: Reserve stock.
  4. Shipping Service: Schedule shipment.

  • If payment fails → cancel the order.
  • If inventory fails → refund the payment and cancel the order.
  • If shipping fails → release the inventory, refund the payment, and cancel the order.

This is a Saga in action — each step has both a “do” and “undo” path.

Benefits:

  • Works naturally with microservices.
  • Avoids global locks or distributed 2PC.
  • Provides eventual consistency.
  • Scales better than traditional transactions.

Challenges and Trade-Offs:

  • Compensation logic is tricky and business-specific.
  • Debugging distributed workflows is painful.
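  • Choreography can create hidden coupling between services through shared event contracts if not carefully managed.
  • Not suitable where strong consistency is required (like financial ledgers), because other requests can observe intermediate state before the saga completes or compensates.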

Use Saga when:

  • You have long-running business processes across multiple services.
  • Eventual consistency is acceptable.
  • You want to avoid the complexity of 2PC.

Don’t use Saga when:

  • You need strict atomicity across services.
  • Compensating actions aren’t possible.
  • Latency requirements are extremely tight.

Interview Questions:

Below are some interview questions on the Saga design pattern:

Q1. In the Saga Pattern, how do you ensure idempotency of compensating transactions in a distributed system where retries are common?

  • Compensating transactions must be idempotent, meaning multiple executions have the same effect as one.
  • Achieved by:

  1. Unique transaction IDs / correlation IDs → logged in a dedicated saga log table.
  2. Versioning or optimistic locking → ensuring state is not rolled back multiple times.
  3. State checks under at-least-once delivery → retries won’t corrupt state if the compensator first checks whether the rollback already happened.

  • Example: Cancelling a hotel booking should check whether the booking is still active. If already cancelled, it simply acknowledges without changing state again.
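
The hotel example above can be sketched as follows. This is a minimal in-memory illustration: the HotelBookings class and its method names are hypothetical, and a real service would keep this state in its own database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HotelBookings {
    enum Status { ACTIVE, CANCELLED }

    private final Map<String, Status> bookings = new ConcurrentHashMap<>();

    void book(String bookingId) {
        bookings.put(bookingId, Status.ACTIVE);
    }

    // Compensation: returns true only for the call that actually performed the cancellation.
    // Repeated or unknown requests return false and leave the state untouched.
    boolean cancel(String bookingId) {
        // Atomic transition: succeeds only if the booking is currently ACTIVE.
        return bookings.replace(bookingId, Status.ACTIVE, Status.CANCELLED);
    }

    Status statusOf(String bookingId) {
        return bookings.get(bookingId);
    }
}

class IdempotentCancelDemo {
    public static void main(String[] args) {
        HotelBookings hotel = new HotelBookings();
        hotel.book("b-1");
        System.out.println(hotel.cancel("b-1")); // first delivery: true
        System.out.println(hotel.cancel("b-1")); // redelivered message: false, no change
        System.out.println(hotel.statusOf("b-1")); // CANCELLED
    }
}
```

Either return value can be acknowledged to the sender as success; the boolean only tells the service whether this particular call did the work.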

Q2. How would you handle the “double compensation problem” where two concurrent compensations might try to undo the same step in a Saga?

This happens when concurrent failures trigger rollback from multiple branches.

Solutions:

  1. Centralized Saga Orchestrator → ensures only one rollback path is executed.
  2. State machine enforcement → once a transaction is compensated, state is marked COMPENSATED and subsequent attempts are ignored.
  3. Idempotent compensators → even if double-invoked, they don’t cause side effects.

  • Advanced approach: Use compare-and-swap on transaction states in storage to prevent race conditions.
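
A compare-and-swap guard of this kind can be sketched with an AtomicReference standing in for the stored step state. The StepState class and its state names are assumptions for illustration; in a real system the same check would typically be a conditional update in the saga's data store.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class StepState {
    enum State { COMPLETED, COMPENSATING, COMPENSATED }

    private final AtomicReference<State> state = new AtomicReference<>(State.COMPLETED);

    // Only the caller that wins the COMPLETED -> COMPENSATING transition runs the undo action.
    // Note: if undo throws, the state stays COMPENSATING and needs separate recovery.
    boolean compensateOnce(Runnable undo) {
        if (!state.compareAndSet(State.COMPLETED, State.COMPENSATING)) {
            return false; // another caller is compensating or has already finished
        }
        undo.run();
        state.set(State.COMPENSATED);
        return true;
    }

    State current() {
        return state.get();
    }
}

class DoubleCompensationDemo {
    public static void main(String[] args) throws InterruptedException {
        StepState paymentStep = new StepState();
        AtomicInteger refunds = new AtomicInteger();

        // Two failure paths race to compensate the same step.
        Thread a = new Thread(() -> paymentStep.compensateOnce(refunds::incrementAndGet));
        Thread b = new Thread(() -> paymentStep.compensateOnce(refunds::incrementAndGet));
        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println("Refunds issued: " + refunds.get()); // prints "Refunds issued: 1"
    }
}
```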

Q3. Compare Saga Pattern with Two-Phase Commit (2PC). Why would Saga be preferred in high-scale microservice architectures?

2PC:

  • Strong consistency.
  • High latency due to distributed locks.
  • Coordinator is a single point of failure.

Saga:

  • Eventual consistency, but scalable.
  • No global locks; each step commits independently.
  • Compensations replace rollbacks.

Why Saga wins in microservices:

  • No distributed locks → higher throughput.
  • Works well in cloud-native systems with unreliable networks.
  • 2PC is infeasible with NoSQL stores that don’t support XA transactions.

Q4. In Saga orchestration, how do you prevent the “Orchestrator Bottleneck” problem when it becomes a single point of failure or performance choke?

Techniques:

  1. Stateless orchestrator + distributed saga log → orchestrator can fail and resume using the log.
  2. Sharding by Saga ID → multiple orchestrators manage different sagas.
  3. Event-driven choreography instead of orchestration for high-scale cases.
  4. Leader election (e.g., via ZooKeeper/Raft) ensures continuity if one orchestrator dies.
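
The first technique can be sketched as an orchestrator that keeps no state of its own and reads its position from a saga log. Everything here is hypothetical: SagaLog is a HashMap standing in for a durable table, and ResumableOrchestrator is an illustrative name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical saga log; in production this would be durable storage, not a HashMap.
class SagaLog {
    private final Map<String, Integer> nextStepBySaga = new HashMap<>();

    int nextStep(String sagaId) {
        return nextStepBySaga.getOrDefault(sagaId, 0);
    }

    void recordCompleted(String sagaId, int stepIndex) {
        nextStepBySaga.put(sagaId, stepIndex + 1);
    }
}

class ResumableOrchestrator {
    private final SagaLog log;
    private final List<Runnable> steps;

    ResumableOrchestrator(SagaLog log, List<Runnable> steps) {
        this.log = log;
        this.steps = steps;
    }

    // Runs the remaining steps, starting from whatever the log says is next.
    void run(String sagaId) {
        for (int i = log.nextStep(sagaId); i < steps.size(); i++) {
            steps.get(i).run();
            log.recordCompleted(sagaId, i);
        }
    }
}

class ResumeDemo {
    public static void main(String[] args) {
        List<String> executed = new ArrayList<>();
        boolean[] crashOnce = {true};
        List<Runnable> steps = List.of(
            () -> executed.add("pay"),
            () -> {
                if (crashOnce[0]) {
                    crashOnce[0] = false;
                    throw new IllegalStateException("orchestrator crashed");
                }
                executed.add("reserve");
            },
            () -> executed.add("ship"));

        SagaLog log = new SagaLog();
        try {
            new ResumableOrchestrator(log, steps).run("saga-1");
        } catch (IllegalStateException e) {
            // A fresh orchestrator instance picks up where the log left off.
            new ResumableOrchestrator(log, steps).run("saga-1");
        }
        System.out.println(executed); // prints [pay, reserve, ship]
    }
}
```

A crash between running a step and recording it means the step runs again on resume, so this design still depends on idempotent steps.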

Q5. How do you design compensating transactions for non-reversible side effects (like sending emails, push notifications, or SMS)?

Not all actions are reversible. Strategies:

  1. Log + Reconciliation → Track that an email was sent, but don’t attempt rollback. Instead, send a follow-up correction email.
  2. Deferred dispatch → don’t send non-reversible actions until the saga reaches a safe commit point.
  3. Semantic compensation → e.g., instead of “unsending an email,” send a “cancellation/updated” email.

  • Principle: Never try to undo what’s irreversible — design semantic alternatives.
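
Deferred dispatch can be sketched as a small buffer that saga steps write to instead of sending directly. The DeferredNotifications class is hypothetical, and the sender passed to it stands in for whatever email or SMS gateway the system uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class DeferredNotifications {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<String> sender; // hypothetical email/SMS gateway

    DeferredNotifications(Consumer<String> sender) {
        this.sender = sender;
    }

    // Called by saga steps instead of sending immediately.
    void enqueue(String message) {
        pending.add(message);
    }

    // Called once the saga has passed its last step that can fail.
    void commit() {
        pending.forEach(sender);
        pending.clear();
    }

    // Called when the saga compensates: nothing was sent, so nothing needs correcting.
    void discard() {
        pending.clear();
    }
}

class DeferredDispatchDemo {
    public static void main(String[] args) {
        List<String> outbox = new ArrayList<>();
        DeferredNotifications notifications = new DeferredNotifications(outbox::add);

        notifications.enqueue("Your booking is confirmed");
        notifications.discard(); // saga failed: the customer never sees the message
        System.out.println(outbox); // prints []

        notifications.enqueue("Your booking is confirmed");
        notifications.commit(); // saga succeeded: now it is safe to send
        System.out.println(outbox); // prints [Your booking is confirmed]
    }
}
```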

Q6. How would you detect and handle a stuck saga where one service does not respond indefinitely?

  • Use timeouts and deadlines at each step.
  • If a step exceeds its SLA, the orchestrator triggers compensation for the steps that have already completed.
  • Mechanisms:

  1. Dead-letter queues for failed messages.
  2. Retry with exponential backoff + circuit breakers.
  3. Monitoring & heartbeat mechanism → detect hung services.

  • In extreme cases, saga may escalate to a human/manual intervention workflow.
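
A per-step deadline can be sketched with CompletableFuture.orTimeout (available since Java 9). StepDeadline and completesInTime are illustrative names, and the sketch assumes steps are plain Runnables executed on the common pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class StepDeadline {
    // Returns true if the step finished within the deadline, false if it timed out.
    // Any other failure of the step propagates as a CompletionException.
    static boolean completesInTime(Runnable step, long deadlineMillis) {
        try {
            CompletableFuture.runAsync(step)
                .orTimeout(deadlineMillis, TimeUnit.MILLISECONDS)
                .join();
            return true;
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return false; // caller should start compensating completed steps
            }
            throw e;
        }
    }
}

class StuckSagaDemo {
    public static void main(String[] args) {
        Runnable hungService = () -> {
            try {
                Thread.sleep(500); // simulates a service that responds far too late
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        if (!StepDeadline.completesInTime(hungService, 50)) {
            System.out.println("Step timed out, compensating...");
        }
    }
}
```

The timeout only stops the orchestrator from waiting; the slow call may still finish later, so the compensation for that step has to tolerate a late success.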

Q7. How do you ensure consistency across long-running sagas that span hours or days, where intermediate states may change due to external events?

Challenge: Data may drift while the saga is running.

Solutions:

  1. Versioning / snapshot isolation → operations run against consistent versions of entities.
  2. Re-validation before commit → before finalizing, check if preconditions still hold.
  3. Compensation chaining → design compensations for partially outdated states.
  4. Event sourcing → replay domain events to rebuild a consistent state when saga resumes after long delays.

  • Example: An airline booking saga may hold seats for 24h. If the customer delays payment, the system must re-check availability before committing.
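
Re-validation before commit can be sketched with a version number that changes whenever the underlying data changes. The SeatInventory class below is a hypothetical in-memory stand-in; in a database this would usually be an optimistic-locking version column.

```java
class SeatInventory {
    private int available;
    private long version;

    SeatInventory(int available) {
        this.available = available;
    }

    synchronized long version() {
        return version;
    }

    synchronized int available() {
        return available;
    }

    // Any change made outside the saga bumps the version.
    synchronized void externalSale(int seats) {
        available -= seats;
        version++;
    }

    // Commits only if nothing changed since the saga read the version and enough seats remain.
    synchronized boolean commitBooking(long versionSeenBySaga, int seats) {
        if (version != versionSeenBySaga || available < seats) {
            return false; // preconditions no longer hold: re-read and retry, or compensate
        }
        available -= seats;
        version++;
        return true;
    }
}

class RevalidationDemo {
    public static void main(String[] args) {
        SeatInventory flight = new SeatInventory(2);
        long seen = flight.version();   // saga starts and remembers what it saw
        flight.externalSale(2);         // hours later, someone else buys the seats
        System.out.println(flight.commitBooking(seen, 1)); // prints false
    }
}
```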

Final Thoughts

Sagas aren’t about avoiding failure — they’re about failing predictably and recovering gracefully.

The Saga Design Pattern isn’t just theory — it’s the backbone of reliable distributed transactions in modern systems.

The key is not to treat it as a silver bullet. It comes with its own complexity, and designing good compensations is often harder than designing the happy path.

But once you understand it, Sagas give you a way to embrace microservices without sacrificing reliability.

At the end of the day, distributed systems are all about making trade-offs. Sagas simply make those trade-offs explicit — and manageable.

More articles by Er Deepak Kumar Kesharwani
