Resiliency Patterns in Distributed Systems: Circuit Breakers, Retries, and Beyond!

Rowshan Kibria

Published Aug 22, 2025

In the world of modern software, our applications are no longer monoliths. They are distributed systems: collections of microservices, databases, APIs, and caches, all talking to each other over a network. A transient glitch, a slow database, a third-party API having a bad day - any of these can cause a cascade of failures that takes our entire application down. This isn't a theoretical problem; it's a daily battle.

Why Resiliency Patterns?

In a perfect world, every API responds instantly, every database is always online, and the network never drops. But reality is different - services go down unexpectedly, networks get flaky, and sometimes a database takes too long to respond.

Imagine this: Service A calls Service B, but Service B is slow today. Requests from Service A start piling up, threads get blocked, and soon Service A runs out of capacity and crashes. Now Service C, which depends on Service A, also starts failing. What began as a small hiccup has now turned into a system-wide outage.

That’s where resiliency patterns come in. They don’t remove failures, but they help our system handle them gracefully. Think of them as airbags in a car: our goal is not to crash, but if we do, we want to survive. We need patterns to stop this kind of cascade:

Retry - Try again after failure, but with limits and backoff.
Circuit Breaker - Stop calling a broken service for a while.
Timeouts - Don’t wait forever; fail fast.

Life Before .NET Native Support

For years, .NET developers didn't have built-in solutions. Our hero was an open-source library called Polly. Polly was, and still is, brilliant. It gave us the power to define these policies with code:

// The "old way" (Polly v7-style policies) - powerful, but we had to wire everything up manually
// Requires the Polly NuGet package and a `using Polly;` directive
var retryPolicy = Policy
  .Handle<HttpRequestException>()
  .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var circuitBreakerPolicy = Policy
  .Handle<HttpRequestException>()
  .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

// Combining them was powerful but complex
var resiliencePolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

await resiliencePolicy.ExecuteAsync(async () => {
    // Your fragile API call here
    await httpClient.GetAsync(...);
});

This was a lifesaver! But it came with a cost: complexity. We had to manage policies, integrate them with HttpClient correctly, and figure out telemetry ourselves.

The Renaissance: .NET 9's Native Resilience

The .NET team saw how critical this was and did something amazing: they officially embraced Polly and built a first-class, streamlined experience around it. In .NET 8 and 9, resiliency isn't a third-party add-on; it's a supported feature, delivered through the Microsoft.Extensions.Http.Resilience and Microsoft.Extensions.Resilience NuGet packages. These packages are a first-party, official wrapper and integration layer on top of the Polly library, and that is a game-changer. Polly remains the underlying, powerful, open-source resiliency library that provides the core implementations of the Retry, Circuit Breaker, Timeout, Rate Limiter (the successor to Bulkhead in Polly v8), and Hedging strategies.

1. Standard Resilience Handler (AddStandardResilienceHandler)

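Purpose: The "classic" pattern for correctness. Handles faults by retrying failed requests. Best for operations that are safe to repeat (e.g., GET, or idempotent PUT and DELETE). Retrying a non-idempotent POST can duplicate its side effects, so treat those calls with care.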

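// Needs: using Polly; (for DelayBackoffType) and the Microsoft.Extensions.Http.Resilience package.
// "builder" is the application's host builder; "catalog" is a placeholder client name.
builder.Services.AddHttpClient("catalog")
    .AddStandardResilienceHandler(options =>
    {
        // Property names follow the released package: Retry, CircuitBreaker, TotalRequestTimeout.
        options.Retry.MaxRetryAttempts = 3;
        options.Retry.BackoffType = DelayBackoffType.Exponential;
        options.Retry.Delay = TimeSpan.FromMilliseconds(500); // base delay; keeps the backoff inside the 10 s budget
        options.CircuitBreaker.FailureRatio = 0.3; // Trip at 30% failure rate
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
    });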

RetryOptions

Purpose: To handle transient (temporary) failures automatically. These are failures that might succeed if you just try again, like a network glitch, a momentary timeout, or a deadlocked database connection.

MaxRetryAttempts: The maximum number of times to retry a failed request.
Delay: The base delay between retry attempts (e.g., TimeSpan.FromSeconds(2)).
BackoffType: How the delay grows: Constant, Linear, or Exponential.
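
To make the three backoff shapes concrete, here is a minimal sketch that prints the delay schedule for a 2-second base delay and 3 retries. The helper name RetryDelays and its growth rules are assumptions for illustration: they follow the usual meaning of constant, linear, and exponential backoff and ignore jitter, so the delays a real handler produces may differ.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns the wait before each retry, without jitter.
// attempt is zero-based, so the first retry waits exactly the base delay.
static TimeSpan[] RetryDelays(string backoff, TimeSpan baseDelay, int maxRetryAttempts) =>
    Enumerable.Range(0, maxRetryAttempts)
        .Select(attempt => backoff switch
        {
            "Constant"    => baseDelay,                         // same wait every time
            "Linear"      => baseDelay * (attempt + 1),         // 1x, 2x, 3x, ...
            "Exponential" => baseDelay * Math.Pow(2, attempt),  // 1x, 2x, 4x, ...
            _ => throw new ArgumentOutOfRangeException(nameof(backoff)),
        })
        .ToArray();

foreach (var kind in new[] { "Constant", "Linear", "Exponential" })
{
    var delays = RetryDelays(kind, TimeSpan.FromSeconds(2), 3);
    Console.WriteLine($"{kind}: {string.Join(", ", delays.Select(d => d.TotalSeconds + "s"))}");
}
// Constant: 2s, 2s, 2s
// Linear: 2s, 4s, 6s
// Exponential: 2s, 4s, 8s
```

Exponential backoff is usually the sensible default for remote calls, because it backs off quickly when a dependency stays unhealthy.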

CircuitBreakerOptions

Purpose: To stop making requests to a service that is failing or overwhelmed. It's a proactive pattern designed to prevent catastrophic failure and give the failing service time to recover.

FailureRatio: The ratio of failures (e.g., 0.5 for 50%) needed to open (trip) the circuit.
SamplingDuration: The time window over which failures are sampled.
MinimumThroughput: The minimum number of actions in the sampling window before the circuit can trip.
BreakDuration: How long the circuit stays open and blocks requests before it lets a trial request through.
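
The way FailureRatio and MinimumThroughput interact is easiest to see with numbers. The sketch below is a simplified decision rule written for illustration, not the library's implementation; ShouldOpenCircuit is a hypothetical helper, and the counts are assumed to be those observed within one SamplingDuration window.

```csharp
using System;

// Hypothetical helper: decides whether a breaker would open for one sampling window.
// The circuit opens only when the window saw enough traffic AND the share of
// failed calls reached the configured ratio.
static bool ShouldOpenCircuit(int failures, int totalCalls, double failureRatio, int minimumThroughput)
{
    if (totalCalls < minimumThroughput) return false;      // too little data to judge
    return (double)failures / totalCalls >= failureRatio;  // enough data: compare the ratio
}

// 3 of 4 calls failed (75%), but 4 calls is below the minimum throughput of 10.
Console.WriteLine(ShouldOpenCircuit(failures: 3, totalCalls: 4, failureRatio: 0.5, minimumThroughput: 10));   // False

// 6 of 20 calls failed (30%): enough traffic, but below the 50% ratio.
Console.WriteLine(ShouldOpenCircuit(failures: 6, totalCalls: 20, failureRatio: 0.5, minimumThroughput: 10));  // False

// 12 of 20 calls failed (60%): enough traffic and above the ratio.
Console.WriteLine(ShouldOpenCircuit(failures: 12, totalCalls: 20, failureRatio: 0.5, minimumThroughput: 10)); // True
```

MinimumThroughput is what keeps a quiet service from tripping the breaker on a single unlucky request.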

TimeoutOptions

Purpose: To ensure a single request doesn't hang forever waiting for a response. It sets a strict time limit for each individual attempt, including each retry. In the released package the standard handler exposes this as AttemptTimeout.

Timeout: The timeout for each individual attempt.

TotalRequestTimeoutOptions

Purpose: To set a hard limit on the total time a user (or calling service) has to wait for an operation, including all retry attempts.

Timeout: The overall timeout for the entire operation, including all retries.
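
The two timeouts are configured side by side. The fragment below is a sketch under stated assumptions: it uses the property names of the released Microsoft.Extensions.Http.Resilience package (AttemptTimeout and TotalRequestTimeout), "orders" is a hypothetical client name, and the values are arbitrary.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

services.AddHttpClient("orders") // hypothetical client name
    .AddStandardResilienceHandler(options =>
    {
        // Each individual attempt (the first call and every retry) gets 3 seconds.
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(3);

        // The caller never waits longer than 30 seconds overall,
        // no matter how many attempts and backoff delays happen inside.
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(30);
    });
```

Keep the per-attempt value well below the total, otherwise a single slow attempt can consume the whole budget and leave no room for a retry.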

2. Standard Hedging Handler (AddStandardHedgingHandler)

Purpose: The "performance" pattern for speed. Proactively sends request replicas to beat latency. Best for read-only, idempotent operations (e.g., GET).

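// "builder" is the application's host builder; "catalog" is a placeholder client name.
builder.Services.AddHttpClient("catalog")
    .AddStandardHedgingHandler()
    .Configure(options =>
    {
        // In the released package the options are set through Configure(...), and
        // MaxHedgedAttempts counts hedges in addition to the original request.
        options.Hedging.MaxHedgedAttempts = 3; // 1 original + up to 3 hedges
        options.Hedging.Delay = TimeSpan.FromMilliseconds(400);
        options.Endpoint.CircuitBreaker.FailureRatio = 0.5;
        options.Endpoint.Timeout.Timeout = TimeSpan.FromSeconds(1); // per attempt; kept below the total timeout
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(3);
    });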

HedgingOptions

Purpose: To reduce tail latency (the very slowest requests) by firing multiple, redundant requests to one or more services after a short delay, and using the first successful response that comes back.

MaxHedgedAttempts: The maximum number of hedged requests to send in addition to the original. If set to 2, the pattern will send up to 3 identical requests in total (the original + 2 "hedged" requests). It will accept the first successful response it gets from any of them.
Delay (the hedging delay): How long to wait for an attempt to complete before hedging. For example, with a 500 ms delay: if the first request succeeds in 100 ms, a hedge is never sent, which is efficient. If the first request is still ongoing after 500 ms, the pattern fires off a second request to the same (or a different) service instance. If neither has returned after another 500 ms, it may fire a third (if the attempt limit allows it).
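
The timing described above can be worked out directly. This is a small sketch, assuming that no response arrives in time, that each hedge launches one hedging delay after the previous attempt, and that the attempt limit counts hedges in addition to the original request; LaunchOffsets is a hypothetical helper, and a real handler may also hedge earlier when an attempt fails outright.

```csharp
using System;
using System.Linq;

// Hypothetical helper: launch time of every request, relative to the first one,
// for the slow case where nothing comes back before the next hedge is due.
static TimeSpan[] LaunchOffsets(TimeSpan hedgingDelay, int maxHedgedAttempts) =>
    Enumerable.Range(0, maxHedgedAttempts + 1)   // +1 for the original request
        .Select(i => hedgingDelay * i)           // 0, 1x delay, 2x delay, ...
        .ToArray();

var offsets = LaunchOffsets(TimeSpan.FromMilliseconds(500), maxHedgedAttempts: 2);
Console.WriteLine(string.Join(", ", offsets.Select(o => $"{o.TotalMilliseconds}ms")));
// prints "0ms, 500ms, 1000ms"
```

In the fast case (a response within the first 500 ms) only the request at offset 0 is ever sent, which is why hedging adds little load when the backend is healthy.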

EndpointOptions

Purpose: To intelligently decide where to send the hedged requests. When you hedge, you don't necessarily want to send the duplicate request to the exact same unhealthy server. That would be pointless. This is where EndpointOptions and its CircuitBreakerOptions come in.

How it works: The system maintains a separate circuit breaker for each individual endpoint (e.g., each instance of a backend service: Service-A/Instance-1, Service-A/Instance-2).
The Benefit: When the hedging pattern decides to send a hedged request, the underlying client can use this information to route the hedged request to a different, healthier instance than the first request was sent to.
CircuitBreakerOptions: A circuit breaker for each endpoint. Routes requests away from failing instances.
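
For hedges to land on a different instance, the handler needs to know which endpoints exist. The fragment below is a sketch of one way to declare them, assuming the routing API of the Microsoft.Extensions.Http.Resilience package as I understand it (ConfigureOrderedGroups, UriEndpointGroup, WeightedUriEndpoint); verify the names against the package version you use. The client name and both URLs are hypothetical.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;

var services = new ServiceCollection();

services.AddHttpClient("catalog") // hypothetical client name
    .AddStandardHedgingHandler(routing =>
    {
        // Groups are tried in order: the original request goes to the first
        // group, and a hedged request moves on to the next group.
        routing.ConfigureOrderedGroups(options =>
        {
            options.Groups.Add(new UriEndpointGroup
            {
                Endpoints = { new WeightedUriEndpoint { Uri = new Uri("https://instance-1.example.net"), Weight = 100 } }
            });
            options.Groups.Add(new UriEndpointGroup
            {
                Endpoints = { new WeightedUriEndpoint { Uri = new Uri("https://instance-2.example.net"), Weight = 100 } }
            });
        });
    });
```

With this shape, a slow or failing instance-1 does not receive the hedge as well; the second attempt goes to instance-2.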

TimeoutOptions

Timeout: The timeout for each individual attempt.

TotalRequestTimeoutOptions

Timeout: The overall timeout for the entire hedging operation.

Key Difference: Notice the circuit breaker nested under EndpointOptions. This is why hedging is so powerful. It doesn't just have one circuit breaker for the whole service; it has a dedicated one per endpoint, allowing it to intelligently route away from unhealthy instances. Hedging is our secret weapon for making our applications feel fast and responsive, even when parts of the system are having a slow day.

Closing Thoughts

Resiliency patterns are not “nice-to-have” anymore. They’re essential. What once required hand-wiring a third-party library is now a first-party, supported part of the .NET ecosystem. The journey from complex Polly configurations to AddStandardResilienceHandler and AddStandardHedgingHandler in .NET 8 and 9 is a huge win for developers. It allows us to focus on our business logic, knowing that our applications are armed with the patterns they need to survive in the chaotic reality of a distributed world.
