Resiliency Patterns in Distributed Systems: Circuit Breakers, Retries, and Beyond!
In the world of modern software, our applications are no longer monoliths. They are distributed systems: collections of microservices, databases, APIs, and caches, all talking to each other over a network. A transient glitch, a slow database, a third-party API having a bad day - any of these can trigger a cascade of failures and take the entire application down. This isn't a theoretical problem; it's a daily battle.
Why Resiliency Patterns?
In a perfect world, every API responds instantly, every database is always online, and the network never drops. But reality is different - services go down unexpectedly, networks get flaky, and sometimes a database takes too long to respond.
Imagine this: Service A calls Service B, but Service B is slow today. Requests from Service A start piling up, threads get blocked, and soon Service A runs out of capacity and crashes. Now Service C, which depends on Service A, also starts failing. What began as a small hiccup has now turned into a system-wide outage.
That’s where resiliency patterns come in. They don’t remove failures, but they help our system handle them gracefully. Think of them as airbags in a car: the goal is not to crash, but if we do, we want to survive. We need patterns that stop a small failure from spreading.
Life Before .NET Native Support
For years, .NET developers didn't have built-in solutions. Our hero was an open-source library called Polly. Polly was, and still is, brilliant. It gave us the power to define these policies with code:
// The "old way" (Polly v7 API) - powerful, but we had to wire everything up manually
using Polly;

// Retry up to 3 times, waiting 2, 4, then 8 seconds between attempts
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Open the circuit for 30 seconds after 5 consecutive failures
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

// Combining them was powerful but complex: the retry policy wraps the circuit breaker
var resiliencePolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

await resiliencePolicy.ExecuteAsync(async () =>
{
    // Your fragile API call here
    await httpClient.GetAsync(...);
});
This was a lifesaver! But it came with a cost: complexity. We had to manage policies, integrate them with HttpClient correctly, and figure out telemetry ourselves.
The Renaissance: .NET 9's Native Resilience
The .NET team saw how critical this was and did something amazing: they officially embraced Polly and built a first-class, streamlined experience on top of it. In .NET 8 and 9, resiliency isn't a third-party add-on; it's a supported part of the platform. The Microsoft.Extensions.Http.Resilience and Microsoft.Extensions.Resilience NuGet packages are a first-party integration layer on top of Polly, and that is a game-changer. Polly remains the underlying open-source library that provides the core implementations of the Retry, Circuit Breaker, Timeout, Rate Limiter (which replaces the Bulkhead policy from Polly v7), and Hedging strategies.
1. Standard Resilience Handler (AddStandardResilienceHandler)
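Purpose: The "classic" pattern for correctness. It combines a rate limiter, a total request timeout, retries, a circuit breaker, and a per-attempt timeout into one pipeline. Retries are safest for idempotent operations (e.g., GET, PUT, DELETE); a retried POST can be applied twice unless the endpoint is designed to tolerate duplicates.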
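// Chained onto an IHttpClientBuilder, e.g. the result of services.AddHttpClient(...)
// Property names follow the released 8.x packages (Retry, CircuitBreaker, ...), not the preview *Options names
.AddStandardResilienceHandler(options =>
{
    options.Retry.MaxRetryAttempts = 3;
    options.Retry.BackoffType = DelayBackoffType.Exponential;
    options.Retry.Delay = TimeSpan.FromMilliseconds(500); // base delay; total backoff (about 0.5 + 1 + 2 s) stays inside the 10 s budget
    options.CircuitBreaker.FailureRatio = 0.3; // Trip at 30% failure rate
    options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(3); // per attempt, so one slow call cannot use up the whole budget
    options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
});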
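For context, the chained call above attaches to an HttpClient registration. The following is a minimal sketch of that wiring with the handler's default settings, assuming the Microsoft.Extensions.Http.Resilience NuGet package is referenced; the client name "catalog" and the base address are placeholders invented for this example.

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// "catalog" and the URL are hypothetical values used only for illustration.
services.AddHttpClient("catalog", c => c.BaseAddress = new Uri("https://catalog.example.com"))
    .AddStandardResilienceHandler(); // rate limiter, total timeout, retry, circuit breaker, attempt timeout

using var provider = services.BuildServiceProvider();

// Every request sent through this client now passes through the resilience pipeline.
HttpClient http = provider.GetRequiredService<IHttpClientFactory>().CreateClient("catalog");
Console.WriteLine(http.BaseAddress);
```

In an ASP.NET Core app the same call would typically hang off builder.Services.AddHttpClient(...) instead of a hand-built ServiceCollection.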
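Retry (options.Retry)
Purpose: To handle transient (temporary) failures automatically. These are failures that may succeed if you simply try again, such as a network glitch, a momentary timeout, or a briefly overloaded service. For HTTP calls, the default strategy treats HttpRequestException, timeouts, and 408, 429, and 5xx responses as transient.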
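To make the retry idea concrete, here is a minimal, library-free sketch of retrying with exponential backoff. It is not how Polly implements its strategy (there is no jitter, for instance), and RetrySketch, BackoffDelay, and ExecuteAsync are names made up for this illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Demo: an operation that fails twice with a transient error, then succeeds.
int calls = 0;
string result = await RetrySketch.ExecuteAsync(
    () =>
    {
        calls++;
        if (calls < 3) throw new HttpRequestException("simulated transient failure");
        return Task.FromResult("ok");
    },
    maxRetryAttempts: 3,
    baseDelay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"{result} after {calls} calls"); // ok after 3 calls

static class RetrySketch
{
    // Delay before retry number `retry` (0-based): baseDelay * 2^retry.
    public static TimeSpan BackoffDelay(TimeSpan baseDelay, int retry) =>
        TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * Math.Pow(2, retry));

    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation, int maxRetryAttempts, TimeSpan baseDelay)
    {
        for (int retry = 0; ; retry++)
        {
            try
            {
                return await operation();
            }
            catch (HttpRequestException) when (retry < maxRetryAttempts)
            {
                // Transient failure with retries left: wait, then loop again.
                await Task.Delay(BackoffDelay(baseDelay, retry));
            }
        }
    }
}
```

Once the retries are used up, the exception filter no longer matches and the last failure propagates to the caller.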
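CircuitBreaker (options.CircuitBreaker)
Purpose: To stop sending requests to a service that is failing or overwhelmed. Once the failure ratio within the sampling window crosses the threshold, the circuit opens and calls fail fast for the break duration; a trial request then decides whether it closes again. This prevents a cascading failure and gives the struggling service time to recover.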
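The closed, open, and half-open cycle can be sketched in a few lines. This is a deliberately simplified illustration: it counts consecutive failures rather than tracking a failure ratio over a sampling window as the real strategy does, it is not thread-safe, and CircuitBreakerSketch and its members are hypothetical names.

```csharp
using System;

// Demo with a controllable clock so the state changes are deterministic.
var now = DateTimeOffset.UnixEpoch;
var breaker = new CircuitBreakerSketch(
    failureThreshold: 2, breakDuration: TimeSpan.FromSeconds(30), clock: () => now);

breaker.RecordFailure();
breaker.RecordFailure();                    // threshold reached: circuit opens
Console.WriteLine(breaker.AllowRequest());  // False: calls are rejected immediately

now += TimeSpan.FromSeconds(31);            // break duration has elapsed
Console.WriteLine(breaker.AllowRequest());  // True: a trial call is let through (half-open)
breaker.RecordSuccess();                    // trial succeeded: circuit closes
Console.WriteLine(breaker.State);           // Closed

enum CircuitState { Closed, Open, HalfOpen }

sealed class CircuitBreakerSketch
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTimeOffset> _clock;
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public CircuitBreakerSketch(int failureThreshold, TimeSpan breakDuration, Func<DateTimeOffset> clock)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
        _clock = clock;
    }

    public bool AllowRequest()
    {
        // After the break duration, move to half-open so a trial request can probe the service.
        if (State == CircuitState.Open && _clock() - _openedAt >= _breakDuration)
            State = CircuitState.HalfOpen;
        return State != CircuitState.Open;
    }

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        State = CircuitState.Closed;
    }

    public void RecordFailure()
    {
        _consecutiveFailures++;
        // A failed trial call, or too many failures in a row, opens the circuit.
        if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            State = CircuitState.Open;
            _openedAt = _clock();
        }
    }
}
```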
AttemptTimeout (options.AttemptTimeout)
Purpose: To ensure a single attempt doesn't hang forever waiting for a response. It sets a strict time limit for each individual attempt, retries included.
TotalRequestTimeout (options.TotalRequestTimeout)
Purpose: To set a hard limit on the total time a user (or calling service) has to wait for an operation, including all retry attempts.
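The relationship between the two timeouts can be sketched with linked cancellation tokens: one token bounds the whole operation, and each attempt gets a shorter deadline linked to it. This is an illustrative sketch only (it retries immediately, with no backoff), and TimeoutSketch and its parameters are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int attempts = 0;
try
{
    await TimeoutSketch.ExecuteAsync(
        async token =>
        {
            attempts++;
            await Task.Delay(Timeout.Infinite, token); // simulates a call that never returns
            return "unreachable";
        },
        attemptTimeout: TimeSpan.FromMilliseconds(50),
        totalTimeout: TimeSpan.FromMilliseconds(200));
}
catch (TimeoutException ex)
{
    Console.WriteLine($"{ex.Message} (attempts started: {attempts})");
}

static class TimeoutSketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation, TimeSpan attemptTimeout, TimeSpan totalTimeout)
    {
        // One token source bounds the whole operation, retries included.
        using var totalCts = new CancellationTokenSource(totalTimeout);

        while (true)
        {
            // Each attempt gets its own, shorter deadline linked to the total one.
            using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(totalCts.Token);
            attemptCts.CancelAfter(attemptTimeout);
            try
            {
                return await operation(attemptCts.Token);
            }
            catch (OperationCanceledException) when (totalCts.IsCancellationRequested)
            {
                // The overall budget is spent: stop retrying and report it.
                throw new TimeoutException("Total request timeout exceeded");
            }
            catch (OperationCanceledException)
            {
                // Only this attempt timed out: loop and try again.
            }
        }
    }
}
```

The exact number of attempts depends on timer scheduling, but the caller never waits meaningfully longer than the total timeout.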
2. Standard Hedging Handler (AddStandardHedgingHandler)
Purpose: The "performance" pattern for speed. Proactively sends request replicas to beat latency. Best for read-only, idempotent operations (e.g., GET).
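// Chained onto an IHttpClientBuilder; the hedging options are set through Configure(...)
.AddStandardHedgingHandler()
.Configure(options =>
{
    options.Hedging.MaxHedgedAttempts = 3; // 1 original + up to 3 hedges
    options.Hedging.Delay = TimeSpan.FromMilliseconds(400);
    options.Endpoint.CircuitBreaker.FailureRatio = 0.5;
    options.Endpoint.Timeout.Timeout = TimeSpan.FromSeconds(1); // per-attempt limit, kept below the total budget
    options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(3);
});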
Hedging (options.Hedging)
Purpose: To reduce tail latency (the very slowest requests). If the first attempt has not answered within the hedging delay, additional redundant requests are fired at one or more endpoints, and the first successful response that comes back is used.
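The hedging idea can be sketched without any library: start one attempt, launch another whenever the hedging delay passes without an answer, and return the first success. This is a simplified illustration rather than Polly's implementation, and HedgingSketch and its parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Demo: the first attempt is slow, the hedged second attempt answers quickly.
int started = 0;
string winner = await HedgingSketch.ExecuteAsync(
    async token =>
    {
        int attempt = Interlocked.Increment(ref started);
        await Task.Delay(attempt == 1 ? 2000 : 20, token);
        return $"attempt {attempt}";
    },
    maxHedgedAttempts: 2,
    hedgingDelay: TimeSpan.FromMilliseconds(100));

Console.WriteLine(winner); // attempt 2

static class HedgingSketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation, int maxHedgedAttempts, TimeSpan hedgingDelay)
    {
        using var cts = new CancellationTokenSource();
        var inFlight = new List<Task<T>> { operation(cts.Token) };
        int hedgesLeft = maxHedgedAttempts;

        while (true)
        {
            Task<Task<T>> anyAttempt = Task.WhenAny(inFlight);

            // If nothing finishes within the hedging delay, launch one more parallel attempt.
            if (hedgesLeft > 0 && await Task.WhenAny(anyAttempt, Task.Delay(hedgingDelay)) != anyAttempt)
            {
                inFlight.Add(operation(cts.Token));
                hedgesLeft--;
                continue;
            }

            Task<T> completed = await anyAttempt;
            if (completed.IsCompletedSuccessfully)
            {
                cts.Cancel(); // first success wins; cancel the slower attempts
                return completed.Result;
            }

            // This attempt failed: drop it and keep waiting on the others.
            inFlight.Remove(completed);
            if (inFlight.Count == 0)
            {
                if (hedgesLeft == 0) return await completed; // everything failed: rethrow the last error
                inFlight.Add(operation(cts.Token));
                hedgesLeft--;
            }
        }
    }
}
```

Because several copies of the same request can be in flight at once, this only makes sense for operations that are safe to execute more than once.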
Endpoint (options.Endpoint)
Purpose: To intelligently decide where to send the hedged requests. When you hedge, you don't necessarily want to send the duplicate request to the exact same unhealthy server. That would be pointless. This is where the Endpoint options and their per-endpoint CircuitBreaker come in.
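Endpoint.Timeout and TotalRequestTimeout
Purpose: These play the same roles as the attempt and total timeouts in the standard handler: Endpoint.Timeout limits each individual attempt, and TotalRequestTimeout caps the whole operation, hedged attempts included.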
Key Difference: Notice the circuit breaker nested under the endpoint options. This is why hedging is so powerful. It doesn't just have one circuit breaker for the service; it keeps a pool of them, by default one per URL authority (scheme, host, and port), so it can route away from unhealthy instances when more than one endpoint is configured. Hedging is our secret weapon for making applications feel fast and responsive, even when parts of the system are having a slow day.
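As a rough illustration of keeping one breaker per endpoint, the sketch below derives a lookup key from the request URI's authority. It shows the keying idea only, under the assumption that scheme, host, and port identify an endpoint; BreakerKey is a hypothetical helper, not a type from the library.

```csharp
using System;

// Two requests to the same host share one key; a different host and port gets its own.
Console.WriteLine(BreakerKey.For(new Uri("https://a.example.com/orders/1")));    // https://a.example.com
Console.WriteLine(BreakerKey.For(new Uri("https://a.example.com/orders/2")));    // https://a.example.com
Console.WriteLine(BreakerKey.For(new Uri("https://b.example.com:8443/orders"))); // https://b.example.com:8443

static class BreakerKey
{
    // Scheme + host + port, ignoring path and query.
    public static string For(Uri requestUri) => requestUri.GetLeftPart(UriPartial.Authority);
}
```

A dictionary from this key to a circuit breaker instance would then let failures on one host open only that host's circuit.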
Closing Thoughts
Resiliency patterns are not “nice-to-have” anymore. They’re essential. What once required hand-wired third-party policies is now available through official Microsoft.Extensions packages. The journey from complex Polly configurations to AddStandardResilienceHandler and AddStandardHedgingHandler in .NET 8 and 9 is a huge win for developers. It allows us to focus on our business logic, knowing that our applications are armed with the patterns they need to survive in the chaotic reality of a distributed world.