How overusing the SOLID principles turned a one-line fix into a 14-hour outage

This is a story about a team that knew SOLID cold, and it still cost them a 14-hour production outage. It happened earlier this year, and it's the reason I'm writing this article.

A fintech processing ~$2.3B in daily interbank settlements went dark. Root cause? A single timeout value. The fix? One line. But finding that line took the entire team 11 hours, because the Dependency Inversion principle (https://en.wikipedia.org/wiki/Dependency_inversion_principle) had been applied so religiously that the codebase became a black box. Imagine being handed hundreds of locked doors and a ring of unlabeled keys: a terrible setup when you need to get inside 😇

I was brought in as a consultant to fix the issue. What I found was not a story about bad developers. It was a story about good developers who took "depend on abstractions, not concretions" and ran with it until the concretions were nowhere to be found.

The stack

The settlement engine: C++17, Boost.Asio, Boost.Beast, gRPC, RocksDB, and - here's the red flag - a hand-rolled 2,000-line ServiceLocator that wired every interface to its implementation at startup.

Five engineers, all senior. A couple of years earlier they'd adopted SOLID for testability. Good idea. But the team fell hardest for the "D" in SOLID -- the Dependency Inversion Principle (DIP): every boundary got a pure virtual interface, every concrete type lived behind the container, and code reviews treated direct instantiation like a health code violation.

What went wrong

Early morning UTC. The settlement engine starts rejecting reconciliation messages from a counterparty bank. Our AI-based bot immediately opens an incident and starts chasing people in the corporate messenger (yeah, we made it quite aggressive to push lazy folks 😉):

[WARN] ReconciliationMessageHandler: message dropped — upstream timeout exceeded
        

Config says the timeout is 30 seconds. The counterparty responds in ~800ms. Makes no sense. The on-call engineer checks network, DNS, TLS certs. Everything looks fine. We keep digging...

The challenge

To understand why finding the bug took so long, you need to see what DIP did to this codebase.

Transport layer: Instead of calling Boost.Beast directly (think of it as C++'s answer to libcurl - HTTP/WebSocket library built on top of Boost.Asio's async I/O), the team built ITransportLayer, ITransportFactory, ITransportConfigurator, ITransportHealthChecker, ITransportMetricsCollector, ITransportRetryPolicy.

6 interfaces. One implementation each. Only HTTPS was ever used. That's not abstraction - that's bureaucracy!

Message pipeline: Every incoming message passed through 23 IMessageProcessor stages wired via the service container. The pipeline factory alone was 400 lines of resolve<ISomething>() calls:

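The original snippet was an image, but container-driven wiring generally looks something like this minimal sketch (the types and names here are illustrative, not the team's actual code):

#include <memory>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <vector>

// Minimal stand-ins so the sketch is self-contained; the real codebase had a
// hand-rolled ~2,000-line ServiceLocator.
struct IMessageProcessor { virtual ~IMessageProcessor() = default; };
struct ISchemaValidator    : IMessageProcessor {};
struct IIdempotencyChecker : IMessageProcessor {};
struct ITimeoutEnforcer    : IMessageProcessor {};
struct IAuditLogger        : IMessageProcessor {};

class ServiceLocator {
public:
    template <typename T>
    void registerInstance(std::shared_ptr<T> impl) {
        registry_[std::type_index(typeid(T))] = std::move(impl);
    }
    template <typename T>
    std::shared_ptr<T> resolve() const {
        // Which concrete type comes back is decided by a registration made
        // somewhere else entirely - grep can't follow this.
        return std::static_pointer_cast<T>(registry_.at(std::type_index(typeid(T))));
    }
private:
    std::unordered_map<std::type_index, std::shared_ptr<IMessageProcessor>> registry_;
};

// The real factory was ~400 lines of calls shaped exactly like these.
std::vector<std::shared_ptr<IMessageProcessor>> buildPipeline(const ServiceLocator& locator) {
    return {
        locator.resolve<ISchemaValidator>(),
        locator.resolve<IIdempotencyChecker>(),
        locator.resolve<ITimeoutEnforcer>(),
        // ...19 more resolve<ISomething>() calls...
        locator.resolve<IAuditLogger>(),
    };
}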

Every resolve<>() hides a concrete type behind a lookup registered elsewhere. You can't grep for the actual call chain. You can't step through it without a debugger and a strong coffee or maybe whisky 😂

Timeout handling - and this is where the fun stops. The team split timeouts across 6 virtual interfaces, each with one implementation:

ITimeoutPolicy -> ITimeoutCalculator -> ITimeoutBudgetAllocator -> ITimeoutMonitor -> ITimeoutStrategy -> ITimeoutMetrics

6 interfaces merely to enforce a timeout, gosh...
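
To make the sprawl concrete, here's a rough approximation of that layering (the interface names come from the incident, but the method signatures are invented for illustration):

#include <chrono>
#include <cstddef>

// Only the interface names are real; the signatures are invented to show the shape.
struct ITimeoutPolicy {
    virtual ~ITimeoutPolicy() = default;
    virtual std::chrono::milliseconds totalBudget() const = 0;
};

struct ITimeoutBudgetAllocator {
    virtual ~ITimeoutBudgetAllocator() = default;
    virtual std::chrono::milliseconds perStageAllowance(std::size_t stageCount) const = 0;
};

// ...plus ITimeoutCalculator, ITimeoutMonitor, ITimeoutStrategy and
// ITimeoutMetrics, each a pure virtual interface with exactly one concrete
// implementation hiding behind the container.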

Here's the key idea: instead of giving the whole pipeline a simple 30-second deadline (and letting each stage take whatever time it needs within that), the timeout architecture divided the 30 seconds equally across every stage - like splitting a restaurant bill evenly even though one person ordered a steak and everyone else had water.

total_budget  = 30 seconds (from config)
stages        = 23 processors in the pipeline
per_stage_allowance = 30 / 23 = ~1.3 seconds

If ANY single stage takes longer than 1.3s → entire message killed        
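
In code terms, the allocator boiled down to something like this (a reconstruction of the logic described above, not the original source):

#include <chrono>

// Reconstruction of the budget split, not the original source.
std::chrono::milliseconds perStageAllowance(std::chrono::milliseconds totalBudget,
                                            int stageCount) {
    // Every stage gets the same slice, whether it finishes in 50 microseconds
    // or makes a network call to another bank.
    return totalBudget / stageCount;
}

// perStageAllowance(std::chrono::seconds{30}, 19) -> ~1578 ms: the bank's 1.4s call still fits.
// perStageAllowance(std::chrono::seconds{30}, 23) -> ~1304 ms: the same 1.4s call gets killed.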

Here's what that looks like visually. What the on-call engineer thought was happening:

  What the config says:  30s timeout
  What the bank needs:   1.4s

  |█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 30s
   ^bank (1.4s)                        Plenty of room!        

What was actually happening - the 30s split into 23 equal slices:

  |█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|
  each █ = 1.3s allowance

  Stage 1:  validate schema   [0.0001s] ✓  fits easily
  Stage 2:  check idempotency [0.0003s] ✓  fits easily
  ...
  Stage 14: HTTP call to bank  [1.4s]   ✗  KILLED (needs 1.4s, allowed 1.3s)
  ...
  Stage 23: write audit log   [0.0002s] ✓  never reached        

The network call, the only stage that actually NEEDS time, gets the same tiny slice as the stages that finish in microseconds. The 30-second total budget was never even close to being exhausted: the message was killed with 28.6 seconds of its allocated timeout still remaining.

Three months before the incident, a developer added four lightweight regulatory processors. They finished in microseconds, but they bumped the stage count from 19 to 23, silently shrinking every stage's allowance from ~1.58s to ~1.3s. Nobody noticed; after all, you just added a processor that runs in 50 microseconds. What could go wrong?

That morning, the counterparty deployed a database migration that temporarily pushed their response time from 800ms to 1.4 seconds. The 30-second total timeout was nowhere near being hit. But the per-stage allowance was 1.3 seconds, and the bank needed 1.4. Every single message timed out.

One developer adds 4 tiny processors; a few months later, an unrelated deployment at a partner bank triggers a full outage. All because a division quietly shrank the per-stage allowance from ~1.58s down to 1.3s.

The 11-hour debugging marathon

This is the timeline from the AI-driven troubleshooting channel:

Hour 1-2: On-call engineer finds pipeline.timeout.base = 30s in config. Counterparty latency: 1.4s. "Timeout can't be the issue." Starts chasing network ghosts.

Hour 3-4: A second team member greps for the error, lands in TimeoutEnforcementProcessor. Sees it checking a budget value typed as ITimeoutBudgetAllocator&. Searches for implementations - finds 3 (2 are dead code). Spends an hour tracing through 2000 lines of container registrations to figure out which one is actually running. In a codebase with direct instantiation, this is a 10-second grep.

Hour 5-7: Another developer tries to reproduce locally. Test container only has 5 processors. Per-stage budget: 6 seconds. Bug doesn't reproduce. By now the whole team is online, and they're starting to question reality.

Hour 8-9: They deploy extra logging to print the actual computed budget. Log shows 1.3 seconds. Everyone stares at the number. "Where does 1.3 come from?"

Hour 10-11: Someone counts processors in the production container. Twenty-three. Does the math: 30 / 23 = 1.3. Silence. Then several words that can't be printed here. That's when they called me around 2AM 🫣

The fix:

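Conceptually, the change was to stop handing the slow stage an equal slice and let it spend whatever was left of the real 30-second deadline. A hypothetical sketch of that shape (a guess at the spirit of the fix, not the team's actual diff):

#include <chrono>

// Hypothetical before/after, for illustration only.
std::chrono::milliseconds stageAllowance(std::chrono::milliseconds remainingBudget,
                                         [[maybe_unused]] int stagesLeft) {
    // Before: return remainingBudget / stagesLeft;   // equal slices, 30s / 23 ≈ 1.3s
    return remainingBudget;                           // After: a stage may use all remaining time
}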

What the Dependency Inversion Principle actually costs in C++

In Java, over-abstraction costs you readability. In C++, it costs you readability and performance. Each virtual interface means:

  • Heap allocation - std::unique_ptr<ITimeoutPolicy> instead of a TimeoutPolicy value on the stack. Goodbye value semantics, hello cache miss 😝
  • Defeated inlining - the compiler cannot see through a vtable. In a hot path doing tens of thousands of settlements per second, this is not theoretical; it's a well-known cost, even to mid-level C++ devs.
  • Invisible runtime types - "which implementation is actually running?" becomes a question that requires a debugger instead of grep.

C++17 gives you better tools. std::variant with std::visit provides type-safe polymorphism without heap allocation. Templates give you compile-time decoupling with zero runtime cost. You get dependency inversion benefits without paying the Java tax. But this team imported the Java pattern wholesale and never asked whether C++ had a better way.
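
For contrast, here's a small sketch of the C++17 alternative: two concrete policies as value types in a std::variant, dispatched with std::visit (the types are illustrative, not the production code):

#include <algorithm>
#include <chrono>
#include <type_traits>
#include <variant>

// Two concrete timeout policies as plain value types: no heap, no vtable.
struct FixedTimeout    { std::chrono::milliseconds value; };
struct AdaptiveTimeout { std::chrono::milliseconds floor;
                         std::chrono::milliseconds observedLatency; };

using TimeoutPolicy = std::variant<FixedTimeout, AdaptiveTimeout>;

// std::visit dispatches over a closed set of alternatives; the compiler sees
// both branches and can inline them.
std::chrono::milliseconds deadlineFor(const TimeoutPolicy& policy) {
    return std::visit([](const auto& p) -> std::chrono::milliseconds {
        if constexpr (std::is_same_v<std::decay_t<decltype(p)>, FixedTimeout>) {
            return p.value;
        } else {
            return std::max(p.floor, p.observedLatency * 3);
        }
    }, policy);
}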

What we did after the incident

After the incident, we replaced the 6 timeout interfaces with this:

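A sketch along the same lines (one concrete, stack-allocated deadline type instead of six virtual interfaces; names are mine, not the production code):

#include <chrono>

// One concrete value type carries the deadline for the whole pipeline.
struct Deadline {
    std::chrono::steady_clock::time_point expiresAt;

    static Deadline in(std::chrono::milliseconds budget) {
        return Deadline{std::chrono::steady_clock::now() + budget};
    }
    std::chrono::milliseconds remaining() const {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            expiresAt - std::chrono::steady_clock::now());
    }
    bool expired() const { return remaining() <= std::chrono::milliseconds{0}; }
};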

And replaced the 23-processor pipeline with direct composition:

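Again, roughly in this spirit (names are illustrative; it reuses the Deadline type from the previous sketch):

#include <chrono>

struct Message { /* fields elided */ };

// Concrete steps, declared and defined in the same file in the real rewrite,
// so grep and a debugger can follow the flow end to end.
void validateSchema(const Message& msg);
void checkIdempotency(const Message& msg);
void callCounterpartyBank(Message& msg, std::chrono::milliseconds remaining);
void writeAuditLog(const Message& msg);

// Only the network call actually consumes the deadline; everything else
// finishes in microseconds.
void processSettlementMessage(Message& msg, Deadline deadline) {
    validateSchema(msg);
    checkIdempotency(msg);
    callCounterpartyBank(msg, deadline.remaining());
    writeAuditLog(msg);
}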

Four steps, one file. No container. The timeout is a concrete member: stack-allocated, inlineable, and you can understand the entire flow before your coffee gets cold 😀

Practical rules for C++ devs

Here's what the team pinned to their code review guidelines after the incident:

1. Invert at the edges, not in the guts. Interfaces belong where your code meets the outside world: database adapters, HTTP clients, message queues. Your timeout calculator does not need an ITimeoutCalculator. It needs to be readable.

2. One implementation = no interface. An interface with a single implementation is not abstraction. You can always extract an interface later, when you have a real second implementation. In C++, that refactoring takes an afternoon, not a sprint.

3. If grep can't follow a request through your system, you've over-inverted. If understanding "what happens when a message arrives" requires a debugger, a whiteboard, and 3 senior engineers - your architecture is not helping.

4. Use the language, not the pattern. std::variant + std::visit gives you type-safe polymorphism with value semantics. Templates give you compile-time decoupling. std::function gives you type-erased callbacks. Reach for these before reaching for class ISomething (see the sketch after this list).

5. Benchmark your abstractions. Every virtual call is a vtable lookup the compiler can't optimize away. Measure it. In a settlement engine, a 12% throughput loss is real money. In a game engine, it's dropped frames. Either way, it costs you.
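
To illustrate rule 4, a small sketch of decoupling without class ISomething (hypothetical names, for flavour only):

#include <functional>
#include <string>

// Compile-time decoupling: the transport is a template parameter, so there is
// no vtable between caller and callee and the call can be inlined.
template <typename Transport>
void sendReconciliation(Transport& transport, const std::string& payload) {
    transport.post("/reconcile", payload);
}

// Type-erased callback for the places where runtime flexibility is genuinely needed.
void onSettlementComplete(const std::function<void(const std::string&)>& callback) {
    callback("settled");
}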

My preferred resource: "Clean Architecture" by Robert C. Martin, where the SOLID principles are explained in depth.

The bottom line

Our industry keeps making the same mistake. Every few years there is a new trend, and teams adopt it without asking "do we actually need this here?"

In the 2000s it was design patterns - teams shoved factories, builders and visitors into code that needed a plain function. In the 2010s it was microservices - teams broke working apps into hundreds of tiny services that fell apart at the first network hiccup. Now it is clean architecture and SOLID everywhere - teams split a 50-line class into six interfaces because a principle told them to.

The mistake is always the same: we treat patterns as goals instead of tools.

And remember: when someone in a code review says "this breaks SOLID" (the Dependency Inversion Principle and so on), ask them one question: "What actual problem will that cause?" If the only answer is "it breaks the principle", your code is 100% fine.

The best engineers I have worked with do not reach for patterns first. They reach for clarity. And their systems do not wake people up at 2 AM.


