Debugging 90% of Production Issues: Ask What's Changed

A team I saw spent 3 days debugging intermittent 500 errors.

3 engineers. Reading code line by line. Going through every service, every endpoint.

Nobody found anything.

Then someone new joined the call and asked one question: "Did anything change in the last week?"

Turns out a config change had reduced a connection timeout from 30 seconds to 3 seconds. Writes to a slow external service were timing out, and the failures cascaded downstream (a minimal sketch of that failure mode is below).

The fix: revert one config value. 2 minutes.

3 days of code reading. 2 minutes of behavior observation.

This is the pattern I see over and over:

→ Teams debug by reading code
→ The bug isn't in the code
→ It's in a config, a dependency, a traffic pattern, an infrastructure change

90% of production issues trace back to something that changed, not something that was written wrong.

Before opening any file, ask: "What's different between when it worked and when it didn't?"

That one question is worth more than 10,000 lines of code review.

Skill #4 of 12 AI-proof engineering skills.

→ Follow for the full series.

—

#AIProofSkills #SoftwareEngineering #Debugging #ProductionDebugging #EngineeringLessons #SystemsThinking #Engineering #BuildInPublic
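For anyone who wants to see how small the culprit can be, here's a minimal sketch of the failure mode. It assumes a Python service calling the downstream dependency with the `requests` library; the URL, function names, and config values are illustrative, not the actual code from that incident:

```python
import requests

# Illustrative values only; the real incident's config is not shown here.
OLD_TIMEOUT_S = 30  # before the change: slow external writes finished in time
NEW_TIMEOUT_S = 3   # after the change: any write slower than 3s now fails

def write_to_external_service(payload: dict, timeout_s: float) -> bool:
    """Write to a slow downstream service (hypothetical endpoint)."""
    try:
        resp = requests.post(
            "https://external-service.example.com/write",  # placeholder URL
            json=payload,
            timeout=timeout_s,  # applies to connect and read
        )
        resp.raise_for_status()
        return True
    except requests.exceptions.Timeout:
        # With timeout_s=3, every write slower than 3 seconds lands here,
        # and the caller surfaces it as an intermittent 500.
        return False
```

Nothing in that function is wrong, and reading it line by line proves nothing. The bug is in the value flowing into `timeout_s`.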


The scary part: the config change was pushed by someone on a completely different team. Nobody on the debugging call even knew it happened.

This is why "what changed?" beats "what's broken?" as a debugging question. "Broken" assumes the code is wrong. "Changed" opens up the full picture.

What's the weirdest non-code bug you've ever found? 👇
