DevOps lessons learned: Production drift and the importance of consistency

“It worked in dev… and that’s exactly why it scared me.”

A few weeks ago, we had a release. Everything checked out:
- Same Docker image
- Same pipeline
- No risky changes

We had already tested it in dev and staging. No issues. So we pushed to production thinking this would be a non-event.

It wasn’t.

What started happening
Nothing broke immediately. Which, honestly, made it worse. After some time:
- A couple of APIs started timing out
- One service behaved… strangely (not failing, just inconsistent)
- Logs didn’t show anything obvious

At first, it felt like one of those “maybe it’ll settle” situations. It didn’t.

What confused us
We kept going back to the same thought: “But this exact setup worked in staging…”
Same image. Same configs (or so we thought). So why was production acting differently?

What we eventually found
After digging way deeper than expected, the issue wasn’t in the code at all. Production had quietly drifted:
- One environment variable was different
- A dependency version wasn’t exactly the same
- And someone (months ago) had patched something directly in prod

Nothing big individually. But together, it changed behavior. That’s what got us.

What we changed after that
We didn’t just fix the issue and move on. That would’ve been a mistake. We tightened a few things:
- Moved everything we could into Terraform
- Standardized deployments on Docker (no environment-specific builds)
- Cleaned up configs and started managing them properly, using Ansible for consistency

And the biggest one:
👉 No more direct changes in production. If it’s not in code, it doesn’t exist.

What stuck with me
I used to think: “If it works in staging, we’re safe.”
Now I think: “How sure are we that staging is actually the same as prod?”
Because most of the time… it isn’t.

#DevOps #Terraform #Docker #Ansible #InfrastructureAsCode #CloudEngineering #SRE #LearningInPublic #RealWorldDevOps
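
P.S. If you want to make the “how sure are we that staging matches prod?” question concrete, here is a minimal, hypothetical sketch of a drift check. It is not part of the tooling described above: it assumes you can export each environment’s config as simple KEY=VALUE files (the file names, and the script itself, are made up for illustration). The real fix is still keeping everything in code; this just helps you notice when an environment has wandered.

```python
#!/usr/bin/env python3
"""Hypothetical drift check: diff two KEY=VALUE config dumps.

Assumes each environment's config has been exported to a simple
.env-style file (e.g. staging.env and prod.env). The file names and
this whole script are illustrative, not part of the original setup.
"""
import sys


def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def report_drift(staging: dict[str, str], prod: dict[str, str]) -> int:
    """Print keys that are missing or differ between the two environments."""
    drift_count = 0
    for key in sorted(staging.keys() | prod.keys()):
        staging_value = staging.get(key)
        prod_value = prod.get(key)
        if staging_value != prod_value:
            drift_count += 1
            print(f"DRIFT {key}: staging={staging_value!r} prod={prod_value!r}")
    return drift_count


if __name__ == "__main__":
    # Illustrative usage: python drift_check.py staging.env prod.env
    staging_file, prod_file = sys.argv[1], sys.argv[2]
    drifted = report_drift(load_env_file(staging_file), load_env_file(prod_file))
    sys.exit(1 if drifted else 0)
```

Running something like this in CI (and failing the pipeline on drift) is one cheap way to catch the “someone patched prod months ago” class of surprise before a release does.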
