Most engineering teams I talk to can tell you exactly how many deployments they did last month. Almost none can tell you which tests actually ran across which clusters.

I keep seeing the same pattern: tests exist, tools exist, pipelines exist. But there's no single place to see what's passing, what's flaking, and what hasn't run in weeks.

One team I spoke with recently runs infrastructure tests across dozens of clusters. Their test results? Scattered across individual pipeline runs and one-off shell scripts. No historical aggregation. No confidence metrics.

They had synthetic health checks and rolled them back over too many false positives. Not because the tests were bad, but because there was no orchestration layer to separate signal from noise.

This is the part nobody talks about when they say "we have good test coverage." Coverage without visibility is just hope with extra steps.

How does your team track test health across clusters today?

#Kubernetes #PlatformEngineering #DevOps #Testing #CloudNative
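To make "orchestration layer" and "confidence metrics" concrete, here is a minimal sketch of the aggregation step, assuming results from every cluster already land in one store. The types and the 5%/95% flake band are illustrative starting points, not a standard.

```go
package testhealth

import "time"

// Result is one test execution reported from a cluster.
// The shape is hypothetical; adapt it to whatever your
// pipelines already emit.
type Result struct {
	Test    string
	Cluster string
	Passed  bool
	RanAt   time.Time
}

// Health is an aggregated view of a single test.
type Health struct {
	Runs     int
	PassRate float64
	Flaky    bool // neither reliably passing nor reliably failing
	Stale    bool // hasn't run recently anywhere
}

// Aggregate folds raw results into per-test health. A test is
// flagged flaky when its pass rate sits in the ambiguous middle
// band, and stale when its newest run is older than maxAge.
func Aggregate(results []Result, maxAge time.Duration, now time.Time) map[string]Health {
	passes := map[string]int{}
	runs := map[string]int{}
	latest := map[string]time.Time{}

	for _, r := range results {
		runs[r.Test]++
		if r.Passed {
			passes[r.Test]++
		}
		if r.RanAt.After(latest[r.Test]) {
			latest[r.Test] = r.RanAt
		}
	}

	health := make(map[string]Health, len(runs))
	for test, n := range runs {
		rate := float64(passes[test]) / float64(n)
		health[test] = Health{
			Runs:     n,
			PassRate: rate,
			Flaky:    rate > 0.05 && rate < 0.95,
			Stale:    now.Sub(latest[test]) > maxAge,
		}
	}
	return health
}
```

Even this much separates "failing" from "flaky" from "not running at all", which is the signal-versus-noise split the health checks above were missing.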
Debugging in production is not the same as debugging locally. Locally, everything is controlled. In production, you're dealing with timing, dependencies, incomplete data, and behavior you didn't anticipate. That's where most real problems show up.

#softwareengineering #devops #systemsdesign
Something I've seen multiple times while working on production systems: code that works perfectly in lower environments starts behaving differently in production. Not because the logic is wrong, but because real systems are far more complex than they appear.

Recently, while working on a production deployment, everything looked stable: CI/CD pipelines were clean, deployments were successful, no obvious errors. But once it went live:

• Unexpected latency started showing up
• Dependencies behaved differently
• Debugging took much longer than expected

The challenge isn't always the code itself. It's how that code interacts with everything around it: infrastructure, services, and scale.

This gap between "it works locally" and "it works in production" is something I keep seeing. Curious how others handle this in real-world systems: what's your approach when things behave differently in production?

#DevOps #CloudComputing #SoftwareEngineering #ProductionSystems #SystemDesign
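One habit that narrows this gap, as a sketch: record the same latency signal in every environment, so "it's slower in production" becomes a measurable comparison instead of a guess. This assumes the Prometheus Go client; the metric name, route, and port are illustrative.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of request latency, labeled by route. Recorded in
// dev, staging, and prod alike, so the environments can be
// compared on the same axis.
var latency = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Request latency by route.",
	Buckets: prometheus.DefBuckets,
}, []string{"route"})

// instrument wraps a handler and observes how long it took.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		latency.WithLabelValues(route).Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/orders", instrument("/orders", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```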
Work Insight

One thing I've learned recently: most production issues aren't "complex," they're misunderstood. Clear logs, better observability, and asking the right questions solve more problems than fancy solutions.

#DevOps #Debugging #EngineeringMindset
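"Clear logs" usually cashes out as structured logs with consistent context. A minimal sketch using Go's standard log/slog package (Go 1.21+); the field names and values are made up for illustration.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON output with consistent keys makes logs queryable
	// instead of merely grep-able.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Attach request context once; every line that follows
	// carries it, so a single ID ties the whole story together.
	reqLog := logger.With("request_id", "abc-123", "service", "checkout")

	reqLog.Info("payment authorized", "amount_cents", 4999)
	reqLog.Error("inventory lookup failed", "sku", "X-42", "attempt", 3)
}
```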
🚨 Production Reality: When Kubernetes Looks Healthy but Your App Is Down

One of the most confusing production situations is when everything in Kubernetes looks fine, but the application is still failing. The cluster is healthy, nodes are up, deployments are successful, and there are no obvious alerts. Yet users are facing errors, timeouts, or complete service disruption.

This usually happens because Kubernetes shows infrastructure health, not actual application behaviour. A pod can be in a "Running" state but still fail internally due to application bugs, dependency issues, or misconfigurations. For example, a service might be running but unable to connect to a database, or an API might be returning errors even though the container itself hasn't crashed.

In real production scenarios, this creates confusion. Engineers check dashboards and see everything green, but the system is clearly not working. This is where deeper investigation begins: looking into logs, tracing request flows, checking dependencies, and validating configurations across services.

A common example is when readiness or liveness probes are misconfigured. Kubernetes keeps restarting pods, or incorrectly marks them as healthy, leading to inconsistent behaviour. Similarly, network latency, DNS issues, or third-party API failures can break applications without affecting cluster-level metrics.

The key takeaway is that Kubernetes ensures orchestration, not the correctness of your application. Engineers need proper observability, including logs, metrics, and traces, to understand what is actually happening inside the system.

Production is not about what looks healthy. It's about what actually works.

#Kubernetes #DevOps #SRE #ProductionEngineering #CloudEngineering #Observability #IncidentResponse #PlatformEngineering #Reliability #Microservices
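To make the probe point concrete: a minimal sketch of a readiness endpoint that checks the database, so a pod that cannot reach its dependencies stops receiving traffic instead of sitting "Running" and green. The Postgres driver, DSN, and paths are assumptions for illustration; a readinessProbe pointed at /readyz wires it up.

```go
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

// readyz reports ready only when the database answers a ping.
// If the DB is unreachable, Kubernetes stops routing traffic to
// this pod even though the container is still "Running".
func readyz(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgres://app@db:5432/app?sslmode=disable")
	if err != nil {
		panic(err)
	}
	http.HandleFunc("/readyz", readyz(db))
	// Liveness stays dumb on purpose: it only says "the process
	// is alive" and never checks dependencies, otherwise a DB
	// outage turns into a pointless restart loop.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```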
"It Worked on My Machine" Is a Process Problem

We've all heard it. Maybe we've even said it. 😅 "It worked on my machine."

But that's rarely a code problem. It's a process problem. Development happens locally; production runs somewhere else. Different:

- OS versions
- Environment variables
- Database states
- Dependency versions
- Hardware resources

If environments aren't consistent, behavior won't be either. That's why mature teams invest in:

- Containerization (e.g., Docker)
- Environment parity (dev ≈ staging ≈ production)
- CI pipelines
- Automated tests
- Infrastructure as code

When systems are reproducible, excuses disappear. "It worked on my machine" usually means: we didn't standardize the environment.

Good engineering isn't just writing code. It's designing a process where the machine doesn't matter.

#SoftwareEngineering #DevOps #EnvironmentParity #SeniorDeveloper #EngineeringCulture
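One small, code-level companion to environment parity: make the environment contract explicit and fail fast at startup, rather than failing strangely at runtime. A sketch; the variable names are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// requireEnv makes the environment contract explicit: if a
// variable differs from what dev had, you find out at startup,
// not three requests into production traffic.
func requireEnv(keys ...string) (map[string]string, error) {
	cfg := make(map[string]string, len(keys))
	var missing []string
	for _, k := range keys {
		v, ok := os.LookupEnv(k)
		if !ok || v == "" {
			missing = append(missing, k)
			continue
		}
		cfg[k] = v
	}
	if len(missing) > 0 {
		return nil, fmt.Errorf("missing required env vars: %v", missing)
	}
	return cfg, nil
}

func main() {
	cfg, err := requireEnv("DATABASE_URL", "CACHE_ADDR", "LOG_LEVEL")
	if err != nil {
		log.Fatal(err) // crash loudly before serving any traffic
	}
	log.Printf("starting with log level %s", cfg["LOG_LEVEL"])
}
```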
🔥 Most performance tuning is a waste of time. Not because systems are complex, but because we optimize the wrong thing.

I've seen:
• Micro-optimizations with zero impact
• Days spent chasing the wrong bottleneck
• Real issues hiding in unexpected places

The real skill is not tuning; it's finding the bottleneck correctly. This video breaks down how to use:

→ Tracing
→ Metrics
→ Profiling

to identify the real problem.

🎥 Watch here → https://lnkd.in/dW8gY6_7

👉 What's the worst performance debugging experience you've had?

#PerformanceEngineering #Observability #Backend #devops
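For Go services, the cheapest first profiling step is the standard library's built-in profiler; a minimal sketch (the port choice is illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Serve the profiler on a separate, non-public port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application work would go here ...
	select {}
}
```

From there, `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` captures 30 seconds of CPU samples and points at the actual hot path, before any tuning starts.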
Every time a team says "performance regressed," one question should come first: which type?

Infrastructure regressions are traditional DevOps problems: high latency, dropped throughput. You fix them with monitoring and scaling.

Result-quality regressions show up as lower accuracy and more hallucinations, and they require eval frameworks: you fix them by testing changes across the retrieval pipeline.

Teams get stuck debugging latency when the real problem is chunking strategy. Or they overhaul retrieval when the issue is infrastructure capacity. Different scoreboards, different debugging approaches. Mixing them wastes time and ships broken products.

https://lnkd.in/g7SpqvsE
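A hedged sketch of where a result-quality scoreboard can start: recall@k over a fixed query set, tracked per release. The function and names are illustrative, not from any specific eval framework.

```go
package eval

// recallAtK measures what fraction of the gold documents for a
// query appear in the top-k retrieved results. Run against a
// fixed query set on every release, it catches chunking and
// retrieval regressions that latency dashboards never will.
func recallAtK(retrieved []string, gold map[string]bool, k int) float64 {
	if len(gold) == 0 {
		return 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}
	hits := 0
	for _, doc := range retrieved[:k] {
		if gold[doc] {
			hits++
		}
	}
	return float64(hits) / float64(len(gold))
}
```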
🚀 Phase 2 of RCA Operator is starting now.

Kubernetes gives you signals. But it doesn't give you incidents. So debugging becomes:
- chasing alerts
- scanning logs
- guessing root cause

It shouldn't be this hard. That's why I built RCA Operator, an open-source Kubernetes operator that:
→ correlates cluster signals
→ reduces alert noise
→ generates structured incidents

🔥 Now starting Phase 2, we're building:

🔗 Correlation Engine: logs + metrics + traces combined to pinpoint root cause
📊 Topology Visualization UI: see service dependencies and failure propagation
📡 OpenTelemetry Integration: native tracing + distributed context
🔎 Smarter RCA Insights: AI-powered root cause analysis & anomaly detection

The goal is simple: 👉 turn Kubernetes from "signals" into "clarity."

This is still early, but moving fast. Would love your feedback, ideas, or contributions 🙌

🌐 https://rca-operator.tech
⭐ https://lnkd.in/dXJmhpSN

Let's build this together 🚀

#Kubernetes #DevOps #SRE #OpenSource #CloudNative #Observability
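For readers wondering what "correlates cluster signals" means mechanically, here is a deliberately tiny sketch of the idea, not RCA Operator's actual code: bucket alerts for the same service inside a time window, so a burst of symptoms becomes one incident to investigate.

```go
package correlate

import (
	"sort"
	"time"
)

type Alert struct {
	Service string
	Name    string
	At      time.Time
}

type Incident struct {
	Service string
	Alerts  []Alert
}

// Group buckets alerts for the same service that fire within
// `window` of each other into one incident. A toy version of
// signal correlation: ten symptoms collapse into one incident.
func Group(alerts []Alert, window time.Duration) []Incident {
	sort.Slice(alerts, func(i, j int) bool { return alerts[i].At.Before(alerts[j].At) })

	open := map[string]*Incident{} // latest open incident per service
	var incidents []*Incident
	for _, a := range alerts {
		if inc, ok := open[a.Service]; ok &&
			a.At.Sub(inc.Alerts[len(inc.Alerts)-1].At) <= window {
			inc.Alerts = append(inc.Alerts, a)
			continue
		}
		inc := &Incident{Service: a.Service, Alerts: []Alert{a}}
		open[a.Service] = inc
		incidents = append(incidents, inc)
	}

	out := make([]Incident, len(incidents))
	for i, inc := range incidents {
		out[i] = *inc
	}
	return out
}
```

The real problem is of course harder (cross-service propagation, traces, topology), but the payoff is the same: fewer things to look at, each one richer.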
Your system doesn't break when it runs. It breaks when it changes.

A deployment goes out. Tests pass. Pipelines are green. Minutes later, latency spikes. A downstream service starts timing out. Retries kick in across the system. Nothing obvious failed. But something changed.

We've spent years optimizing systems for stability at runtime: auto-scaling, redundancy, failover. But the highest-risk moment in your system isn't when it's running. It's when you touch it.

Because modern deployments aren't simple updates. They're state changes across a distributed system. A config tweak here. A dependency update there. A schema change in another service. Each one safe in isolation. Together, unpredictable.

Here's where it breaks. Your system isn't a single unit. It's a network of assumptions:

– Service A expects a certain response format
– Service B assumes a timeout window
– Service C depends on ordering guarantees

A deployment doesn't just change code. It invalidates assumptions.

Here's the mechanism most teams miss: failures don't happen because deployments go wrong. They happen because dependencies react differently than expected. So even when your change is correct, the system around it isn't ready for it.

At 0xMetaLabs, we've seen deployments where nothing in the release was technically broken, but a small schema change caused downstream services to misinterpret data, triggering retries, timeouts, and eventually system-wide degradation.

The uncomfortable truth: you don't deploy into a system. You deploy into a web of hidden dependencies. CI/CD made deployments faster. It didn't make them safer.

The next evolution of reliability isn't faster pipelines. It's understanding what your system assumes before you change it. Because that's where most failures actually begin.

So here's the real question: when you deploy, are you testing your code or the assumptions your system depends on?

#DevOps #DistributedSystems #SiteReliabilityEngineering #EnterpriseArchitecture #CloudComputing #0xMetaLabs
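One concrete way to test assumptions rather than just code is a consumer contract test: the consumer pins the response fields it depends on, and the producer's pipeline runs the check on every change. A sketch with hypothetical field names:

```go
package contract

import (
	"encoding/json"
	"testing"
)

// The consumer (Service A) pins the exact fields it reads from
// the producer's order response. If a schema change in the
// producer drops or renames one, this fails in the producer's
// pipeline, before the assumption breaks in production.
func TestOrderResponseHonorsConsumerContract(t *testing.T) {
	// In a real setup this body would come from the producer's
	// handler or a recorded fixture; it's inlined here to keep
	// the sketch short.
	body := []byte(`{"id":"o-1","status":"paid","total_cents":4999}`)

	var raw map[string]json.RawMessage
	if err := json.Unmarshal(body, &raw); err != nil {
		t.Fatalf("response is not valid JSON: %v", err)
	}
	// Additive fields are fine; missing or renamed ones are not.
	for _, field := range []string{"id", "status", "total_cents"} {
		if _, ok := raw[field]; !ok {
			t.Errorf("contract broken: response missing %q", field)
		}
	}
}
```

The design point: the test lives with the consumer's assumptions but runs against the producer's changes, which is exactly the gap green pipelines usually miss.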
A compliance-heavy financial services team I worked with had a brutal testing loop. Every regression cycle: 2 days. Not 2 hours. Two full working days. Engineering would push a change Monday morning and find out if anything broke Wednesday afternoon.

They couldn't use SaaS testing platforms because data couldn't leave their environment. So they ran everything manually, on-prem, against a half-working staging cluster that drifted from prod every weekend.

After moving to Kubernetes-native test orchestration inside the cluster:

- Regression cycle: 4 hours
- Test coverage: 4x
- A transaction bug that would've hit production got caught on a Tuesday afternoon instead of during a Friday night outage

The math that killed the old setup wasn't the tooling cost. It was the engineer-hours reclaimed: roughly $150K/year of labor went back into building, not waiting.

If your regression cycle is measured in days, the bottleneck isn't your engineers. It's where your tests are running.

What does your regression cycle look like right now?

#Kubernetes #DevOps #TestAutomation #CloudNative #PlatformEngineering
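"Kubernetes-native test orchestration" at its simplest means launching the regression suite as a Job inside the cluster, next to the data that can't leave. A hedged client-go sketch; the namespace, image, and command are hypothetical.

```go
package main

import (
	"context"
	"log"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Runs with the pod's own service account: the tests stay
	// in-cluster, so regulated data never leaves the environment.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	backoff := int32(0) // a failed regression run should report, not retry blindly
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "regression-"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "tests",
						Image:   "registry.internal/regression-suite:latest", // hypothetical image
						Command: []string{"go", "test", "./...", "-count=1"},
					}},
				},
			},
		},
	}

	created, err := client.BatchV1().Jobs("qa").Create(context.Background(), job, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("launched regression run %s", created.Name)
}
```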