Cloud Infrastructure Challenges

Explore top LinkedIn content from expert professionals.

  • Deepak Agrawal

    Founder & CEO @ Infra360 | DevOps, FinOps & CloudOps Partner for FinTech, SaaS & Enterprises

    18,557 followers

    Over the last year, we helped 15+ companies cut their cloud bills by 30-40% in 45 days (without a single new tool). Here's what most cloud teams don't realize: ❌ You don't have a cost problem. ✅ You have a waste problem hidden in plain sight. We attacked the invisible waste buried deep in their Kubernetes clusters:

    1. Requests and Limits Were Set… and Forgotten. Developers set inflated CPU/memory limits "just in case" and never revisited them. We ran real-time profiling using Prometheus + Grafana and recalibrated limits based on actual sustained usage. This alone brought down cluster size by 15-20%.

    2. Non-Prod Environments Were Treated Like Production. Dev, QA, and Staging environments ran on on-demand instances 24/7. We moved them to spot instances with scheduled shutdowns during non-working hours. That delivered 18-22% savings instantly.

    3. Autoscalers Were Misconfigured or Just Idle. Most teams rely purely on CPU-based HPA, which reacts too late. We introduced custom scaling triggers based on business KPIs like request queue lengths, job backlogs, and latency. The result? Clusters scaled proactively, not reactively (a sketch follows after this post).

    4. Zombie Pods and Forgotten Resources Everywhere. One client had 300+ idle pods running outdated builds (nobody knew why). We implemented automated cleanup jobs using lifecycle policies and kubectl prune scripts. That reduced node count immediately.

    5. Vertical Pod Autoscaler (VPA) Wasn't Even Enabled. Once enabled, VPA handled unpredictable workloads far better than manual tuning. For stateful apps with variable patterns, this reduced over-provisioning by up to 25% while maintaining SLAs.

    6. Persistent Volume Claims (PVCs) Were a Black Hole. Storage costs were silently draining budgets. We audited PVC usage, downgraded unnecessary high-IOPS gp2 volumes to gp3, and cleaned up stale volumes. For one client, this alone saved over $30,000 annually.

    Before you buy another cloud cost management tool, ask yourself: have you really optimized what you already own? ♻️ Repost so others can learn.
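    For point 3, a minimal sketch of a non-CPU scaling trigger: an autoscaling/v2 HPA driven by a queue-length metric. It assumes a metrics adapter (for example Prometheus Adapter or KEDA) already exposes the metric through the external metrics API; the names, labels, and numbers below are placeholders, not values from these engagements.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-worker-queue-hpa      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-worker              # hypothetical worker deployment
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: queue_messages_ready    # assumed to be served by a metrics adapter
          selector:
            matchLabels:
              queue: orders
        target:
          type: AverageValue
          averageValue: "100"           # aim for roughly 100 queued messages per replica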

  • Umair Ahmad

    Senior Data & Technology Leader | Omni-Retail Commerce Architect | Digital Transformation & Growth Strategist | Leading High-Performance Teams, Driving Impact

    11,158 followers

    → Microservices Anti-Patterns That Quietly Break Modern Systems

    Most microservices failures do not begin with outages. They begin with design choices that look harmless at first, until scale exposes them.

    • Tightly coupled services: Boundaries are weak. Teams lose deployment independence. One change starts impacting everything else.
    • Distributed monolith: The system looks distributed on paper. In reality, services cannot evolve or deploy without depending on one another.
    • No API versioning: Even a small contract update can disrupt consumers. Backward compatibility protects trust across services.
    • Too many microservices: Over-splitting creates operational drag. More services do not always mean better architecture.
    • Ignoring data consistency: Without a clear consistency strategy, transactions become unreliable. This is where Sagas and eventual consistency matter.
    • Synchronous dependency chains: Too many blocking calls create fragile service flows. One slowdown can trigger cascading failures.
    • No fault isolation: A single failing component should not take down the rest of the platform. Isolation patterns improve resilience.
    • Chatty communication: Excessive service-to-service calls increase latency fast. Coarse-grained APIs and async messaging reduce noise.
    • Lack of observability: When logging, tracing, and metrics are weak, failures become harder to detect and fix.
    • Shared database: When multiple services use one database, ownership becomes blurry. Independent data boundaries preserve autonomy.
    • Hardcoded configuration: If every config change needs redeployment, agility suffers. Externalized configuration supports faster adaptation (see the sketch after this list).

    Microservices are powerful, but only when architecture decisions support clarity, resilience, and scale. Follow Umair Ahmad for more insights.
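    For the last point, one common way to externalize configuration on Kubernetes is a ConfigMap consumed by the Deployment. This is a hedged sketch with made-up names and keys; note that environment variables from a ConfigMap are only re-read on pod restart, while file-mounted ConfigMaps can update in place if the app re-reads them.

apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-config        # hypothetical service
data:
  PAYMENT_PROVIDER_URL: "https://payments.example.com"
  REQUEST_TIMEOUT_MS: "2000"

# Referenced from the Deployment's pod template, so a config change needs a
# rollout restart rather than a new image build:
#   containers:
#     - name: payment-service
#       envFrom:
#         - configMapRef:
#             name: payment-service-config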

  • Tom Le

    Unconventional Security Thinking | Follow me. It’s cheaper than therapy and twice as amusing.

    12,837 followers

    The internet wobbled today. A DNS issue in a single AWS region cascaded across otherwise "safe" regions and availability zones. This was not just another regional outage. It was a practical lesson in the cloud's hidden, centralized dependencies. We build for multi-region resilience, but we are often betrayed by "global" services that are not as distributed as they appear. The gap between perceived autonomy and actual entanglement is where resilience fails. My lessons learned from today's AWS outage:

    1. The Control Plane Chokepoint. AWS separates data planes (serving traffic) from control planes (the APIs managing resources). Many global control planes live in one region, often us-east-1. When that hub is impaired, your automation fails. You cannot scale, deploy, or modify resources, even in perfectly healthy regions.

    2. The Hidden Dependency Chain. The obvious risk is your application failing. The hidden risk is the failure of a core service you do not directly use. Today's DNS and networking issue rhymes with the 2020 Kinesis outage. A foundational service failed, and higher-level systems like Cognito, Lambda, and Auto Scaling began to error simply because they relied on it internally.

    3. The Myth of the "Island" Application. Even a perfect multi-AZ application is not an island. It must resolve DNS, fetch IAM tokens, pull container images, and push logs. These core functions often rely on shared, centralized services. When those services choke, your redundant application times out.

    History provides a classic intelligence analog. During WWII, Allied planners knew German communications were heavily encrypted. But they also knew most signals could only transit a few central relay stations. By targeting those nodes, they could blind the entire network without breaking a single code. The cloud's core services are these modern relay stations.

    We are not just choosing between regional availability and multi-region reliability. We are choosing between apparent distribution and actual fault isolation. The core principle is to understand your actual blast radius. A system is only as resilient as its most critical, least visible dependency. Today is a reminder that resilience is not an architectural diagram. It is the verified, tested ability to withstand the failure of a dependency you probably forgot you had.

  • Chandra Shekhar Joshi

    Crack FAANG+ Sr., Staff+, EM Behavioural, and System Design HLD interviews | DM me “COACH” | Engineering Manager @ Amazon | Engineering Career Coach | FAANG+ Interview Coach

    27,296 followers

    "We'll use events and CDC for loose coupling." This statement sounds good in a design doc. In reality, it's a top reason for "silent" production failures. An upstream system (Service A) produces data. A downstream system (Service B) needs that data. To avoid a "tight coupling," the engineer has Service B "listen" for changes from Service A. Maybe Service B uses Change Data Capture (CDC) to stream changes from Service A's database. Or it just consumes from a generic event log. Service A doesn't even know Service B exists. This feels like a win for loose coupling. It's actually a time bomb. And then reorg happens, team gets changed completely. The failure happens 3-6 months later. The team for Service A changes their data contract. They rename a field. They refactor the code and stop producing a specific event. Why? Because they forgot Service B was silently listening. The dependency was implicit. It wasn't obvious in their code. Service A's tests pass. They ship their change. Weeks later, Service B breaks. The data is corrupt. The system is down in production. The damage is done, and it takes days to trace and fix the problem, which happened due to a change made a month ago. That's why, stop relying on silent event streams for critical data flows. Use an explicit command instead. This doesn't mean it must be a synchronous API call. It can (and often should) still be an event. But Service A must explicitly publish a well-defined event. The code in Service A should literally say: event_publisher.send("OrderProcessed_v1", data) Now, the dependency is explicit. When the Service A team refactors their code, they see this line. They can't forget it. They are forced to think: "Who consumes OrderProcessed_v1? Oh, Service B. We are moving to v2, so we need to tell them." This conversation happens during development. Not during a production fire. Don't confuse "loose coupling" with "implicit dependencies." One is a good design goal. The other is a production incident waiting to happen. If you are preparing for mid-senior/staff SDE, EM system design HLD interviews, and need help, DM me COACH.

  • Aditya Jaiswal

    DevOps | Cloud | AI | Production Systems 235K+ @ DevOps Shack YT Mail → office@devopsshack.com

    67,459 followers

    Memory Leaks: A Real Production Problem

    Memory leaks are often seen as a developer issue, but in production they quickly become a DevOps incident.
    - Pods restart.
    - Nodes hit memory pressure.
    - Autoscaling increases costs.
    - Services become unstable.

    A memory leak happens when an application keeps allocating memory but never releases it. Over time, memory usage grows even when traffic remains constant.

    Example: Kubernetes. A microservice starts at 200Mi memory. After a few hours it reaches 900Mi without increased load. Result:
    - OOMKilled containers
    - CrashLoopBackOff
    - Unstable deployments
    The platform is rarely the problem. The application is holding memory longer than it should.

    Common causes. Node.js example:
    const cache = [];
    cache.push(data); // grows indefinitely
    Java example:
    static List<Object> store = new ArrayList<>();
    Unbounded collections, static objects, or unclosed resources prevent memory from being reclaimed.

    What DevOps Engineers Should Look For
    - Gradual memory growth in metrics
    - Increased pod restarts
    - Node MemoryPressure events
    - Rising RSS usage in Linux tools like top or htop
    Diagnosis requires trend analysis, not single snapshots (an alert-rule sketch follows below).

    Memory leaks are not just coding bugs. They are observability and architecture problems. Restarting pods hides the symptom but does not fix the cause.

    #DevOps #Kubernetes #Linux #CloudEngineering #SRE #Observability #Docker #SystemDesign
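    A minimal sketch of a trend-based alert as a Prometheus rules file (loadable via rule_files, or wrapped in a PrometheusRule if you run the Prometheus Operator). It assumes kubelet/cAdvisor metrics such as container_memory_working_set_bytes are already being scraped; the 6h window, 200Mi threshold, and selectors are illustrative and should be tuned per workload.

groups:
  - name: memory-leak-detection
    rules:
      - alert: ContainerMemorySteadyGrowth
        # Fire when working-set memory is well above where it was 6h ago
        # AND is still trending upward over the last hour: a trend, not a snapshot.
        expr: |
          (container_memory_working_set_bytes{container!=""}
            - container_memory_working_set_bytes{container!=""} offset 6h) > 200 * 1024 * 1024
          and
          deriv(container_memory_working_set_bytes{container!=""}[1h]) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Possible memory leak in {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "Working-set memory has grown steadily over the last 6 hours."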

  • Rihab SAKHRI

    Senior Software Developer | Back-End (Go, Python) | Microservices Architect | DevOps & Cloud Computing Advocate | SFC™

    13,382 followers

    How coupling kills scalability in microservice architecture

    Imagine joining a project with 6+ microservices… Everything looks clean and organized: each service has a name, a README file, even a Dockerfile. You think: "This is modern, scalable, resilient…" But the moment you try to scale just one service? Surprise. Service A calls B, B calls C, C calls D… each one waiting on the next, each one nested in retries. So you want to scale A? Guess what, you have to scale half the system with it.

    And that's when you realize: microservices aren't about how many services you have… They're about how independent they are from each other.

    What do you actually find?
    🔹 A system that's tightly coupled
    🔹 Blocking calls everywhere
    🔹 Latency coming from an endless chain
    🔹 Retry on top of retry = you're creating pressure on yourself

    So what do you do?
    🔸 Shield the clients with an API Gateway. No more clients talking to 6 services to complete one request. The gateway handles routing, auth, and caching; everything is centralized.
    🔸 Build an Orchestrator Service. Control service calls (order, conditions) and reduce the coupling as much as possible.
    🔸 Bring in async messaging with RabbitMQ/Kafka. No more waiting; each service sends a message and moves on. Example: instead of A calling B and waiting, A emits an event like "OrderPlaced" and B does its work only when the event arrives.
    🔸 Add observability with OpenTelemetry. No guessing. Trace everything, from the initial request to the final service. See latency, failures, and bottlenecks in real time (a minimal collector sketch follows below).

    Lessons learned:
    👉 Microservices that are tightly coupled = just a monolith in disguise.
    👉 Want to scale? You need decoupling.
    👉 Want resilience? You need async and observability.
    👉 Want control? You need clear abstraction and boundaries.

    Think about it: if every service is afraid to change, you're not in a distributed system. You're in distributed fear.

    #Microservices #DevOps #Architecture #Scalability #SystemDesign #SoftwareEngineering #CloudComputing #Async #Observability #TechTips
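    For the OpenTelemetry step, here is a minimal Collector configuration sketch. The endpoints and the tracing backend address are assumptions (any OTLP-compatible backend works), and the services themselves still need an OTel SDK to emit spans.

# otel-collector-config.yaml (illustrative values)
receivers:
  otlp:                         # services send traces here over OTLP
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}                     # batch spans before export to cut overhead
exporters:
  otlp:
    endpoint: tracing-backend.observability.svc:4317   # hypothetical backend address
    tls:
      insecure: true            # acceptable for an in-cluster demo, not for production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]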

  • Akum Blaise Acha

    Senior DevOps & Platform Engineer | AWS, Docker & Kubernetes Expert | 6+ Years Designing Scalable, Reliable, Cost-Efficient Cloud Systems | Mentor & Newsletter Creator for 1500+ Engineers

    4,009 followers

    A pod in your Kubernetes cluster is eating memory. It started at 200MB at deploy time. It's now at 1.8GB and climbing. No memory limit was set in the deployment. Other pods on the same node are getting OOMKilled. What do you do as an immediate fix? And what do you change so this never happens again? I have experienced this before while working as a DevOps Engineer. Here's what I learned.

    The immediate fix is not to kill the pod. Your first instinct is to delete it. Don't. If there's a deployment behind it, Kubernetes will restart it immediately and the memory leak starts all over again. You've bought yourself 20 minutes before you're back in the same situation.

    Instead, cordon the node first. This tells Kubernetes to stop scheduling new pods on that node. The damage is now contained. No new victims. Then set a memory limit on the deployment and redeploy. Even a generous limit like 512MB is better than no limit. The pod will get OOMKilled when it crosses 512MB instead of eating 1.8GB and starving everything around it. The leak still exists, but now it has a ceiling.

    After that, check the other pods that were OOMKilled. They didn't die because of their own problems. They died because your leaking pod stole their memory. Kubernetes kills the pods it considers least important when the node runs out of memory. Your perfectly healthy services got evicted because one pod had no manners.

    Now the real work: making sure this never happens again. Every pod in your cluster needs resource requests and limits. Every single one. No exceptions. A pod without memory limits is a pod that can consume the entire node. It's not a question of if. It's when.

    Enforce this with admission controllers. Use OPA Gatekeeper or Kyverno to reject any deployment that doesn't include resource limits. Don't rely on code reviews to catch this. Humans miss things. Policy engines don't. (A policy sketch follows below.)

    Add monitoring on container memory trends. Not just current usage, the trend. A pod sitting at 400MB is fine. A pod that was at 200MB yesterday and is at 400MB today and will be at 800MB tomorrow is a leak. Alert on the rate of change, not just the threshold.

    Set up namespace-level ResourceQuotas. Even if one team forgets limits on a pod, the namespace itself has a ceiling. One team's leak can't consume the entire cluster.

    And finally, fix the actual memory leak. Profile the application. Check for unclosed connections, growing caches, event listeners that never get cleaned up. The infrastructure guardrails keep you alive, but the application code is where the real fix lives.

    Systems without boundaries will always consume everything available to them. Your job isn't just to fix incidents. It's to make sure the environment enforces good behavior even when humans forget. How would you handle this?

    #kubernetes #devops #platformengineering #sitereliability #cloudinfrastructure #systemdesign #containerorchestration
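    A sketch of what that enforcement could look like: a Kyverno ClusterPolicy that rejects pods missing CPU/memory requests and a memory limit, plus a namespace-level ResourceQuota as the backstop. The policy name, namespace, and quota numbers are assumptions to adapt, not drop-in values.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce    # reject non-compliant pods instead of only auditing
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    memory: "?*"
                    cpu: "?*"
                  limits:
                    memory: "?*"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota                  # hypothetical team namespace
  namespace: team-a
spec:
  hard:                               # ceiling for the whole namespace, even if a pod slips through
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi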

  • Vijay Kumar Anuganti

    ∞ Simplifying DevOps | 🤝 Helping Freshers | ☸️ Turning Outages Into Uptime | 🎯 Devops Tools | 🎖️Top Devops Voice.

    13,258 followers

    🕒 2:13 AM Alert: "CPU throttling detected on all nodes." We were under attack, or so we thought. Pods were crashing randomly. Services were flaky. CPU metrics were through the roof. Traffic was normal. No spike. No malicious activity. Yet our production cluster was choking.

    🔍 SREs jumped in. "Autoscaler isn't working," one said. "Node CPU is 95%," another pointed out. "Pods are hitting resource limits," DevOps chimed in. Everyone was looking at the symptoms, not the root cause. We added more nodes. Same issue. Added bigger nodes. Still throttling. Something didn't add up.

    😓 Teams started questioning each other. Infra blamed the app. The app team blamed resource limits. The platform team blamed the kubelet. I paused. Opened one of the crashing pods' YAML files. Then another. Then another. Same pattern:

    resources:
      limits:
        cpu: "200m"

    💡 That's when it hit me. The Helm chart's default values had overridden the resource settings for ALL production pods. Our app containers were running with dev limits in prod. Pods weren't allowed to use more than 200m CPU, even when they needed 2 cores. The node had plenty of CPU, but the containers were choking themselves.

    🔧 We patched the Helm release with correct values (a sketch follows below) and restarted the pods. Boom, stable in 2 minutes.

    Lessons learned:
    ✅ Always audit Helm defaults before promoting to production
    ✅ CPU throttling ≠ actual CPU usage issues
    ✅ Blame doesn't solve anything; YAML does

    🚀 Kubernetes teaches you one thing: production isn't just about running pods. It's about understanding what's running, and why it's failing.

    💬 Have you seen a weird production issue like this? 👇 Let me know in the comments.

    #kubernetes #production #issues #sre #devops #alerts #pods #services #helm
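    A hedged sketch of that kind of fix: a production values file that overrides the chart's dev-sized defaults, applied with helm upgrade. The release name, chart path, and numbers are hypothetical; the point is that prod resource settings come from an explicit values file instead of the chart defaults.

# values-prod.yaml (hypothetical)
resources:
  requests:
    cpu: "1"              # reserve a full core per pod for scheduling
    memory: "1Gi"
  limits:
    cpu: "2"              # allow bursts up to 2 cores instead of throttling at 200m
    memory: "2Gi"

# Applied with (illustrative release and chart names):
#   helm upgrade my-app ./charts/my-app -n production -f values-prod.yaml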

  • DeVaris Brown

    Thinker. Builder. Hustler. Investor.

    15,891 followers

    Amazon's official postmortem, https://lnkd.in/giuwe2VY, on the us-east-1 outage reads like a deep dive into how fragile control planes can be when automation meets scale. A single race condition in DynamoDB's DNS automation spiraled into EC2 launch failures, Lambda throttling, and NLB health-check chaos cascading across nearly every dependent service. It's not that AWS is unreliable; it's that our architectures are too tightly coupled to vendor control planes. When DNS, orchestration, and routing all depend on the same automation layer, "multi-AZ" isn't the same as resilient.

    Here are a few lessons from the postmortem worth carrying forward 👇
    • DNS isn't invincible. Independent DNS and health checks give you a fallback path when a provider's control plane falters.
    • Compute orchestration can fail noisily. Design workloads to recover even when new capacity can't launch.
    • Health checks can amplify failure. Add damping, delay, and edge-managed routing to avoid self-inflicted downtime.
    • Dependencies cascade. The weakest link isn't your app; it's the invisible systems you assume "just work."

    This outage reminded us that resilience isn't just about redundancy; it's about decoupling. I pulled together an Engineering Checklist for Control-Plane Resilience to help teams assess how exposed they are to these same failure patterns. Comment AWS and I'll share the doc.

  • Prabhat Sharma

    Founder @ OpenObserve | Open source Observability | Helping engineering teams scale observability without the data tax | Cloud Native & Container Specialist

    8,966 followers

    The pods were OOMing, and the engineering team was adamant: "We didn't change a thing."

    This was back during my time at AWS. A customer's production was effectively halted, stuck in a restart loop. I hopped on a call with the customer's engineering and infra teams. The problem with these incidents is the constraint of time. You can't learn a stranger's complex application logic in 60 minutes. It's impossible to debug the code effectively without deep domain knowledge, and we didn't have the luxury of time.

    But as an architect, you don't always need to fix the code to stop the bleeding. You just need to control the physics of the infrastructure. I stopped trying to understand why the app was crashing and looked at how it was deployed. When I checked the manifest: no resource requests. No limits. ⚠️ The Kubernetes scheduler was flying blind. It was placing memory-hungry pods on nodes that couldn't handle the unexpected spikes, causing cascading failures across the cluster.

    I told the team: "I don't know the specific bug causing this memory pressure, and I can't fix that right now. But I can make sure the infrastructure survives it so you have time to debug."

    We implemented two changes immediately (a sketch follows below):
    1. Set hard requests and limits.
    2. Enabled the Horizontal Pod Autoscaler (HPA).

    The effect was immediate. Instead of crashing the nodes or starving neighbors, the individual pods were constrained. When load spiked, HPA spun up more replicas rather than letting a single instance bloat until it died. 🛡️ The system stabilized. The bleeding stopped.

    Did this burn more compute? Absolutely. The bill went up because we threw infrastructure at an application inefficiency. But that extra cost was the price of survival. A few days later, a box of chocolates showed up at the office, sent directly from the CEO.

    The lesson here isn't that K8s configuration is magic. It's that good architecture buys you time. Resilience isn't about writing bug-free code; it's about building a system that can survive the bugs you inevitably write. 🏗️

    #Kubernetes #SRE #AWS #SystemDesign #OpenObserve
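    A minimal sketch of those two changes for a generic deployment; the name web-api, the sizes, and the thresholds are illustrative, not the values from that incident.

# 1) Container resources inside the Deployment's pod template
#    (spec.template.spec.containers[].resources):
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"       # the scheduler now knows what each pod needs
  limits:
    cpu: "1"
    memory: "1Gi"         # a leaking replica is OOMKilled here instead of starving the node
---
# 2) HPA so load adds replicas instead of bloating a single pod:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                     # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70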
