What Happens When “Always On” Goes Off...

Yesterday’s AWS outage was a quiet but powerful reminder of how fragile our digital world really is. Between 12:19 PM on Oct 20 and 3:31 AM on Oct 21 (IST), AWS’s US-EAST-1 region, the backbone for thousands of global applications, went through hours of instability.

The trigger? A DNS resolution issue in DynamoDB, one of AWS’s core services. In simpler terms, AWS’s own systems temporarily lost the ability to “find” DynamoDB. Because so many other services rely on it, from EC2 and Lambda to CloudWatch and IAM, the disruption cascaded rapidly across the ecosystem.

The Chain Reaction
Once DynamoDB became unreachable:
- EC2 couldn’t launch new instances.
- Network Load Balancers started failing health checks.
- Lambda, CloudWatch, and other services dependent on them began slowing down or dropping requests.

AWS fixed the core DNS issue by 2:54 PM IST, but full stability didn’t return until 3:31 AM the next day. For many organizations, that meant hours of degraded performance, failed deployments, and broken dashboards.

The Real Lesson
This wasn’t just a glitch; it was a lesson in dependency risk. A single point of failure in one service, something as fundamental as DNS, created ripple effects across the most advanced cloud infrastructure in the world. It highlights how deeply intertwined AWS services are, and how organizations may be far more dependent on one region or one core service than they realize.

What We Can Learn
If your workloads depend heavily on AWS (especially US-EAST-1):
- Build for failure. Assume every service can break at some point, and design your systems to degrade gracefully (see the sketch after this post).
- Go multi-region, not just multi-AZ. Redundancy must extend beyond local failover.
- Map your hidden dependencies. Many systems depend indirectly on DynamoDB, IAM, or S3 even if you don’t use them directly.

A Final Thought
We often talk about “the cloud” as if it’s limitless and infallible. But outages like this remind us that behind every cloud service are complex systems, and humans, doing their best to keep it all running.

Resilience isn’t something AWS gives us. It’s something we have to design, test, and take ownership of.

Thoughts?
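To make the “degrade gracefully” point concrete, here is a minimal sketch, in Python, of a circuit breaker wrapped around a flaky dependency such as a DynamoDB lookup. The threshold, timeout, and the wrapped call are illustrative assumptions, not anything AWS prescribes.

```python
import time

class CircuitBreaker:
    """Fail fast when a downstream dependency keeps erroring,
    instead of letting every request wait on a dead service."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, fallback=None, **kwargs):
        # While the circuit is open, skip the call and return the fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback

# Hypothetical usage: wrap a DynamoDB lookup and fall back to cached or default data.
# breaker = CircuitBreaker()
# profile = breaker.call(lambda: table.get_item(Key={"id": user_id}), fallback=CACHED_PROFILE)
```

The point is not the specific numbers but the behavior: when the dependency is clearly down, callers get a degraded answer immediately rather than piling up timeouts.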
Understanding the Effects of Cloud Outages
Explore top LinkedIn content from expert professionals.
Summary
Understanding the effects of cloud outages means recognizing how disruptions in cloud services—from providers like AWS or Cloudflare—can quickly impact global connectivity, business operations, and everyday apps. These outages often stem from technical failures or internal misconfigurations, revealing how interconnected, and sometimes fragile, the cloud infrastructure we rely on truly is.
- Plan for disruption: Build your systems with backup options and test disaster recovery strategies so you’re prepared if a cloud provider goes offline.
- Analyze dependencies: Regularly review which parts of your operations depend on specific cloud services and identify single points of failure that could cause wider outages.
- Communicate transparently: Share updates and explanations with stakeholders when outages happen, building trust and keeping everyone informed about recovery progress.
What happened with Cloudflare and what can we learn from this?

On 18 November 2025, Cloudflare, one of the most critical web-infrastructure providers, suffered a major global outage.

1. The root cause was a configuration bug: a change in database permissions caused a “feature file” used by Cloudflare’s Bot Management system to double in size.
2. This oversized file crashed core proxy software, triggering widespread HTTP 5xx errors across the network.
3. Services like X (formerly Twitter), ChatGPT, Canva, and even public systems like NJ Transit were affected.
4. Cloudflare identified the problem, rolled back to a safe configuration, and fully restored services by ~17:06 UTC.
5. Importantly: this was not a cyber attack. No malicious activity was found.

💡 What can we learn from this?

1. Dependency risk is real. When so much of the internet (or your own stack) depends on a single provider like Cloudflare, their outages cascade across countless services. This is a powerful reminder to architect with redundancy, fallback systems, or multi-provider strategies.
2. Internal config changes are high risk. The outage wasn't caused by an external threat but by an internal configuration change. Even “trusted” pipelines (like automated config systems) can lead to massive failure. This teaches us to treat internal config files with the same rigor as user input: validate size, schema, and limits before widespread deployment (a sketch of this follows the post).
3. Robust rollback and kill-switch mechanisms are essential. Cloudflare’s fix involved stopping propagation of the bad file, rolling back to a known good version, and restarting services. For high-availability systems, bring your rollback and global kill-switch strategies to the table early; make them first-class citizens.
4. Transparent incident communication matters. Cloudflare didn’t shy away: they communicated what went wrong, how they fixed it, and what they’ll do to prevent it again. That builds trust. As engineering and leadership teams, we should commit to similar transparency when things go wrong.
5. Design for failure, always. No matter how mature the system, failures will happen. What differentiates resilient teams is how they respond: how fast they detect, diagnose, and mitigate. This means investing in good observability, chaos testing, and incident response playbooks.

This outage is a great reminder: even foundational, “trusted” infrastructure can fail in unexpected ways. As builders, we must constantly question assumptions, design for redundancy, and prioritize resilience.

#CloudflareOutage
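A minimal sketch of the “treat config like user input” lesson, in Python. The JSON file layout, the size ceiling, and the 200-entry limit are assumptions for illustration (the 200 figure mirrors the number reported for the Bot Management system, not a setting you can look up):

```python
import json

MAX_BYTES = 1_000_000    # hypothetical size ceiling for the generated file
MAX_FEATURES = 200       # hypothetical count limit, mirroring the figure in the posts

def validate_feature_file(path: str) -> list:
    """Run basic size/schema/limit checks before a config file is propagated anywhere."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) > MAX_BYTES:
        raise ValueError(f"feature file is {len(raw)} bytes, over the {MAX_BYTES} byte ceiling")
    features = json.loads(raw)
    if not isinstance(features, list):
        raise ValueError("feature file must be a JSON list")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds the limit of {MAX_FEATURES}")
    return features

# In a deployment pipeline this gates propagation:
# features = validate_feature_file("bot_features.json")  # raises before a bad file ships
```

The idea is simply that generated files fail the pipeline loudly instead of being shipped worldwide and failing in production.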
-
Half the global internet went dark on Monday, and billions vanished in a single day... all due to an invisible glitch inside a data center in Virginia.

If you couldn’t access Snap Inc., Canva, Zoom, Fortnite, Venmo, Reddit, Inc. or even your Alexa on Monday, you weren’t alone. What looked like a “random glitch” was actually a massive Amazon Web Services (AWS) outage that quietly took down half the internet for hours. I've spent 14+ years in data analytics and cloud ecosystems, and I found this incident alarming. It’s a reminder of how deeply centralized our digital world has become.

Here’s what really happened: at around 3 AM ET, Amazon Web Services’ US-EAST-1 region, its most critical hub in Virginia (also known as “Data Center Alley”), suffered a major disruption. The culprit was a failure in an internal subsystem that monitors the health of network load balancers, followed by errors in the DNS system.

Now, DNS might sound technical, but think of it as the internet’s address book: it translates easy-to-remember URLs into IP addresses. When that breaks, your browser simply can’t “find” the website (a tiny illustration of that lookup step follows this post). And because AWS powers over 30% of the global cloud market, the ripple effects were instant and global:
- Social media apps like Snapchat, Reddit, and Facebook went down.
- Banking and airline systems froze mid-operation.
- Even government sites like the UK’s HMRC and GOV.UK were impacted.
- Businesses worldwide lost millions every passing hour.

To put it in perspective, some reports suggest global platforms lose about $75 million per hour when they go down. For AWS, whose clients range from startups to Fortune 500 giants, the economic ripple could run into hundreds of billions.

So why does this matter beyond a temporary glitch? I believe this outage shows how fragile our digital interdependence has become. A single region’s misconfiguration can disrupt global connectivity in minutes. For those of us in data and cloud engineering, it’s a wake-up call:
- Redundancy isn’t optional anymore; it’s survival.
- Dependency on one cloud provider (even AWS) is a business risk.
- Observability and failover design must evolve beyond reactive monitoring.

If you rely on cloud platforms like AWS, Microsoft Azure, or GCP, this outage was a lesson. What’s your biggest takeaway from it?

#startups #internet #infrastructure #cloudcomputing #data
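For readers who want to see the “address book” step in code, here is a tiny Python illustration of a DNS lookup and what a resolution failure looks like to a client. The hostname is the public DynamoDB endpoint used purely as an example; nothing here reproduces the outage itself.

```python
import socket

def resolve(hostname: str) -> list:
    """Look up the IP addresses behind a hostname: the 'address book' step."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as err:
        # Roughly what clients experience during a DNS failure: the service may be
        # running, but its name cannot be resolved, so no connection is even attempted.
        print(f"DNS resolution failed for {hostname}: {err}")
        return []

print(resolve("dynamodb.us-east-1.amazonaws.com"))  # public endpoint, used as an example
```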
-
🚨 Cloudflare Outage - Quick Technical Breakdown for DevOps Engineers ⚙️🌐

Today’s global outage affecting major platforms (X, ChatGPT, etc.) was initially attributed to an unexpected traffic spike that overloaded parts of Cloudflare’s network while some datacenters were under maintenance, leading to widespread 5xx errors.

🔍 What happened technically:
• A sudden traffic surge overloaded Cloudflare’s edge and routing layers.
• Internal service degradation spread across regions.
• Maintenance in multiple datacenters reduced available capacity.
• A fix was deployed, but propagation delays caused lingering issues.

💡 Key lessons for DevOps / Cloud engineers:
• Avoid relying on a single CDN or DNS provider.
• Implement multi-region and multi-vendor failover.
• Monitor anomalies: traffic spikes, DNS latency, error-rate increases.
• Test disaster recovery and chaos scenarios regularly.
• Build and maintain recovery servers / backup infrastructure that can take over automatically during provider outages (a small failover sketch follows this post).

Even the biggest providers can fail; resilience is something you design, not something you hope for. 💪🌍

#DevOps #Cloud #SRE #Cloudflare #Outage #Engineering #DisasterRecovery
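A minimal sketch of the “backup infrastructure that can take over automatically” idea, assuming two hypothetical endpoints: a primary behind one provider and a backup on a second provider or your own origin. The URLs are placeholders.

```python
import urllib.error
import urllib.request

# Placeholder endpoints: primary behind provider A, backup on provider B or your own origin.
ENDPOINTS = [
    "https://api.example.com/health",
    "https://origin-backup.example.com/health",
]

def fetch_with_failover(urls, timeout=3):
    """Return the first successful response body, moving on when an endpoint errors out."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as err:
            # URLError also covers HTTP 5xx responses (HTTPError is a subclass).
            last_error = err
    raise RuntimeError(f"all endpoints failed, last error: {last_error}")

# body = fetch_with_failover(ENDPOINTS)
```

In practice the same pattern usually lives in DNS or a load balancer rather than application code, but the ordering-plus-health-check logic is the same.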
-
⚠️ AWS Outage Analysis: Why Redundancy Matters More Than Ever 🤔

🗺️ Region Affected: US-EAST-1

On October 20, 2025, starting at 12:11 AM PT, a major AWS outage brought down hundreds of apps and services worldwide, from gaming platforms to banking apps to smart home devices. Let’s break down what actually went wrong 👇

1️⃣ It started with a DNS issue: AWS services couldn’t properly resolve domain names for the DynamoDB API endpoint, preventing proper communication between services.

2️⃣ Because of that, DynamoDB (the AWS database service) started experiencing significant error rates. Since countless AWS tools and customer applications depend on DynamoDB, they too began to break down.

3️⃣ Soon, EC2 instances (virtual machines) and Lambda functions (serverless code) couldn’t launch or execute properly. The problem originated within EC2’s internal network, creating a chain reaction across multiple dependent services.

4️⃣ AWS identified the root cause around 2:01 AM PT and began mitigation. By 3:35 AM PT, the DNS issue was initially resolved. At 6:35 AM ET, AWS declared the database problem “fully mitigated.” However, services continued to experience intermittent issues and multiple waves of disruption throughout the day as systems struggled to fully recover and resynchronize.

⚠️ Current Status: As of October 21, many services are still reporting ongoing issues. AWS is continuing to work on full restoration, with EC2 instance launches gradually recovering across availability zones. Some users are experiencing a second wave of outages.

💡 Key takeaway: This incident demonstrates how a DNS resolution problem in one critical region can trigger prolonged cascading failures across the entire internet. Even after AWS declared the issue “fully mitigated,” the recovery process itself caused secondary waves of disruption, affecting everything from your morning Alexa alarm to global banking systems. This reveals the fragility of our centralized cloud infrastructure, where recovery from a single point of failure can be as disruptive as the initial outage itself.

📊 Impact Highlights:
• Actual duration: 20+ hours and counting, with multiple waves of disruption
• Services affected: Fortnite, Snapchat, Signal, Amazon, Ring, Alexa, Coinbase, Venmo, United Airlines, Reddit, Duolingo, Starbucks app, Hinge, Canvas (education platform), and hundreds more
• Current issues: EC2 instance launches still problematic; intermittent connectivity issues across many services
• Scope: Global impact despite being a regional issue in Northern Virginia

🔗 This serves as a reminder for businesses to implement multi-region redundancy and disaster recovery strategies (a minimal multi-region fallback sketch follows this post).

#AWS #CloudComputing #DevOps #Infrastructure #SiteReliability #Downtime #AWSOutage #TechUpdate #DNS #DynamoDB #CloudOutage
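One way to act on the multi-region takeaway is a read path that falls back to a replica region. The sketch below assumes a hypothetical DynamoDB Global Table named "orders" replicated to us-west-2, with boto3 installed and credentials configured; it is an illustration, not AWS’s recommended pattern.

```python
import boto3  # requires boto3 installed and AWS credentials configured
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]   # primary first, replica second (hypothetical)
TABLE_NAME = "orders"                  # hypothetical Global Table replicated to both regions

def get_item_with_fallback(key: dict):
    """Try the primary region, then fall back to the replica if calls fail or time out."""
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource(
                "dynamodb",
                region_name=region,
                config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
            ).Table(TABLE_NAME)
            return table.get_item(Key=key).get("Item")
        except (BotoCoreError, ClientError) as err:
            last_error = err  # region unreachable or erroring; try the next one
    raise RuntimeError(f"all regions failed: {last_error}")

# item = get_item_with_fallback({"order_id": "1234"})
```

Writes are harder (consistency, conflict resolution), which is exactly why multi-region design has to be architected up front rather than bolted on during an incident.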
-
The AWS Outage Every CISO Should Be Talking About

On October 20, 2025, Amazon Web Services suffered a major disruption in its US-East-1 region that rippled across the global internet. The incident disrupted over 2,500 organizations and exposed widespread dependency on a single region for DNS, database, and authentication services. This wasn’t a cyberattack, but the outcome mirrored one. Operations stalled, dashboards went dark, and users around the world were locked out of mission-critical systems. Businesses assuming the cloud guaranteed resilience received a sharp reminder: convenience doesn’t guarantee continuity.

What Actually Happened
Investigations show that a centralized DNS and DynamoDB failure chain triggered cascading outages across AWS’s identity and control layers. Within minutes, services supporting financial platforms, collaboration tools, and enterprise apps failed. Critical platforms like Snapchat, Coinbase, Atlassian, and government systems such as HMRC were affected. Outages spread not because data was compromised, but because shared configurations and dependencies were not regionally isolated.

Lessons for CISOs

1. Resilience Is Executive-Driven
Resilience can no longer live exclusively within IT. It sits at the intersection of cybersecurity, risk, and business continuity. Boards and CISOs should establish resilience KPIs reflecting real recovery time, not just uptime percentages. Live failure simulations are essential; automation alone is not enough.

2. Treat Multi-Cloud as a Security Control
Cloud diversity is essential for survival. CISOs must ensure alternate DNS, region isolation, and identity redundancy are architected into design, not deferred to vendor defaults.

3. Understand AI’s Hidden Pressure on Cloud
Hyperscalers expanding to support AI workloads face unprecedented traffic and complex dependencies. Analysts expect more frequent service-level disruptions as AI data demands surge. Continuity plans must account for AI workload impacts.

4. Enterprise Autonomy Is Making a Comeback
Hybrid and repatriated architectures are gaining interest due to sovereignty, compliance, and autonomy needs. Storing critical data and identity functions outside hyperscalers is a resilience strategy, not just a cost decision.

The Boardroom Takeaway
The AWS outage was a warning. Incidents will come not only from attacks but from complexity. Boards should ask:
- Have we mapped cloud dependencies by region and service?
- Are our authentication and DNS systems isolated from the same failure chain?
- Could we maintain core operations for four hours without our primary region?

Survival hinges on planning for inevitable provider failures, not just hoping for uptime.

#CISO #CyberResilience #AWS #BusinessContinuity #CloudSecurity #RiskManagement #DigitalInfrastructure #AWSOutage #BoardGovernance #CloudStrategy #Cybersecurity
-
Yesterday’s AWS outage didn’t just crash websites, it crashed assumptions. Millions of people suddenly learned that “the cloud” is really a handful of buildings on the East Coast. When one sneezes, the whole digital world catches a cold.

For lawyers, business leaders, and anyone who signs vendor contracts, this is more than tech drama. Outages reveal in real time what your agreements are actually worth. Here are three truths I was reminded of yesterday:

1. SLAs aren’t safety nets. A service credit doesn’t restore your customers’ trust.
2. Redundancy isn’t magic. “Backups” don’t mean continuity unless it’s architected and contracted that way.
3. Communication is a legal obligation. Vague “reasonable efforts” language won’t cut it when your users can’t log in.

Contracts aren’t just paperwork. They are the infrastructure of trust. Contract intelligence and analytics can help make your contracts more resilient and turn these lessons into a few actionable insights before the next outage hits.

--------
Olga V. Mack
Building trust and creating new categories at the intersection of contract intelligence, commerce, and AI. Let’s shape the future together.
-
Today's outage is a great reminder of the most dangerous mentality that pervades our digital world: the belief in the too-big-to-fail cloud provider or centralized service.

As such, we said bye-bye to:
😵 Canva
😵 Coinbase
😵 Substack
+ many more...

Today, a significant chunk of the internet, from crypto exchanges to creative platforms, slowed or stopped because of a single point of failure within a major cloud system. 🐘

This event isn't an anomaly; it's a recurring alarm that both investors and entrepreneurs must stop snoozing. The consolidation of the internet onto a handful of hyperscale cloud providers has created unprecedented efficiency and scale. However, it has also created a critical single point of failure, hiding enormous risk under a veneer of convenience. Let's break this down into something more tangible!

For Investors:
🪙 Systemic risk: A company that relies solely on one centralized infrastructure for its entire operation is subject to systemic, unmitigable risk. A 3-hour AWS outage can erase millions in value and severely damage brand trust.
🪙 Due diligence must evolve: Beyond reviewing financials and market share, investor due diligence now requires a deep dive into a company's operational resilience. Ask: Where do you run your core services? What is your failover strategy? How quickly can you move and rebuild?

For Entrepreneurs:
🪙 The cost of convenience: Building on a single big cloud is fast, but it compromises your operational independence. You are essentially renting a dependency.
🪙 Reputation is resilience: In a competitive landscape, your users will forgive occasional technical difficulties, but they will not forgive a total shutdown due to poor planning. Operational resilience is now a core part of your customer value proposition and brand integrity.

For the record 💡 a material workload is any application or service whose failure would cripple your business (e.g., core transaction database, user authentication, or primary website). These workloads require a strategic approach that rejects the too-big-to-fail assumption.

So next time you're thinking about hedging all your bets on a single company, remember: resilience is the new ROI. 😉

The investment and engineering communities must pivot from prioritising sheer scale and convenience to demanding operational resilience, because today's downtime isn't just a technical glitch... it's a financial and strategic failure.

#buildbetter #scalefaster #failless
Mangrove
-
We like to think of the digital layer as ethereal: an invisible, always-on hum beneath our lives. Yet every so often, something small and stupid, like a bad DNS update or a botched security patch, reminds us that the digital isn’t abstract at all. It’s physical. It’s brittle. It runs everything.

When AWS went down yesterday, flights were delayed, EVs wouldn’t charge, financial trades froze, and warehouse trucks sat idle. A few corrupted entries in a table, and suddenly the gears of modern life seized. It’s the same uncanny feeling as the CrowdStrike outage last year, when a faulty software update bricked millions of Windows machines overnight, or the Colonial Pipeline hack, when bits literally stopped the flow of oil.

These moments are unsettling because they puncture the illusion of permanence. We treat systems like AWS, Azure, or GCP as immutable laws of nature, forces as dependable as gravity, until they aren’t. And that moment of failure exposes how deeply we’ve outsourced resilience: to vendors, to APIs, to infrastructure no one individual understands end-to-end. When a hyperscaler hiccups, it’s not a website that goes dark; it’s supply chains, classrooms, hospitals: life, briefly, buffering.

We live in a civilization built not on stone and steel, but on syntax and dependencies. Progress only compounds fragility. Every new layer of abstraction creates another surface for failure. We are, in the end, a species balancing godlike computation on foundations written in plain text. And when the cloud crashes, our fantasy of control blinks out for a moment and lets us see the scaffolding beneath.

There’s something bracing, almost healthy, about moments like this. A forced reminder that redundancy is everyone’s responsibility: enterprises, consumers, governments, citizens alike. Your car, your bank, your thermostat, your coursework: all hinge on code written by a handful of engineers somewhere you’ll never meet. The lesson is humility: to build as if the cloud can fail, because eventually it will. Outages are not freak accidents; they’re inevitable events in a complex adaptive system.

- Companies should regularly rehearse downtime and map their dependencies.
- Critical systems need second paths, local caches, and the ability to breathe when the network chokes (a small cache-fallback sketch follows this post).
- Governments should ask not only who controls the data, but who controls the defaults.
- And for the rest of us, it’s a quiet prompt to back things up, to know the analog fallback, and to resist the seduction of seamlessness.

These outages are the closest thing we get to a societal fire drill. They shake us awake to the realization that our modern world isn’t virtual; it’s just invisibly physical, improbably balanced, and only as strong as its weakest line of code.
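The “second paths and local caches” point can be as simple as keeping the last good copy of remote data on disk. A small hypothetical sketch in Python (the URL and cache path are placeholders, not a real service):

```python
import json
import time
import urllib.error
import urllib.request

UPSTREAM = "https://api.example.com/rates"   # placeholder upstream service
CACHE_PATH = "/tmp/rates_cache.json"         # placeholder local fallback store

def get_rates(max_age_seconds=300):
    """Prefer fresh upstream data, but serve the last good local copy when the network fails."""
    try:
        with urllib.request.urlopen(UPSTREAM, timeout=3) as resp:
            data = json.load(resp)
        with open(CACHE_PATH, "w") as f:
            json.dump({"fetched_at": time.time(), "data": data}, f)
        return data, False                      # fresh data, not stale
    except (urllib.error.URLError, TimeoutError):
        try:
            with open(CACHE_PATH) as f:
                cached = json.load(f)
        except (OSError, json.JSONDecodeError):
            raise RuntimeError("upstream is down and no local copy exists")
        stale = time.time() - cached["fetched_at"] > max_age_seconds
        return cached["data"], stale            # possibly stale, clearly flagged
```

The caller gets an explicit staleness flag, so the system can “breathe” on old data while making it clear that the data is degraded.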
-
Yesterday Cloudflare experienced its worst outage since 2019, and a significant portion of the internet went dark for 3 hours. How did this happen?

🌩️ Cloudflare was upgrading their ClickHouse database security, moving from shared accounts to individual user accounts.
🌩️ This changed how database queries behaved, resulting in duplicate data (the DB query did not have a "WHERE" clause to limit it to a single database).
🌩️ This duplicate data went into a "feature file" used by their Bot Management system, making the file twice as large as expected.
🌩️ The Bot Management software had a hardcoded limit of 200 features, but the file now contained more than 200, causing servers to crash with a "panic" error when they tried to load the oversized file.
🌩️ This bad file was automatically distributed to all Cloudflare servers worldwide within minutes.
🌩️ Users trying to visit Cloudflare-protected sites got "Internal Server Error" pages from 11:28 UTC to 14:30 UTC.

This was initially suspected to be a DDoS attack; in fact, there was a massive 15 Tbps UDP flood Aisuru botnet attack against a Microsoft Azure IP address in Australia the day before. There are many similarities to last month's massive AWS us-east-1 outage that started with DynamoDB, and I am surprised (shocked?) that it was not caused by DNS.

There are some useful lessons in resilience here (a sketch of the second and third follows this post):
✅ Use incremental rollout when deploying changes.
✅ Validate inputs; you cannot afford to make assumptions with mission-critical systems.
✅ Degrade gracefully when an error condition is encountered (better error handling).
✅ Critical systems need Failure Modes, Effects and Criticality Analysis (FMECA, something I learned at engineering school) to understand how a small change in one system could bring down many others, then apply change control accordingly!

We increasingly exist in a digital monoculture, where Single Points of Failure (SPOFs) can impact most of the Internet. How are you avoiding being taken offline by a SPOF?
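A minimal sketch of the "validate inputs, degrade gracefully" lessons: reload a feature file only if it passes checks, and otherwise keep the last known good version instead of crashing. The file format and the 200-entry limit are assumptions that mirror the post, not Cloudflare’s actual implementation.

```python
import json
import logging

MAX_FEATURES = 200      # assumed hard limit, mirroring the figure in the post
_current_features = []  # last known good configuration, kept in memory

def reload_features(path: str) -> None:
    """Swap in a new feature file only if it passes validation; otherwise keep serving
    the last known good version instead of crashing the process."""
    global _current_features
    try:
        with open(path) as f:
            candidate = json.load(f)
        if not isinstance(candidate, list):
            raise ValueError("feature file must be a JSON list")
        if len(candidate) > MAX_FEATURES:
            raise ValueError(f"{len(candidate)} features exceeds the limit of {MAX_FEATURES}")
        _current_features = candidate
    except (OSError, ValueError) as err:
        # Degrade gracefully: alert loudly, but keep handling traffic with the old config.
        logging.error("feature reload rejected, keeping previous config: %s", err)
```

Rejecting a bad config and carrying on with the previous one turns a global crash into a monitoring alert, which is the difference the post is pointing at.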