If you’re a Cloud Engineer, here’s the Azure Storage knowledge that will actually move the needle for you in 2026. Not because of hype: resilience and availability are no longer “nice to have.” They’re core architecture skills. Here’s what will truly give you an edge:

Locally Redundant Storage (LRS)
↳ Your data gets 3 copies inside a single datacenter in the primary region.
↳ Ideal for cost-optimized workloads, but you’re still exposed if the whole datacenter goes down.

Zone-Redundant Storage (ZRS)
↳ Data is synchronously copied across three availability zones in the same region.
↳ High durability and protection from zone failures without leaving the region.

Geo-Redundant Storage (GRS)
↳ Microsoft replicates your data asynchronously from the primary region to a paired secondary region.
↳ Even if your entire primary region experiences an outage, your data is still safe and recoverable.

Geo-Zone-Redundant Storage (GZRS)
↳ The strongest redundancy tier: ZRS within the primary region plus geo-replication to a secondary region.
↳ Designed for mission-critical workloads that can’t afford zonal or regional downtime.

If you understand when to use LRS, ZRS, GRS, and GZRS, you’re already ahead of 90% of engineers designing cloud-native systems. (A minimal SDK sketch of picking a tier follows below.)
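To make the tiers concrete, here is a minimal sketch using the azure-mgmt-storage SDK to create an account at a chosen redundancy level. The subscription ID, resource group, account name, and region are placeholders; switching tiers is mostly just the SKU string (Standard_LRS, Standard_ZRS, Standard_GRS, or Standard_GZRS).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    Kind, Sku, StorageAccountCreateParameters,
)

# Placeholders: substitute your own subscription, group, and account name.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="rg-demo",
    account_name="stgzrsdemo001",  # must be globally unique, lowercase
    parameters=StorageAccountCreateParameters(
        sku=Sku(name="Standard_GZRS"),  # the strongest tier described above
        kind=Kind.STORAGE_V2,
        location="westeurope",  # GZRS requires a region with availability zones
    ),
)
account = poller.result()
print(account.name, account.sku.name)
```

Cost rises with each tier, so the usual design move is GZRS for the systems of record and LRS for rebuildable scratch data.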
Redundancy Strategies for Hosting
Explore top LinkedIn content from expert professionals.
Summary
Redundancy strategies for hosting are methods used to ensure websites and applications stay available and resilient, even if parts of the system fail. By duplicating resources and spreading them across different locations or providers, businesses can minimize downtime and protect against unexpected outages.
- Distribute across providers: Host your sites or workloads on different cloud services and domain registrars to reduce reliance on any single company.
- Utilize regional backups: Store copies of your data and applications in multiple geographic locations so you can quickly recover if one area experiences a disruption.
- Test failover setups: Regularly simulate outages and switch traffic to backup systems to make sure your redundancy plans work when you need them most.
Here’s how massively scaled systems like Google Cloud, AWS, Meta, and Netflix hide their system failures from millions of users in plain sight.

I can tell you for a fact that no distributed system on earth is truly failure-proof. Amazon S3 went down in 2017 and took half the internet with it; even the AWS status page was affected. In 2021, Facebook’s global outage knocked WhatsApp, Instagram, and Facebook offline for hours, leaving billions unable to connect. Even Google Cloud has had multi-region disruptions that hit major apps worldwide.

But here’s the crazy part: for every headline-grabbing outage, a thousand failures are happening quietly in the background every day. The reason you don’t notice most of them? Engineering teams are masters at hiding failures, building systems so resilient that users rarely see when things go wrong. Here are some strategies they use:

1. Use Circuit Breakers
Stops dependent services from crashing your entire system. When Service A relies on Service B and B is down, A quickly "trips" and stops making requests to B, avoiding system overload. Fallback isn’t always about cached data. Sometimes the fallback is just a fast, clear failure (“Sorry, service unavailable”) rather than degraded or stale functionality.

2. Graceful Timeouts and Retries
Set timeouts so your system never hangs forever waiting for a response. Use exponential backoff with jitter to retry failed requests; this avoids overwhelming struggling services. Always cap the max number of retries and the total retry window. Unbounded retries can make a bad situation much worse. Example: after 3 attempts and 10 seconds total, just return an error instead of looping forever. (See the sketch after this post.)

3. Cache Strategically
- Serve stale but usable data when the real-time system is unavailable.
- Cache frequently accessed data like user profiles or product listings.
- Example: if your database goes down, serve product details from your cache until it’s back online.

4. Load Shedding
- Protect critical services by gracefully shedding non-critical traffic.
- Prioritize requests that matter most (e.g., checkout requests) and temporarily block less critical ones (e.g., recommendations).

5. Static Fallbacks
- Prepare your system to serve static responses when dynamic systems fail.
- Host a static version of your website on a CDN for critical flows like "Order Placed" or "Thank You" pages.

6. Queue-Based Workflows
Decouple user requests from backend processing using queues; this smooths out spikes and lets you recover gracefully. Crucially, make all queued tasks idempotent: if a request is retried or re-queued, it must not create duplicates or inconsistent states. Example: during a flash sale, add user orders to a queue and process them one by one, sending notifications only once per order, even if the system retries.

Continued in Comments...
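A minimal sketch of point 2 in Python: capped retries with exponential backoff and full jitter. The 3-attempt / 10-second budget mirrors the example above; the defaults and the wrapped call are illustrative assumptions, not any particular company's settings.

```python
import random
import time

def call_with_retries(fn, max_attempts=3, total_budget_s=10.0, base_delay_s=0.5):
    """Retry fn() with exponential backoff and full jitter.

    Gives up after max_attempts tries or total_budget_s seconds,
    whichever comes first, so a struggling dependency is never
    hammered forever.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            elapsed = time.monotonic() - start
            if attempt == max_attempts or elapsed >= total_budget_s:
                # Fast, clear failure beats an endless hang.
                raise RuntimeError("service unavailable") from exc
            # Full jitter: sleep a random amount up to the exponential cap,
            # so many clients don't retry in lockstep.
            delay = random.uniform(0, base_delay_s * (2 ** (attempt - 1)))
            time.sleep(min(delay, total_budget_s - elapsed))

# Usage: wrap any flaky dependency call.
# profile = call_with_retries(lambda: fetch_profile(user_id))
```

Full jitter (random between zero and the cap) is the variant that best avoids synchronized retry storms against an already struggling service.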
-
We often confuse High Availability (HA) with Disaster Recovery (DR). In a standard 3-tier architecture, knowing the difference is what saves your job during a major outage. Let's break down the classic stack, where the Single Points of Failure (SPoF) hide, and how to build a DR strategy that actually works.

1️⃣ The "Standard" 3-Tier Context
Most cloud-native apps follow this logical flow:
Presentation Tier: the entry point (ALB, Nginx, React) handling user traffic.
Application Tier: the business logic (EC2, Lambda, Python/Java) processing the requests.
Data Tier: the source of truth (RDS, DynamoDB) storing the state.
It looks clean on a whiteboard. But if you deploy this naively into a single Availability Zone (AZ), you are walking on thin ice.

2️⃣ Where the Single Points of Failure Hide
Many teams think, "I have an Auto Scaling Group, so I'm safe." Wrong. Here is where the architecture breaks under pressure:
🚩 The Database (the obvious SPoF): a single RDS instance. If the hardware fails or patching hangs, your entire application stops.
🚩 The Network (the hidden SPoF): relying on a single NAT Gateway for all private subnets. If that one gateway has an issue, your app servers lose connectivity to third-party APIs.
🚩 The Region (the ultimate SPoF): hosting everything in us-east-1 without a backup. If the region faces a service disruption (like S3 or IAM issues), no amount of local auto-scaling will save you.

3️⃣ The Solution: From Fragile to Anti-Fragile
True resilience requires a two-pronged approach:

Phase A: Local Resilience (High Availability)
Multi-AZ Deployment: spread your EC2s across at least 2 AZs. If one data center loses power, the other takes the load.
Redundant Networking: deploy a NAT Gateway in each AZ so a single gateway failure can't cut off every private subnet.
Database Standby: enable Multi-AZ for RDS. This creates a synchronous standby that fails over automatically, typically in under 60 seconds.

Phase B: Regional Resilience (Disaster Recovery)
This is where you graduate from "HA" to "DR." If the region goes dark, you need a plan.
The Pilot Light Strategy: replicate your data (RDS read replicas + S3 replication) to a secondary region (e.g., us-west-2). Keep the compute resources "off" or minimal to save costs.
DNS Failover: use Route 53 to health-check your primary region. If it fails, flip the traffic to the secondary region. (A sketch of this follows below.)

The Bottom Line: resilience isn't just about keeping servers up; it's about assuming they will go down and designing the survival path.

#AWS #SystemDesign #CloudArchitecture #DisasterRecovery #DevOps #Engineering
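To make the DNS failover step concrete, here is a hedged boto3 sketch. The hosted zone ID, domain, and IPs are placeholders, and a real pilot-light setup would usually alias load balancers rather than raw A records.

```python
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0000000EXAMPLE"  # hypothetical hosted zone ID

# Health-check the primary region's endpoint.
check = route53.create_health_check(
    CallerReference="primary-use1-check-001",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "use1.app.example.com",
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

def failover_record(set_id, role, target_ip, health_check_id=None):
    """Build a PRIMARY/SECONDARY failover record for app.example.com."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("use1", "PRIMARY", "198.51.100.10",
                        check["HealthCheck"]["Id"]),
        failover_record("usw2", "SECONDARY", "203.0.113.20"),
    ]},
)
```

When the primary's health check fails, Route 53 starts answering with the secondary record automatically; no human needs to flip anything at 3 AM.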
-
🛡️ How to Protect Your Business from Cloud Outages

The AWS US-EAST-1 outage affected hundreds of services for 20+ hours. Here’s how to ensure your business stays resilient when the cloud fails:

1. Multi-Region Deployment: deploy across multiple AWS regions (US-EAST-1 + US-WEST-2). If one fails, traffic automatically routes to another.
2. Multi-Cloud Strategy: don’t put all your eggs in one basket. Distribute critical workloads across AWS, Azure, and GCP.
3. Robust Monitoring: monitor everything. Use third-party tools, not just provider monitoring. Get alerts before customers complain. (A minimal external probe is sketched after this post.)
4. Graceful Degradation: design systems to operate in reduced-capacity mode. If authentication fails, allow cached credentials temporarily.
5. Database Resilience: replicate databases across regions. Test your failover regularly; untested backups are just hopes.
6. DNS Redundancy: use multiple DNS providers. DNS failures were a root cause of this outage.
7. Disaster Recovery Plan: document runbooks, define RTOs/RPOs, and conduct regular DR drills. Can you restore your app in a different region in under 1 hour?
8. Map Dependencies: know what depends on what. If AWS US-EAST-1 went down right now, do you know exactly what would break?
9. Status Page: keep customers informed during outages. Transparency builds trust.
10. Start Small: you don’t need everything at once. Start with:
• Dependency mapping
• Monitoring & alerting
• One backup region for critical services
• Testing your DR plan

Final Thought 💭
The AWS outage reminded us that the cloud is not infallible. No matter how reliable your provider claims to be (AWS advertises a 99.99% uptime SLA on many services), outages will happen. The question isn’t if the next outage will occur, but when, and whether your business will be ready.

What’s your organization doing to prepare for cloud outages? Share your strategies in the comments! 👇

#CloudComputing #AWS #DisasterRecovery #BusinessContinuity #DevOps #CloudResilience #SRE #TechStrategy #Infrastructure
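For point 3, a minimal standard-library probe you could run from cron on a box outside your primary cloud. The endpoint URLs are hypothetical and the alerting hook is left as a stub.

```python
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoints; replace with your own critical URLs.
ENDPOINTS = [
    "https://use1.app.example.com/health",
    "https://usw2.app.example.com/health",
]

def probe(url, timeout_s=5):
    """Return (ok, detail) for one endpoint, independent of any provider dashboard."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except Exception as exc:
        return False, repr(exc)

for url in ENDPOINTS:
    ok, detail = probe(url)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"{stamp} {'UP  ' if ok else 'DOWN'} {url} ({detail})")
    if not ok:
        # Hook your paging system here (PagerDuty, SMS, etc.), hosted
        # somewhere independent of the cloud being monitored.
        pass
```

The whole point is where it runs: if your monitoring lives in the same region as your workload, the outage takes both down together.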
-
Bomb-proof your website with mirroring

A mirror is like a backup website: not just a backup of the data but of everything, including the server and domain name. I built two additional mirrors for my blog website brandonrohrer.com. The symmetry of triple modular redundancy appeals to me, especially after hearing that it’s a principle used by NASA and growing up on some triples-themed sci-fi. You can take down one, and you still have two backups to work with, plenty of breathing room. If you get really unlucky, two will go down, leaving you with the third. Why build one when you can build three at three times the cost?

I tried to build in as much redundancy as possible. I got three domain names, each from a different domain name registrar.
- brandonrohrer.com from Namecheap, based in Phoenix.
- brandonrohrer.org from Hover, based in Toronto.
- brandonrohrer.at from Hostpoint, based outside Zurich.

These are running on three different virtual private servers (VPSs).
- A VPS located in New York from DigitalOcean, which is headquartered near Denver.
- A VPS located in Prague from OVHcloud, which is headquartered in France near the Belgian border.
- A VPS located in Montreal from Koumbit, which is also headquartered in Montreal.

They all contain the same content, which lives in a repository called blog-website. For additional redundancy, I host this repo on three different git services.
- GitHub, a Microsoft property.
- GitLab, operated out of Utrecht.
- Codeberg, a non-profit in Berlin.

One benefit of having three mirrors is that I can treat one as a staging environment. If I want to make a risky change, such as automatically generating new firewall rules to block annoying traffic, I can do it on my least-traveled mirror and see how it goes. If something goes horribly wrong and takes down that entire server, I can put it back together from scratch, all while the other two mirrors stay operational.

Manually updating three mirrors is a little tedious. It’s easy to see why websites with mirrors or other content distribution networks would automate this. However, several of the larger corporate outages we’ve seen recently happened precisely because of these automated deployment mechanisms. Because there’s one system that touches every website, a single misconfiguration can take down all of them. They aren't isolated from each other. I’m not running a business that needs instant updates, so I can afford to roll mine out slowly, by hand, and take that extra time to get that extra resilience. (A sketch of that one-mirror-at-a-time update loop follows below.)

Spreading services across companies and across continents is a good way to prevent a single event from taking down your site. It's not a good feeling to be dependent on a single organization for your internet real estate. If AWS us-east-1 goes down and takes your site with it, that's a sad day. It's nice to know that your little piece of the web will stay standing through anything short of a global calamity.
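Here is roughly what that slow, manual rollout could look like as a script, sketched in Python with subprocess. The git remote names and server hostnames are stand-ins, not the actual mirrors' addresses.

```python
import subprocess

# Hypothetical remotes/hosts standing in for the three mirrors described above.
GIT_REMOTES = ["github", "gitlab", "codeberg"]
WEB_HOSTS = [
    "mirror-nyc.example.com",
    "mirror-prague.example.com",
    "mirror-montreal.example.com",
]

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Push the blog-website repo to every git service.
for remote in GIT_REMOTES:
    run(["git", "push", remote, "main"])

# Sync the built site to each VPS one at a time, so a bad change
# gets caught on the first (least-trafficked) mirror before it
# ever reaches the other two.
for host in WEB_HOSTS:
    run(["rsync", "-az", "--delete", "site/", f"deploy@{host}:/var/www/html/"])
    input(f"Check {host} in a browser, then press Enter to continue...")
```

The `input()` pause is the whole resilience trick: a human checkpoint between mirrors, instead of one automated pipeline that can break all three at once.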
-
95% of the world’s internet traffic moves through subsea cables. Not satellites. Not clouds. Just glass, laid on ocean floors. Invisible. Critical. Exposed.

This year, four major cables in the Red Sea were cut... and Asia-Europe routes lost 25% of their bandwidth. And no, it wasn’t a war. Just anchors. Fishing nets. Seabed tectonics.

If your DR plan doesn’t include the ocean floor, you’re not ready. Here’s how to build real resilience, with SubCom as your trusted on-the-ground partner:

✧ MAP
Trace your traffic flows. Which cable systems carry your core paths? SEA-ME-WE 6? AAE-1? EIG? Tag landing stations, handoffs, terrestrial hops. Then mirror that in your L3 overlays using SubCom’s telemetry. (A starting-point sketch follows after this post.)

✧ SPLIT
Don’t load-balance across two POPs that land on the same cable. Use SubCom’s open cable design to mix routes, geographies, and owners. Redundancy ≠ resilience unless it spans tectonic and political zones.

✧ DESIGN
SubCom supports wavelength-level reroute. Define fault domains. Automate handoff to your SD-WAN controller. No more manual ticket escalations at 2 AM.

✧ SIMULATE
Run Red Team drills for dual cable cuts. Measure time-to-recovery, not hope. Coordinate with SubCom’s NOC before the anchor drops.

✧ MONITOR
SubCom gives you real-time fault zones, vessel paths, and route degradation. Pipe it into your NOC. Pair it with your IXP metrics. Predict the cut before the outage.

We stress over power and cooling redundancy in data centers. But one snapped fibre under water can drop an entire region. At 400G, there’s no retry logic. There’s signal, or outage. Design accordingly.

What’s your failover plan if the ocean goes dark? Tag your network team; this is the layer nobody’s watching.
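As a starting point for the MAP step, a small sketch that shells out to the system traceroute. The target hostnames are hypothetical, this is not SubCom's telemetry, and mapping hops to actual cable systems still takes manual cross-referencing against public cable maps.

```python
import subprocess

# Hypothetical endpoints whose physical paths you want to map;
# substitute the systems that ride your Asia-Europe routes.
TARGETS = ["api.eu.example.com", "api.sg.example.com"]

def trace(host, max_hops=30):
    """Run the system traceroute and return one line per hop."""
    out = subprocess.run(
        ["traceroute", "-m", str(max_hops), host],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

for host in TARGETS:
    print(f"--- path to {host} ---")
    for hop in trace(host):
        print(hop)
    # Cross-reference hop ASNs and landing-station cities against a public
    # cable map (e.g., submarinecablemap.com) to see which wet segments
    # your "redundant" paths actually share.
```

If two "diverse" paths show the same landing-station hop, you have one cable and an illusion.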
-
The AWS downtime this week shook more systems than expected. Here’s what you can learn from this real-world case study.

1. Redundancy isn’t optional
Even the most reliable platforms can face downtime. Distributing workloads across multiple AZs isn’t enough; design for multi-region failover.

2. Visibility can’t be one-sided
When any cloud provider goes dark, so do its dashboards. Use independent monitoring and alerting to stay informed when your provider can’t.

3. Recovery plans must be tested
A document isn’t a disaster recovery strategy. Inject a little chaos: run failover drills and chaos tests before the real outage does it for you.

4. Dependencies amplify impact
One failing service can ripple across everything. Map critical dependencies and eliminate single points of failure early. (See the dependency-graph sketch after this post.)

These moments are a powerful reminder that reliability and disaster recovery aren’t checkboxes. They’re habits built into every design decision.
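One way to start the dependency-mapping habit from point 4: model services as a graph and let networkx flag the cut vertices. The service names here are invented for illustration.

```python
import networkx as nx

# Hypothetical service dependency edges: (A, B) means "A calls B".
deps = [
    ("web", "auth"), ("web", "catalog"), ("mobile", "auth"),
    ("auth", "user-db"), ("catalog", "search"), ("catalog", "product-db"),
    ("search", "product-db"),
]

g = nx.Graph(deps)  # an undirected view is enough for cut-vertex analysis

# Articulation points: nodes whose loss disconnects the graph.
# These are structural single points of failure worth redundancy first.
spofs = sorted(nx.articulation_points(g))
print("Single points of failure:", spofs)
```

A real dependency map would come from tracing or service-mesh data rather than a hand-typed list, but even this toy version makes "what breaks if X dies?" a query instead of a guess.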
-
"The protection had become the outage." I wrote this exact phrase in my recent article on DDoS defense. Last week, we saw a different version of it play out on a global scale. On February 20th, a configuration bug at Cloudflare unintentionally withdrew BGP routes for over 1,100 Bring Your Own IP (BYOIP) prefixes. For over 6 hours, impacted enterprise services simply vanished from the global routing table. I’m not bringing this up to throw stones: at massive scale, complex systems will inevitably fail. I'm bringing this up because it perfectly illustrates the architectural vulnerability I've been warning about: When you rely on "Always-On" external scrubbing centers, you surrender your inbound network sovereignty. By routing 100% of your inbound traffic through a third-party black box you become a "digital tenant." If their automation pushes a bad config, you go offline globally. You're left suffering from "Scrubbing Blindness," refreshing a vendor's status page while your customers drown. You cannot bolt DDoS protection onto a fragile network. Your connectivity strategy IS your security strategy. If you want an architecture that survives when everything else fails, here is the blueprint to reclaim your edge: 🛑 1. On-Demand > Always-On > External scrubbing (Cloudflare, Lumen, Akamai) is the nuclear option, not your default route. Keep your baseline traffic fast, local, and on your sovereign edge. Use BGP communities to dynamically trigger scrubbing only when an attack exceeds your local 100G capacity 🌍 2. The Diversity Dividend > Connecting to 1-2 upstream providers is an illusion of redundancy. Distribute your ingress across a mix of Tier-1s, regional providers, and IXPs. Multiple entry points give you the BGP maneuverability to route around vendor failures and exponentially increase an attacker's cost. 🔪 3. Surgical Scrubbing (The /32 Trick) Stop sending your entire /24 to the scrubber just because one IP is attacked! And avoid the RTBH trap ("Voluntary Extinction"). Instead, polarize the /24 and use BGP automation to inject a /32 slice exclusively to your scrubbing provider. Mitigate the target while the rest of your network traverses clean, low-latency transit links. ⚡ 4. Live at DEFCON 2 You don't react to attacks; your network does. Run a "Three-Speed" detection stack (Kentik + Akvorado + FastNetMon). Have your BGP Flowspec rules templated and your automated BGP community tags ready to fire in sub-seconds before human hands even touch a keyboard. If you’re relying entirely on a default route to a cloud provider, it’s time to become the landlord. Defense must be designed in, not bolted on after an outage teaches you what you forgot. 📖 Dive into my full blueprints for building a combat-ready network in the comment #BGP #NetworkEngineering #Cloudflare #Outage #DDoSProtection #NetworkSecurity #CyberSecurity #NetworkArchitecture #InternetRouting #EdgeSovereignty
-
𝐋𝐞𝐬𝐬𝐨𝐧𝐬 𝐟𝐫𝐨𝐦 𝐭𝐡𝐞 𝐀𝐖𝐒 𝐮𝐬-𝐞𝐚𝐬𝐭-𝟏 𝐎𝐮𝐭𝐚𝐠𝐞: 𝐃𝐞𝐬𝐢𝐠𝐧𝐢𝐧𝐠 𝐚 𝐌𝐮𝐥𝐭𝐢-𝐂𝐥𝐨𝐮𝐝 𝐒𝐞𝐫𝐯𝐞𝐫𝐥𝐞𝐬𝐬 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 𝐟𝐨𝐫 𝐑𝐞𝐬𝐢𝐥𝐢𝐞𝐧𝐜𝐞

When the AWS us-east-1 outage disrupted major global platforms last year, it was a wake-up call for every architect and engineer: no single cloud can guarantee 100% uptime. That incident underscored the need for multi-cloud resilience, where systems can shift workloads intelligently between providers like AWS and Azure without impacting end-user experience. In response, we designed a multi-cloud, serverless, GitOps-driven architecture that embodies the Well-Architected Framework principles, balancing reliability, performance efficiency, cost optimization, and operational excellence across clouds.

𝐃𝐚𝐭𝐚𝐟𝐥𝐨𝐰: The user’s app connects seamlessly from any source to our gateway app, which distributes requests equally between Azure and AWS. This dual-cloud setup ensures both robustness and availability, with all responses routed through an API Manager gateway for a unified and smooth experience. (A minimal sketch of the failover behavior follows after this post.)

𝐓𝐡𝐞 𝐒𝐞𝐫𝐯𝐞𝐫𝐥𝐞𝐬𝐬 𝐅𝐫𝐚𝐦𝐞𝐰𝐨𝐫𝐤: At the core of this architecture is the Serverless Framework. It abstracts infrastructure complexity, automates deployments, and supports GitOps-driven workflows, enabling a truly multi-cloud serverless deployment model that’s scalable and cloud-agnostic.

𝐂𝐈/𝐂𝐃 𝐰𝐢𝐭𝐡 𝐆𝐢𝐭𝐎𝐩𝐬: The CI/CD pipeline is built around GitOps principles, automating build, test, and deploy stages across multiple cloud providers. It ensures that code changes flow securely and reliably, maintaining consistency and compliance throughout the delivery process.

𝐏𝐨𝐭𝐞𝐧𝐭𝐢𝐚𝐥 𝐔𝐬𝐞 𝐂𝐚𝐬𝐞𝐬:
- Build cloud-agnostic APIs for client applications running across environments.
- Deploy microservices to multiple cloud platforms with a single manifest file.
- Maintain cross-cloud redundancy to prevent downtime during regional failures.
- Run serverless functions in the most cost-efficient or lowest-latency region dynamically.

𝐁𝐥𝐮𝐞-𝐆𝐫𝐞𝐞𝐧 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭: Each cloud platform hosts two duplicate sets of microservices, creating active-passive environments that allow instant failover. This approach ensures continuous availability and low-risk deployments across cloud regions and providers.

In today’s world, multi-cloud is not just a choice; it’s a necessity for businesses aiming to stay resilient, cost-optimized, and future-ready. The Serverless Framework, combined with GitOps and Well-Architected principles, helps achieve just that.

💡 Follow me for upcoming posts where I’ll share new, innovative architecture blueprints: real-world examples showing how to design well-architected, reliable, and cost-efficient infrastructure for your business platforms.

#cloudcomputing #aws #azure #cloudarchitecture #serverless #gitops #multicloud #devops #wellarchitected
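A minimal sketch of the dual-cloud failover behavior such a gateway provides, using only the Python standard library. Both endpoint URLs are hypothetical, and a production gateway would add per-backend health tracking and actual load distribution rather than fixed ordering.

```python
import urllib.request

# Hypothetical per-cloud endpoints behind the gateway described above.
BACKENDS = [
    "https://api-aws.example.com",    # AWS side (e.g., API Gateway + Lambda)
    "https://api-azure.example.com",  # Azure side (e.g., APIM + Functions)
]

def fetch(path, timeout_s=3):
    """Try each cloud in turn; return the first healthy response.

    The failover core is just this loop: if one provider is down or
    slow, the request transparently lands on the other.
    """
    last_error = None
    for base in BACKENDS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return resp.read()
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"all clouds failed: {last_error!r}")

# Usage: body = fetch("/v1/orders/123")
```

The short timeout matters as much as the loop: a provider that hangs is indistinguishable from one that is down unless you bound the wait.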
-
Earlier this week, a major AWS outage disrupted services across the globe, affecting giants like Netflix, Slack, and even parts of Amazon itself. If you noticed websites loading endlessly or apps refusing to respond, that’s what happens when a large portion of the internet’s backbone takes a break.

𝐋𝐞𝐭’𝐬 𝐛𝐫𝐞𝐚𝐤 𝐭𝐡𝐢𝐬 𝐝𝐨𝐰𝐧 👇

What really happened? The issue originated from us-east-1, AWS’s most heavily used region. A minor network disruption there triggered cascading failures across EC2, RDS, and ELB services. To put this in perspective: 𝟑𝟑% 𝐨𝐟 𝐚𝐥𝐥 𝐀𝐖𝐒 𝐰𝐨𝐫𝐤𝐥𝐨𝐚𝐝𝐬 𝐫𝐮𝐧 𝐢𝐧 𝐭𝐡𝐚𝐭 𝐫𝐞𝐠𝐢𝐨𝐧 𝐚𝐥𝐨𝐧𝐞. So when us-east-1 goes down, a huge slice of the internet goes down with it.

𝐑𝐞𝐚𝐥-𝐰𝐨𝐫𝐥𝐝 𝐢𝐦𝐩𝐚𝐜𝐭:
* Streaming platforms like Netflix experienced buffering issues.
* Internal tools on Slack and Atlassian Cloud became unreachable.
* Even smart devices like Alexa stopped responding to commands.

𝐖𝐡𝐚𝐭 𝐜𝐚𝐧 𝐰𝐞 𝐥𝐞𝐚𝐫𝐧 𝐚𝐬 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐬?
Resiliency isn’t about preventing failure; it’s about designing to survive it. Here’s what tech leaders and DevOps teams should plan for:
1. Multi-region redundancy: spread your workloads; don’t let one region own your uptime.
2. Chaos Engineering: simulate outages before they happen. Netflix’s “Chaos Monkey” remains a gold standard. (A tiny drill in that spirit is sketched after this post.)
3. Observability-first mindset: build dashboards that alert you before your users do.
4. Backup communication plans: when your monitoring and alerting depend on AWS, ensure they can survive AWS being down.

Cloud reliability isn’t just a DevOps issue anymore; it’s a business continuity issue. Curious to hear: did your team face any production challenges during this outage?
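In the Chaos Monkey spirit from point 2, a hedged boto3 sketch that terminates one random opt-in instance to see whether auto scaling and your alerts actually behave. The tag key is an assumption, and this belongs in staging behind explicit opt-in, never in production by default.

```python
import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Only consider running instances explicitly tagged as chaos targets.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},  # assumed tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim} to test recovery...")
    ec2.terminate_instances(InstanceIds=[victim])
else:
    print("No opt-in instances found; nothing to break today.")
```

If the fleet self-heals and nobody gets paged incorrectly, the drill passed; if it doesn't, you just learned it on a Tuesday afternoon instead of during the next regional outage.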