GitHub just logged two 9's for a while: 89.91% uptime, with 95 incidents in the last 90 days. That's roughly one incident per day. For reference, "three 9's" (99.9%) means about 8.76 hours of downtime per year. At 89.91%, GitHub is on pace for roughly 880 hours of downtime per year. So what's likely going on under the hood?

Actions & CI load: half the internet uses GitHub Actions by default, so the load has grown enormously. Now I get why some companies still prefer Jenkins.

Lots of dependencies: Git LFS, Packages, Copilot, Codespaces, Pages, Actions. These are loosely coupled but share infrastructure. One wobble and the status page lights up.

MySQL at this scale: GitHub famously runs on a heavily customized MySQL stack. Keeping that healthy while the product keeps growing is hard in itself.

All of which shows that the on-call engineer's role is still a crucial one, and that some things remain hard to automate. There may be other factors at play here as well, but these are my thoughts.

#DevOps #SRE #GitHub #BackendEngineering #SoftwareEngineering
GitHub's 89.91% uptime: Exploring the challenges of scale
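The availability arithmetic in the post is easy to reproduce. A minimal sketch in Python (the 89.91% figure is the post's community-tracked number, not an official SLA):

```python
# Convert an availability percentage into expected downtime per year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [
    ("three 9's", 99.9),            # the classic SLA target
    ("two 9's", 99.0),
    ("GitHub (reported)", 89.91),   # figure quoted in the post above
]:
    print(f"{label:>18}: {downtime_hours_per_year(pct):7.1f} hours/year")
```

At 89.91% the annualized figure works out to about 884 hours; the post's 876 hours is the number for a flat 90%, so it's in the right ballpark.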
More Relevant Posts
-
GitHub's CTO Vladimir Fedorov apologized for reliability this morning. Hours later, GitHub disclosed that a single git push could hijack its servers. Two posts. Same morning. Same platform.

April 23: a merge queue regression silently reverted 2,092 pull requests across 658 repositories. Squash commits were generated from the wrong base. Merged code appeared to have never existed.

April 27: search across PRs and issues fell over, likely under botnet load.

April 28: Wiz researchers showed that a crafted git push could run code on GitHub's servers, outside any sandbox. CVE-2026-3854. CVSS 8.7. Cross-tenant access to millions of repos. github.com was patched in 75 minutes; 88% of GitHub Enterprise Server instances are still vulnerable.

Git is the boring layer. It is supposed to be deterministic. What you push is what's stored. What's merged stays merged. A push is a push. Three breaches of that contract in five days, on a platform that has not had a CEO in nearly a year.

Fedorov ends his availability post with: "availability first, then capacity, then new features." That order is itself the news.

Would appreciate a follow if you want to read more interesting tech stories coming out of this AI era.

#GitHub #Security #Engineering #Infrastructure #DevOps
-
I built a lightweight alternative to the popular GitOps platforms entirely on top of GitHub Actions and Docker, so I never have to SSH into my home server to deploy anything. The stack is two small tools I built (github-multi-runner and docker-compose-deploy) plus the convention of keeping a real compose file in every service repo. No control plane, no Kubernetes, and totally free on GitHub. Pretty fun pattern if you run your own stuff at home.
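I haven't seen the internals of the author's tools, but the deploy step in this kind of setup usually reduces to "sync the repo, let Compose reconcile." A minimal sketch of that step, as a hypothetical stand-in for docker-compose-deploy (assumes git and Docker Compose v2 on the self-hosted runner):

```python
# Sketch of the deploy step such a setup runs on a self-hosted runner:
# fast-forward the checkout, then let Docker Compose reconcile services.
# Hypothetical stand-in for the author's docker-compose-deploy tool.
import subprocess
from pathlib import Path

def deploy(repo_dir: str) -> None:
    repo = Path(repo_dir)
    # Fast-forward the checkout to the commit that triggered the run.
    subprocess.run(["git", "-C", str(repo), "pull", "--ff-only"], check=True)
    # Compose diffs the file against running containers and only recreates
    # what changed -- the "GitOps-lite" reconciliation step.
    subprocess.run(
        ["docker", "compose", "up", "-d", "--remove-orphans"],
        cwd=repo, check=True,
    )

if __name__ == "__main__":
    deploy("/srv/my-service")  # path is illustrative
```

The nice property is that the compose file in the repo is the single source of truth: a push triggers the workflow, and the runner converges the host to whatever the file says.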
-
> GitHub stopped updating its own status page due to terrible availability ... 90.1% uptime - this means ... issues/degradations for ~2.5 hours daily ...

> GitHub struggles to keep up with the increase in load from AI agents generating more code and pull requests ... Claude Code bot contributions growth in the past 3 months has been enormous ... Stream of outages ...

https://lnkd.in/eYHzasTh
-
I honestly thought platforms of GitHub's scale would have solved the core scaling riddles by now. Turns out, even they are battling fundamental architectural demons.

The article dives into GitHub's recent availability woes: cascading failures, tight coupling, and an inability to shed load effectively. Imagine your critical CI/CD pipelines suddenly grinding to a halt because an authentication database choked. That's the kind of disruption they're talking about.

This isn't just another post-mortem. It's a stark reminder that rapid growth amplifies every architectural shortcut. What we often chalk up to "bad luck" in our own systems is frequently a predictable outcome of underlying structural choices. GitHub's candidness about "tight coupling" and "insufficient isolation" really hit home.

In digital financial services, I've seen how a seemingly minor dependency, like a third-party fraud detection API, can ripple through an entire transaction flow if not properly isolated and rate-limited. We often talk about microservices as the panacea, but if they're still tightly coupled at the data or authentication layer, you haven't solved the problem; you've just distributed it.

Their challenge with effective backpressure mechanisms and load shedding, especially as AI-driven tooling piles on, highlights that just throwing more compute at it isn't enough. You need smart circuit breakers and adaptive traffic management, not just reactive scaling.

The article also points out the disparity between official status pages and real-world developer experience, which resonates deeply with anyone who's ever debugged a "green" system.

How do you proactively build resilience against these kinds of systemic failures when your system is growing at warp speed? What's your secret sauce for true architectural isolation? https://lnkd.in/e3EmWfjz

#SystemArchitecture #Scalability #ReliabilityEngineering #DevOps
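The circuit breaker mentioned above is a concrete, well-known pattern. A minimal sketch in Python, purely illustrative (not GitHub's implementation; thresholds are made up):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a
    cool-down period instead of letting requests queue up behind it."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Shed load fast instead of piling onto a sick dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the breaker
        return result
```

The point isn't this exact class; it's that failing fast at the caller is what keeps one choked dependency (the authentication database in the article's example) from cascading into everything behind it.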
-
GitHub restructured its status page a while back, making it significantly harder to get a clear picture of platform-wide uptime. The unofficial reconstruction at The Missing GitHub Status Page (community-built, since GitHub won't publish an aggregate) reports 89.31% uptime as of this post. When your reliability is so bad the scoreboard has to be crowd-sourced, that's not a technical problem. 𝗧𝗵𝗮𝘁'𝘀 𝗮𝗻 𝗮𝗱𝗺𝗶𝘀𝘀𝗶𝗼𝗻.

Meanwhile:
* A Claude Code agent recently deleted a developer's production database and 2.5 years of volume snapshots.
* Amazon is holding emergency meetings about production outages, while spending $200B on AI infrastructure.

𝗧𝗵𝗶𝘀 𝗶𝘀𝗻'𝘁 𝗮 𝗯𝗮𝗱 𝗾𝘂𝗮𝗿𝘁𝗲𝗿. 𝗧𝗵𝗶𝘀 𝗶𝘀 𝗮 𝗽𝗮𝘁𝘁𝗲𝗿𝗻. Companies are shipping much more code without making matching investments in operational discipline.

𝗘𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝘀𝗼𝗳𝘁𝘄𝗮𝗿𝗲 𝗲𝘅𝗶𝘀𝘁𝘀 𝘁𝗼 𝘀𝗼𝗹𝘃𝗲 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀. You can't do that if your service is down, your CI tooling is broken again, or an AI agent just nuked your database.

𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝘀 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗳𝗲𝗮𝘁𝘂𝗿𝗲. If you're not measuring it, making business decisions with it, and saying "no" based on it, you're gambling with customer trust.

𝗪𝗲'𝘃𝗲 𝘁𝗼𝘁𝗮𝗹𝗹𝘆 𝗹𝗼𝘀𝘁 𝘁𝗵𝗲 𝗽𝗹𝗼𝘁.
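"Measuring it" can start as simply as an error budget. A quick sketch, pitting the community-reported 89.31% figure from the post against a hypothetical 99.9% SLO (both numbers are assumptions for illustration, not GitHub's actual targets):

```python
# Error-budget math: downtime the SLO allows vs. downtime observed.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo_pct: float) -> float:
    """Downtime allowed per 30-day window under the SLO."""
    return MINUTES_PER_MONTH * (1 - slo_pct / 100)

def budget_consumed(slo_pct: float, observed_pct: float) -> float:
    """Fraction of the error budget burned (1.0 = fully spent)."""
    downtime = MINUTES_PER_MONTH * (1 - observed_pct / 100)
    return downtime / error_budget_minutes(slo_pct)

slo, observed = 99.9, 89.31  # hypothetical SLO vs. community-reported uptime
print(f"budget: {error_budget_minutes(slo):.0f} min/month")        # ~43 min
print(f"burned: {budget_consumed(slo, observed):.0f}x the budget")  # ~107x
```

When the burn rate is in triple digits, "saying no based on it" stops being a judgment call.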
-
Seeing this reliability issue come up more and more is very concerning. Will we look back at the heady days of the early AI era as a time when so many people forgot the hard-learned lessons of running production systems?
-
Just shipped a full production-style web app deployment using Terraform + Ansible roles, and this one fought back. 💪

For this project, I provisioned an Azure Ubuntu VM with Terraform (Australia East) and used structured Ansible roles to deploy the EpicBook Node.js app end to end:

🔧 common — system prep, Node.js 18, PM2, SSH hardening
🗄️ mysql — MySQL install, database creation, root auth configuration
🌐 nginx — reverse proxy config via Jinja2 template, handler-driven reloads
📦 epicbook — git clone, npm install, DB seeding, PM2 app start

The project looked straightforward on paper. In practice, I ran into several distinct issues before getting a clean changed=0 run:

- The repo turned out to be a Node.js/Express app, not the static site I expected, so I had to pivot the nginx config from a static file server to a reverse proxy mid-assignment.
- The VM crashed during MySQL installation because Standard_B1s simply doesn't have enough resources for it; I scaled up and carried on.
- MySQL's default auth_socket on Ubuntu 22.04 broke idempotency on the root password task.
- Git's safe.directory security feature blocked re-runs after ownership changed to www-data.
- The dpkg lock from Ubuntu's background auto-updater stalled the first apt run.

Every single one of these is something you'd encounter in a real production deployment. That's the point.

The real milestone: running the playbook a second time and seeing ok=21 changed=0 failed=0 across the board. That's idempotency, and it's the difference between a script and a production-ready playbook (a sketch of the idea below).

🌍 Site live at: http://20.92.86.197
🔗 Medium post: https://lnkd.in/eqUiGcD8

Big shoutout to my mentors Pravin Mishra, lead co-mentor Praveen Pandey, and co-mentors Abhishek Makwana and Mobarak Hosen.

P.S. Part of the DevOps Micro Internship (DMI) Cohort 2. Join the community on Discord: https://lnkd.in/dQYAuQ7j

#Ansible #Terraform #Azure #DevOps #IaC #ConfigurationManagement #NodeJS #LearningInPublic #CloudEngineering #DMI
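That check-before-change discipline is what Ansible modules do under the hood. A minimal sketch of the idea in Python (illustrative of the pattern, not Ansible's actual module code; the path and config line are made up):

```python
# Idempotent "ensure a line is in a file", in the spirit of an Ansible
# task: inspect current state first, act only if it differs, and report
# changed/unchanged -- so a second run yields changed=0.
from pathlib import Path

def ensure_line(path: str, line: str) -> bool:
    """Ensure `line` exists in the file; return True if anything changed."""
    f = Path(path)
    current = f.read_text().splitlines() if f.exists() else []
    if line in current:
        return False  # desired state already holds: do nothing
    f.write_text("\n".join(current + [line]) + "\n")
    return True

if __name__ == "__main__":
    changed = ensure_line("/tmp/demo.conf", "PermitRootLogin no")  # illustrative
    print("changed" if changed else "ok (no change)")
    # Run it again: the second call reports no change -- idempotency.
```

The auth_socket and safe.directory failures above are exactly what happens when a task's "check current state" step makes an assumption the OS no longer satisfies.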