GitHub just logged two 9's for a while: 89.91% uptime, with 95 incidents in the last 90 days. That's roughly one incident per day. For reference, "three 9's" (99.9%) means about 8.76 hours of downtime per year. At 89.91%, GitHub is on pace for roughly 880 hours of downtime per year. So what's likely going on under the hood?

Actions & CI load: half the internet uses GitHub Actions by default, so the load has grown enormously. Now I get why some companies still prefer Jenkins.

Lots of dependencies: Git LFS, Packages, Copilot, Codespaces, Pages, Actions. These are loosely coupled but share infrastructure. One wobble and the status page lights up.

MySQL at this scale: GitHub famously runs on a heavily customized MySQL stack. Keeping that healthy while the product keeps growing is hard in itself.

All of which shows that the on-call engineer's role is still a crucial one, and that some things remain hard to automate. There may be other factors at play here as well, but these are my thoughts.

#DevOps #SRE #GitHub #BackendEngineering #SoftwareEngineering
GitHub's 89.91% uptime: Exploring the challenges of scale
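The availability arithmetic in the post is easy to reproduce. A minimal sketch in Python (the 89.91% figure is the post's community-tracked number, not an official SLA):

```python
# Convert an availability percentage into expected downtime per year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [
    ("three 9's", 99.9),            # the classic SLA target
    ("two 9's", 99.0),
    ("GitHub (reported)", 89.91),   # figure quoted in the post above
]:
    print(f"{label:>18}: {downtime_hours_per_year(pct):7.1f} hours/year")
```

At 89.91% the annualized figure works out to about 884 hours; the post's 876 hours is the number for a flat 90%, so it's in the right ballpark.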
More Relevant Posts
-
GitHub's CTO Vladimir Fedorov apologized for reliability this morning. Hours later, GitHub disclosed that a single git push could hijack its servers. Two posts. Same morning. Same platform.

April 23: a merge queue regression silently reverted 2,092 pull requests across 658 repositories. Squash commits were generated from the wrong base. Merged code appeared to have never existed.

April 27: search across PRs and issues fell over, likely under botnet load.

April 28: Wiz researchers showed that a crafted git push could run code on GitHub's servers, outside any sandbox. CVE-2026-3854. CVSS 8.7. Cross-tenant access to millions of repos. github.com was patched in 75 minutes; 88% of GitHub Enterprise Server instances are still vulnerable.

Git is the boring layer. It is supposed to be deterministic. What you push is what's stored. What's merged stays merged. A push is a push. Three breaches of that contract in five days, on a platform that has not had a CEO in nearly a year.

Fedorov ends his availability post with: "availability first, then capacity, then new features." That order is itself the news.

Would appreciate a follow if you want to read more interesting tech stories coming out of this AI era.

#GitHub #Security #Engineering #Infrastructure #DevOps
-
I built a lightweight alternative to the popular GitOps platforms entirely on top of GitHub Actions and Docker, so I never have to SSH into my home server to deploy anything. The stack is two small tools I built (github-multi-runner and docker-compose-deploy) plus the convention of keeping a real compose file in every service repo. No control plane, no Kubernetes, and totally free on GitHub. Pretty fun pattern if you run your own stuff at home.
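I haven't seen the internals of the author's tools, but the deploy step in this kind of setup usually reduces to "sync the repo, let Compose reconcile." A minimal sketch of that step, as a hypothetical stand-in for docker-compose-deploy (assumes git and Docker Compose v2 on the self-hosted runner):

```python
# Sketch of the deploy step such a setup runs on a self-hosted runner:
# fast-forward the checkout, then let Docker Compose reconcile services.
# Hypothetical stand-in for the author's docker-compose-deploy tool.
import subprocess
from pathlib import Path

def deploy(repo_dir: str) -> None:
    repo = Path(repo_dir)
    # Fast-forward the checkout to the commit that triggered the run.
    subprocess.run(["git", "-C", str(repo), "pull", "--ff-only"], check=True)
    # Compose diffs the file against running containers and only recreates
    # what changed -- the "GitOps-lite" reconciliation step.
    subprocess.run(
        ["docker", "compose", "up", "-d", "--remove-orphans"],
        cwd=repo, check=True,
    )

if __name__ == "__main__":
    deploy("/srv/my-service")  # path is illustrative
```

The nice property is that the compose file in the repo is the single source of truth: a push triggers the workflow, and the runner converges the host to whatever the file says.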
-
> GitHub stopped updating its own status page due to terrible availability ... 90.1% uptime - this means ... issues/degradations for ~2.5 hours daily ...

> GitHub struggles to keep up with the increase in load from AI agents generating more code and pull requests ... Claude Code bot contributions growth in the past 3 months has been enormous ... Stream of outages ...

https://lnkd.in/eYHzasTh
-
I honestly thought platforms of GitHub's scale would have solved the core scaling riddles by now. Turns out, even they are battling fundamental architectural demons.

The article dives into GitHub's recent availability woes: cascading failures, tight coupling, and an inability to shed load effectively. Imagine your critical CI/CD pipelines suddenly grinding to a halt because an authentication database choked. That's the kind of disruption they're talking about.

This isn't just another post-mortem. It's a stark reminder that rapid growth amplifies every architectural shortcut. What we often chalk up to "bad luck" in our own systems is frequently a predictable outcome of underlying structural choices. GitHub's candidness about "tight coupling" and "insufficient isolation" really hit home.

In digital financial services, I've seen how a seemingly minor dependency, like a third-party fraud detection API, can ripple through an entire transaction flow if not properly isolated and rate-limited. We often talk about microservices as the panacea, but if they're still tightly coupled at the data or authentication layer, you haven't solved the problem; you've just distributed it.

Their challenge with effective backpressure mechanisms and load shedding, especially as AI-driven tooling piles on, highlights that just throwing more compute at it isn't enough. You need smart circuit breakers and adaptive traffic management, not just reactive scaling.

The article also points out the disparity between official status pages and real-world developer experience, which resonates deeply with anyone who's ever debugged a "green" system.

How do you proactively build resilience against these kinds of systemic failures when your system is growing at warp speed? What's your secret sauce for true architectural isolation? https://lnkd.in/e3EmWfjz

#SystemArchitecture #Scalability #ReliabilityEngineering #DevOps
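The circuit breaker mentioned above is a concrete, well-known pattern. A minimal sketch in Python, purely illustrative (not GitHub's implementation; thresholds are made up):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a
    cool-down period instead of letting requests queue up behind it."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Shed load fast instead of piling onto a sick dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the breaker
        return result
```

The point isn't this exact class; it's that failing fast at the caller is what keeps one choked dependency (the authentication database in the article's example) from cascading into everything behind it.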
-
GitHub restructured its status page a while back, making it significantly harder to get a clear picture of platform-wide uptime. The unofficial reconstruction at The Missing GitHub Status Page (community-built, since GitHub won't publish an aggregate) reports 89.31% uptime as of this post. When your reliability is so bad the scoreboard has to be crowd-sourced, that's not a technical problem. 𝗧𝗵𝗮𝘁'𝘀 𝗮𝗻 𝗮𝗱𝗺𝗶𝘀𝘀𝗶𝗼𝗻.

Meanwhile:
* A Claude Code agent recently deleted a developer's production database and 2.5 years of volume snapshots.
* Amazon is holding emergency meetings about production outages, while spending $200B on AI infrastructure.

𝗧𝗵𝗶𝘀 𝗶𝘀𝗻'𝘁 𝗮 𝗯𝗮𝗱 𝗾𝘂𝗮𝗿𝘁𝗲𝗿. 𝗧𝗵𝗶𝘀 𝗶𝘀 𝗮 𝗽𝗮𝘁𝘁𝗲𝗿𝗻. Companies are shipping much more code without making matching investments in operational discipline.

𝗘𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝘀𝗼𝗳𝘁𝘄𝗮𝗿𝗲 𝗲𝘅𝗶𝘀𝘁𝘀 𝘁𝗼 𝘀𝗼𝗹𝘃𝗲 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀. You can't do that if your service is down, your CI tooling is broken again, or an AI agent just nuked your database.

𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝘀 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗳𝗲𝗮𝘁𝘂𝗿𝗲. If you're not measuring it, making business decisions with it, and saying "no" based on it, you're gambling with customer trust.

𝗪𝗲'𝘃𝗲 𝘁𝗼𝘁𝗮𝗹𝗹𝘆 𝗹𝗼𝘀𝘁 𝘁𝗵𝗲 𝗽𝗹𝗼𝘁.
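"Measuring it" can start as simply as an error budget. A quick sketch, pitting the community-reported 89.31% figure from the post against a hypothetical 99.9% SLO (both numbers are assumptions for illustration, not GitHub's actual targets):

```python
# Error-budget math: downtime the SLO allows vs. downtime observed.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo_pct: float) -> float:
    """Downtime allowed per 30-day window under the SLO."""
    return MINUTES_PER_MONTH * (1 - slo_pct / 100)

def budget_consumed(slo_pct: float, observed_pct: float) -> float:
    """Fraction of the error budget burned (1.0 = fully spent)."""
    downtime = MINUTES_PER_MONTH * (1 - observed_pct / 100)
    return downtime / error_budget_minutes(slo_pct)

slo, observed = 99.9, 89.31  # hypothetical SLO vs. community-reported uptime
print(f"budget: {error_budget_minutes(slo):.0f} min/month")        # ~43 min
print(f"burned: {budget_consumed(slo, observed):.0f}x the budget")  # ~107x
```

When the burn rate is in triple digits, "saying no based on it" stops being a judgment call.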
-
Seeing this reliability issue come up more and more is very concerning. Will we look back at the heady days of the early AI era as a time when so many people forgot the hard-learned lessons of running production systems?
-
Just shipped a full production-style web app deployment using Terraform + Ansible roles, and this one fought back. 💪

For this project, I provisioned an Azure Ubuntu VM with Terraform (Australia East) and used structured Ansible roles to deploy the EpicBook Node.js app end to end:

🔧 common — system prep, Node.js 18, PM2, SSH hardening
🗄️ mysql — MySQL install, database creation, root auth configuration
🌐 nginx — reverse proxy config via Jinja2 template, handler-driven reloads
📦 epicbook — git clone, npm install, DB seeding, PM2 app start

The project looked straightforward on paper. In practice, I ran into several distinct issues before getting a clean changed=0 run:

- The repo turned out to be a Node.js/Express app, not the static site I expected, so I had to pivot the nginx config from a static file server to a reverse proxy mid-assignment.
- The VM crashed during MySQL installation because Standard_B1s simply doesn't have enough resources for it; I scaled up and carried on.
- MySQL's default auth_socket on Ubuntu 22.04 broke idempotency on the root password task.
- Git's safe.directory security feature blocked re-runs after ownership changed to www-data.
- The dpkg lock from Ubuntu's background auto-updater stalled the first apt run.

Every single one of these is something you'd encounter in a real production deployment. That's the point.

The real milestone: running the playbook a second time and seeing ok=21 changed=0 failed=0 across the board. That's idempotency, and it's the difference between a script and a production-ready playbook (a sketch of the idea below).

🌍 Site live at: http://20.92.86.197
🔗 Medium post: https://lnkd.in/eqUiGcD8

Big shoutout to my mentors Pravin Mishra, lead co-mentor Praveen Pandey, and co-mentors Abhishek Makwana and Mobarak Hosen.

P.S. Part of the DevOps Micro Internship (DMI) Cohort 2. Join the community on Discord: https://lnkd.in/dQYAuQ7j

#Ansible #Terraform #Azure #DevOps #IaC #ConfigurationManagement #NodeJS #LearningInPublic #CloudEngineering #DMI
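That check-before-change discipline is what Ansible modules do under the hood. A minimal sketch of the idea in Python (illustrative of the pattern, not Ansible's actual module code; the path and config line are made up):

```python
# Idempotent "ensure a line is in a file", in the spirit of an Ansible
# task: inspect current state first, act only if it differs, and report
# changed/unchanged -- so a second run yields changed=0.
from pathlib import Path

def ensure_line(path: str, line: str) -> bool:
    """Ensure `line` exists in the file; return True if anything changed."""
    f = Path(path)
    current = f.read_text().splitlines() if f.exists() else []
    if line in current:
        return False  # desired state already holds: do nothing
    f.write_text("\n".join(current + [line]) + "\n")
    return True

if __name__ == "__main__":
    changed = ensure_line("/tmp/demo.conf", "PermitRootLogin no")  # illustrative
    print("changed" if changed else "ok (no change)")
    # Run it again: the second call reports no change -- idempotency.
```

The auth_socket and safe.directory failures above are exactly what happens when a task's "check current state" step makes an assumption the OS no longer satisfies.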