GitHub's 89.91% uptime: Exploring the challenges of scale

GitHub just achieved two 9's, in a way: 89.91% uptime. That is 95 incidents in the last 90 days, or roughly one incident per day. For reference, "three 9's" (99.9%) means about 8.76 hours of downtime per year. At 89.91%, GitHub is at roughly 884 hours of theoretical downtime per year.

So what's likely going on under the hood?

Actions & CI load
Half the internet is using GitHub Actions as its default CI, so the load has grown significantly. Now I get why some companies still prefer Jenkins.

Lots of dependencies
Git LFS, Packages, Copilot, Codespaces, Pages, and Actions are loosely coupled but share infrastructure. One wobble and the status page lights up.

MySQL at this scale
GitHub famously runs on a heavily customized MySQL stack. Keeping that healthy while the product keeps growing is very difficult in itself.

This shows how crucial the role of on-call engineers is, and that there are still things that are hard to automate.

There might be other reasons in play here as well, but these are my thoughts.

#DevOps #SRE #GitHub #BackendEngineering #SoftwareEngineering
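
For anyone who wants to check the downtime arithmetic, here is a minimal Python sketch. It assumes a 365-day year (8,760 hours) and treats the uptime figure as if it held for a full year; the function name `downtime_hours_per_year` is just an illustrative choice. Under those assumptions, 99.9% works out to about 8.76 hours, an even 90% gives 876 hours, and 89.91% gives closer to 884 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # Compare the "three 9's" reference, an even 90%, and the 89.91% figure.
    for label, uptime in [("three 9's", 99.9), ("even 90%", 90.0), ("reported", 89.91)]:
        hours = downtime_hours_per_year(uptime)
        print(f"{label:>10}: {uptime}% uptime -> {hours:.2f} hours of downtime per year")

    # Incident rate over the 90-day window mentioned in the post.
    print(f"incidents per day: {95 / 90:.2f}")
```

This is an extrapolation, not a measurement: a 90-day window scaled to a year only holds if the incident rate stays the same.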


