Automating GitHub Outage Detection

Background:

One of my applications wasn’t behaving as expected, and as part of the debugging process I had to run a workflow to validate a small fix. After spending nearly 10–15 minutes carefully reviewing logs and configuration, I finally applied the fix and prepared to rerun the GitHub workflow. But the workflow didn’t start.

I assumed something was wrong with my infrastructure, so I went deep — checking VM health, disk usage, networking, runner services, and agent status. Another 10–15 minutes disappeared just trying to understand why the runner wasn’t responding. Only then did it occur to me to check GitHub’s status page. And there it was — an ongoing GitHub incident affecting runners and workflows. Nothing in my system was broken at all.

In that moment, I realized that I had wasted almost half an hour debugging a problem that wasn’t even mine. That was the point where I decided this had to be automated, because no engineer should lose 10–15 minutes of focus just to discover that the platform itself is having a bad day.

Interestingly, GitHub already publishes real-time incident data: https://www.githubstatus.com/

So I decided to build the automation using GitHub Actions itself, which:

  • Runs on a schedule
  • Fetches unresolved incidents from GitHub’s status API
  • Filters only active problems
  • Formats a clean message
  • Pushes the alert to Slack

Below is the step-by-step process I followed to build the workflow:

Step 1 — Let GitHub Watch Itself

The first decision was where to run this. Instead of running it from a VM, serverless functions, or an external system, I chose to run it inside GitHub itself using GitHub Actions.

I scheduled the workflow to run every few minutes.

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:
        

This ensured I’d get notified quickly, without overwhelming Slack.


Step 2 — Fetch Live Incident Data

While exploring GitHub’s status page, I discovered something important. Behind the UI, GitHub exposes a public JSON API with unresolved incidents:

https://www.githubstatus.com/api/v2/incidents/unresolved.json
        

No authentication. No tokens. Just real-time platform health. Inside the workflow, I simply fetched it:

DATA=$(curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json)
        

At this point, I already had incident names, impact levels, affected components, and current status, all in one response.
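To make the response shape concrete, here is a trimmed, hypothetical payload (the incident values are made up, but the field names match what the endpoint actually returns) and a jq one-liner pulling out the interesting fields:

```shell
# Hypothetical trimmed payload standing in for the live API response;
# only the fields used later (id, name, status, impact) are shown
DATA='{"incidents":[{"id":"abc123","name":"Actions","status":"investigating","impact":"minor"}]}'

# One line per incident: name, status, impact
echo "$DATA" | jq -r '.incidents[] | "\(.name) | \(.status) | \(.impact)"'
```

For the payload above this prints a single `Actions | investigating | minor` line; a real response during an outage would contain one object per unresolved incident.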


Step 3 — Extract Only What Matters

Raw JSON is useful, but I didn’t want noise. I only cared about active incidents, affected services, and current status, so I filtered the response:

INCIDENTS=$(echo "$DATA" | jq '.incidents[] | {name, status, impact}')
        

If no incidents were present, the workflow simply exited quietly. No alert. No spam.
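The quiet exit is just a length check on the incidents array. A minimal sketch of that check, using an inline empty payload in place of the live curl call:

```shell
# Inline empty payload standing in for the live API response
DATA='{"incidents":[]}'

COUNT=$(echo "$DATA" | jq '.incidents | length')
if [ "$COUNT" -eq 0 ]; then
  # Nothing active: no alert, no Slack noise (the real workflow exits here)
  echo "No active incidents."
fi
```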


Step 4 — Persist Incident IDs Locally

Instead of reacting immediately, I first stored the current unresolved incident IDs in a file. This became my baseline for the next run.

echo "$DATA" | jq -r '.incidents[].id' > current_incidents.txt

Step 5 — Compare with the Previous Run

On the next iteration, the workflow compares which incidents existed before with which incidents exist now:

if [ -f previous_incidents.txt ]; then
  PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
else
  PERSISTENT=""
fi
        

This is the key logic. Only incidents that existed in the previous run and still exist now are considered real, persistent incidents. Everything else is ignored.
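The `comm -12` trick is easiest to see with two hypothetical runs' ID files (the inc-* IDs are made up):

```shell
# ID lists from two consecutive runs (hypothetical incident IDs)
printf 'inc-123\ninc-456\n' > previous_incidents.txt
printf 'inc-456\ninc-789\n' > current_incidents.txt

# -1 suppresses lines unique to the first file, -2 lines unique to the
# second, leaving only the intersection: IDs present in BOTH runs
PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
echo "$PERSISTENT"
```

Here inc-123 has resolved, inc-789 is brand new (it waits one more cycle for confirmation), and only inc-456 survives as a persistent incident worth alerting on.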


Step 6 — Alert Only When an Incident Persists

Only confirmed platform problems trigger a notification.

Now came the most important part: make the alert immediately useful. I built a clean message that shows:

  • Incident title
  • Current status
  • Impact level

MESSAGE=$'🚨 GitHub Incident Detected\n'

# Read one compact JSON object per line; plain word-splitting with $(...)
# would break on the spaces inside incident names
while read -r row; do
  NAME=$(echo "$row" | jq -r '.name')
  STATUS=$(echo "$row" | jq -r '.status')
  IMPACT=$(echo "$row" | jq -r '.impact')

  MESSAGE+="• $NAME — $STATUS ($IMPACT)"$'\n'
done < <(echo "$INCIDENTS" | jq -c '.')
        

The goal was simple:

In one glance, the team should know whether to debug… or wait.

Step 7 — Push It Directly to Slack

Finally, I sent the message to Slack using an incoming Slack webhook:

curl -X POST \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg text "$MESSAGE" '{text: $text}')" \
  "$SLACK_WEBHOOK_URL"
        

Now, whenever GitHub has an active incident, the team sees:

🚨 GitHub Incident Detected

• Actions — Degraded Performance (medium)
• Webhooks — Partial Outage (high)

🔗 https://www.githubstatus.com/
        

The Full Workflow (In One Place)

Here’s the complete version I ended up with:

name: GitHub Status Monitor

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:

jobs:
  monitor:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout (to persist incident state)
        uses: actions/checkout@v3

      - name: Check GitHub Status and notify Slack
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          set -e

          DATA=$(curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json)
          COUNT=$(echo "$DATA" | jq '.incidents | length')

          if [ "$COUNT" -eq 0 ]; then
            echo "No active incidents. Clearing state."
            rm -f previous_incidents.txt
            # Commit the cleared state, otherwise the next run's checkout
            # restores the stale file
            git config user.name "github-actions[bot]"
            git config user.email "github-actions[bot]@users.noreply.github.com"
            git add -A
            git commit -m "Clear incident state" || true
            git push || true
            exit 0
          fi

          # Store current incident IDs
          echo "$DATA" | jq -r '.incidents[].id' > current_incidents.txt

          # Compare with previous run
          if [ -f previous_incidents.txt ]; then
            PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
          else
            PERSISTENT=""
          fi

          # Alert only if incidents persist
          if [ -n "$PERSISTENT" ]; then
            MESSAGE=$'🚨 GitHub Incident Still Active\n\n'

            for ID in $PERSISTENT; do
              INCIDENT=$(echo "$DATA" | jq ".incidents[] | select(.id==\"$ID\")")

              NAME=$(echo "$INCIDENT" | jq -r '.name')
              STATUS=$(echo "$INCIDENT" | jq -r '.status')
              IMPACT=$(echo "$INCIDENT" | jq -r '.impact')

              MESSAGE+="• $NAME — $STATUS ($IMPACT)"$'\n'
            done

            MESSAGE+=$'\n'"🔗 https://www.githubstatus.com/"

            curl -X POST \
              -H "Content-Type: application/json" \
              -d "$(jq -n --arg text "$MESSAGE" '{text: $text}')" \
              "$SLACK_WEBHOOK_URL"
          else
            echo "Incident detected, but waiting for confirmation in next iteration."
          fi

          # Update baseline for next run
          mv current_incidents.txt previous_incidents.txt

          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add previous_incidents.txt
          git commit -m "Update incident state" || true
          git push || true
        

The Result

Since adding this:

  • No blind debugging
  • No unnecessary runner restarts
  • No wasted infrastructure checks
  • No false RCAs

Instead, the team now starts every incident with one simple question:

“Is GitHub healthy right now?”

And more often than not…

That answer saves us 10–15 minutes.

More articles by Vivek Thirumoorthy

Others also viewed

Explore content categories