Automating GitHub Outage Detection

Background:

One of my applications wasn’t behaving as expected, and as part of the debugging process I had to run a workflow to validate a small fix. After spending nearly 10–15 minutes carefully reviewing logs and configuration, I finally applied the fix and prepared to rerun the GitHub workflow. But the workflow didn’t start.

I assumed something was wrong with my infrastructure, so I went deep — checking VM health, disk usage, networking, runner services, and agent status. Another 10–15 minutes disappeared just trying to understand why the runner wasn’t responding. Only then did it occur to me to check GitHub’s status page. And there it was — an ongoing GitHub incident affecting runners and workflows. Nothing in my system was broken at all.

In that moment, I realized that I had wasted almost half an hour debugging a problem that wasn’t even mine. That was the point where I decided this had to be automated, because no engineer should lose 10–15 minutes of focus just to discover that the platform itself is having a bad day.

Interestingly, GitHub already publishes real-time incident data: https://www.githubstatus.com/

So I decided to build the automation using GitHub Actions itself, which:

  • Runs on a schedule
  • Fetches unresolved incidents from GitHub’s status API
  • Filters only active problems
  • Formats a clean message
  • Pushes the alert to Slack

Below is the step-by-step process I followed to build the workflow:

Step 1 — Let GitHub Watch Itself

The first decision was where to run this. Instead of running it from a VM, serverless functions, or an external system, I chose to run it inside GitHub itself using GitHub Actions.

I scheduled the workflow to run every few minutes.

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:
        

This ensured I’d get notified quickly, without overwhelming Slack.


Step 2 — Fetch Live Incident Data

While exploring GitHub’s status page, I discovered something important. Behind the UI, GitHub exposes a public JSON API with unresolved incidents:

https://www.githubstatus.com/api/v2/incidents/unresolved.json
        

No authentication. No tokens. Just real-time platform health. Inside the workflow, I simply fetched it:

DATA=$(curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json)
        

At this point, I already had incident names, impact levels, affected components, and current status, all in one response.
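To make the response shape concrete, here is a trimmed, hypothetical payload (the incident values are made up, but the field names match what the endpoint actually returns) and a jq one-liner pulling out the interesting fields:

```shell
# Hypothetical trimmed payload standing in for the live API response;
# only the fields used later (id, name, status, impact) are shown
DATA='{"incidents":[{"id":"abc123","name":"Actions","status":"investigating","impact":"minor"}]}'

# One line per incident: name, status, impact
echo "$DATA" | jq -r '.incidents[] | "\(.name) | \(.status) | \(.impact)"'
```

For the payload above this prints a single `Actions | investigating | minor` line; a real response during an outage would contain one object per unresolved incident.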


Step 3 — Extract Only What Matters

Raw JSON is useful, but I didn’t want noise. I only cared about active incidents, affected services, and current status, so I filtered the response:

INCIDENTS=$(echo "$DATA" | jq '.incidents[] | {name, status, impact}')
        

If no incidents were present, the workflow simply exited quietly. No alert. No spam.
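The quiet exit is just a length check on the incidents array. A minimal sketch of that check, using an inline empty payload in place of the live curl call:

```shell
# Inline empty payload standing in for the live API response
DATA='{"incidents":[]}'

COUNT=$(echo "$DATA" | jq '.incidents | length')
if [ "$COUNT" -eq 0 ]; then
  # Nothing active: no alert, no Slack noise (the real workflow exits here)
  echo "No active incidents."
fi
```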


Step 4 — Persist Incident IDs Locally

Instead of reacting immediately, I first stored the current unresolved incident IDs in a file. This became my baseline for the next run.

echo "$DATA" | jq -r '.incidents[].id' > current_incidents.txt

Step 5 — Compare with the Previous Run

On the next iteration, the workflow compares which incidents existed before with which incidents exist now:

if [ -f previous_incidents.txt ]; then
  PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
else
  PERSISTENT=""
fi
        

This is the key logic. Only incidents that existed in the previous run and still exist now are considered real, persistent incidents. Everything else is ignored.
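The `comm -12` trick is easiest to see with two hypothetical runs' ID files (the inc-* IDs are made up):

```shell
# ID lists from two consecutive runs (hypothetical incident IDs)
printf 'inc-123\ninc-456\n' > previous_incidents.txt
printf 'inc-456\ninc-789\n' > current_incidents.txt

# -1 suppresses lines unique to the first file, -2 lines unique to the
# second, leaving only the intersection: IDs present in BOTH runs
PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
echo "$PERSISTENT"
```

Here inc-123 has resolved, inc-789 is brand new (it waits one more cycle for confirmation), and only inc-456 survives as a persistent incident worth alerting on.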


Step 6 — Alert Only When an Incident Persists

Only confirmed platform problems trigger a notification.

Now came the most important part: make the alert immediately useful. I built a clean message that shows:

  • Incident title
  • Current status
  • Impact level

MESSAGE=$'🚨 GitHub Incident Detected\n'

# Read one compact JSON object per line; plain word-splitting with $(...)
# would break on the spaces inside incident names
while read -r row; do
  NAME=$(echo "$row" | jq -r '.name')
  STATUS=$(echo "$row" | jq -r '.status')
  IMPACT=$(echo "$row" | jq -r '.impact')

  MESSAGE+="• $NAME — $STATUS ($IMPACT)"$'\n'
done < <(echo "$INCIDENTS" | jq -c '.')
        

The goal was simple:

In one glance, the team should know whether to debug… or wait.

Step 7 — Push It Directly to Slack

Finally, I sent the message to Slack using an incoming Slack webhook:

curl -X POST \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg text "$MESSAGE" '{text: $text}')" \
  "$SLACK_WEBHOOK_URL"
        

Now, whenever GitHub has an active incident, the team sees:

🚨 GitHub Incident Detected

• Actions — Degraded Performance (medium)
• Webhooks — Partial Outage (high)

🔗 https://www.githubstatus.com/
        

The Full Workflow (In One Place)

Here’s the complete version I ended up with:

name: GitHub Status Monitor

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:

jobs:
  monitor:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout (to persist incident state)
        uses: actions/checkout@v3

      - name: Check GitHub Status and notify Slack
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          set -e

          DATA=$(curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json)
          COUNT=$(echo "$DATA" | jq '.incidents | length')

          if [ "$COUNT" -eq 0 ]; then
            echo "No active incidents. Clearing state."
            rm -f previous_incidents.txt
            # Commit the cleared state, otherwise the next run's checkout
            # restores the stale file
            git config user.name "github-actions[bot]"
            git config user.email "github-actions[bot]@users.noreply.github.com"
            git add -A
            git commit -m "Clear incident state" || true
            git push || true
            exit 0
          fi

          # Store current incident IDs
          echo "$DATA" | jq -r '.incidents[].id' > current_incidents.txt

          # Compare with previous run
          if [ -f previous_incidents.txt ]; then
            PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
          else
            PERSISTENT=""
          fi

          # Alert only if incidents persist
          if [ -n "$PERSISTENT" ]; then
            MESSAGE=$'🚨 GitHub Incident Still Active\n\n'

            for ID in $PERSISTENT; do
              INCIDENT=$(echo "$DATA" | jq ".incidents[] | select(.id==\"$ID\")")

              NAME=$(echo "$INCIDENT" | jq -r '.name')
              STATUS=$(echo "$INCIDENT" | jq -r '.status')
              IMPACT=$(echo "$INCIDENT" | jq -r '.impact')

              MESSAGE+="• $NAME — $STATUS ($IMPACT)"$'\n'
            done

            MESSAGE+=$'\n'"🔗 https://www.githubstatus.com/"

            curl -X POST \
              -H "Content-Type: application/json" \
              -d "$(jq -n --arg text "$MESSAGE" '{text: $text}')" \
              "$SLACK_WEBHOOK_URL"
          else
            echo "Incident detected, but waiting for confirmation in next iteration."
          fi

          # Update baseline for next run
          mv current_incidents.txt previous_incidents.txt

          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add previous_incidents.txt
          git commit -m "Update incident state" || true
          git push || true
        

The Result

Since adding this:

  • No blind debugging
  • No unnecessary runner restarts
  • No wasted infrastructure checks
  • No false RCAs

Instead, the team now starts every incident with one simple question:

“Is GitHub healthy right now?”

And more often than not…

That answer saves us 10–15 minutes.

More articles by Vivek Thirumoorthy

Others also viewed

Explore content categories