Automating GitHub Outage Detection
Background:
One of our applications wasn’t behaving as expected, and as part of the debugging process I had to run a workflow to validate a small fix. After spending nearly 10–15 minutes carefully reviewing logs and configuration, I finally applied the fix and prepared to rerun the GitHub workflow. But the workflow didn’t start.
I assumed something was wrong with my infrastructure, so I went deep — checking VM health, disk usage, networking, runner services, and agent status. Another 10–15 minutes disappeared just trying to understand why the runner wasn’t responding. Only then did it occur to me to check GitHub’s status page. And there it was — an ongoing GitHub incident affecting runners and workflows. Nothing in my system was broken at all.
In that moment, I realized I had wasted almost half an hour debugging a problem that wasn’t even mine. That was the point where I decided this had to be automated, because no engineer should lose 10–15 minutes of focus just to discover that the platform itself is having a bad day.
Interestingly, GitHub already publishes real-time incident data: https://www.githubstatus.com/
So I decided to build the automation with GitHub Actions itself.
Below is the step-by-step process I followed to build the workflow:
Step 1 — Let GitHub Watch Itself
The first decision was where to run this. Instead of a VM, serverless functions, or an external system, I chose to run it inside GitHub itself, using GitHub Actions.
I scheduled the workflow to run every few minutes.
on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:
This ensured I’d get notified quickly, without overwhelming Slack.
Step 2 — Fetch Live Incident Data
While exploring GitHub’s status page, I discovered something important. Behind the UI, GitHub exposes a public JSON API with unresolved incidents:
https://www.githubstatus.com/api/v2/incidents/unresolved.json
No authentication. No tokens. Just real-time platform health. Inside the workflow, I simply fetched it:
DATA=$(curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json)
At this point, I already had incident names, impact levels, affected components, and current status, all in one response.
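To see what those fields look like without hitting the network, here is a hypothetical payload mirroring the shape of the unresolved-incidents response (the sample values are made up; only the field names follow the API):

```shell
# Hypothetical sample of the unresolved-incidents payload;
# values are invented, field names match the real response.
DATA='{"incidents":[{"id":"abc123","name":"Actions is degraded","status":"investigating","impact":"minor"}]}'
echo "$DATA" | jq -r '.incidents[0].name'
# Actions is degraded
```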
Step 3 — Extract Only What Matters
Raw JSON is useful, but I didn’t want noise. I only cared about active incidents, affected services, and current status, so I filtered the response:
INCIDENTS=$(echo "$DATA" | jq '.incidents[] | {name, status, impact}')
If no incidents were present, the workflow simply exited quietly. No alert. No spam.
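The quiet-exit guard is just a length check. A minimal sketch, run here against a hypothetical empty payload rather than the live endpoint:

```shell
# Sketch of the quiet-exit guard; inside the workflow this branch
# would end with `exit 0` instead of just printing.
DATA='{"incidents":[]}'
COUNT=$(echo "$DATA" | jq '.incidents | length')
if [ "$COUNT" -eq 0 ]; then
  echo "No active incidents."
fi
```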
Step 4 — Persist Incident IDs Locally
Instead of reacting immediately, I first store the current unresolved incident IDs in a file. This becomes my baseline for the next run.
echo "$DATA" | jq -r '.incidents[].id' > current_incidents.txt
Step 5 — Compare with the Previous Run
On the next iteration, the workflow compares the incidents that existed before with the incidents that exist now:
if [ -f previous_incidents.txt ]; then
  PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
else
  PERSISTENT=""
fi
This is the key logic: only incidents that existed in the previous run and still exist now are considered real, persistent incidents. Everything else is ignored.
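The intersection step can be seen in isolation with two tiny files (`prev.txt` and `curr.txt` here stand in for the real state files): `comm -12` suppresses lines unique to either input and prints only the lines common to both sorted inputs.

```shell
# Standalone demo of the persistence check.
printf 'id-aaa\nid-bbb\n' > prev.txt   # stand-in for previous_incidents.txt
printf 'id-bbb\nid-ccc\n' > curr.txt   # stand-in for current_incidents.txt
comm -12 <(sort prev.txt) <(sort curr.txt)
# id-bbb   (the only incident ID present in both runs)
```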
Step 6 — Alert Only When an Incident Persists
Only confirmed platform problems should trigger an alert. Now came the most important part: making the alert immediately useful. I built a clean message that shows each incident’s name, status, and impact:
MESSAGE="🚨 GitHub Incident Detected"$'\n'
# read one compact JSON object per line; a plain for-loop would
# word-split incident names containing spaces
while read -r row; do
  NAME=$(echo "$row" | jq -r '.name')
  STATUS=$(echo "$row" | jq -r '.status')
  IMPACT=$(echo "$row" | jq -r '.impact')
  MESSAGE+="• $NAME — $STATUS ($IMPACT)"$'\n'
done < <(echo "$INCIDENTS" | jq -c '.')
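One bash quoting subtlety worth knowing for this step: inside double quotes, `\n` is a backslash followed by the letter n, not a newline. ANSI-C quoting (`$'\n'`) produces a real newline character, which Slack then renders as a line break:

```shell
# "\n" in double quotes is two literal characters;
# $'\n' is one real newline character.
A="line1\nline2"
B="line1"$'\n'"line2"
echo "${#A}"   # 12 (backslash and n count as two characters)
echo "${#B}"   # 11 (one real newline)
```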
The goal was simple:
In one glance, the team should know whether to debug… or wait.
Step 7 — Push It Directly to Slack
Finally, I sent the message to Slack using an incoming webhook:
curl -X POST \
-H "Content-Type: application/json" \
-d "$(jq -n --arg text "$MESSAGE" '{text: $text}')" \
"$SLACK_WEBHOOK_URL"
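The `jq -n --arg` invocation is what makes the payload safe: it wraps the message in a JSON object and escapes any quotes or special characters inside it. A quick look at what it emits (with `-c` added here for compact output):

```shell
# jq --arg escapes the message safely into a JSON string.
MESSAGE='🚨 GitHub Incident Detected'
jq -cn --arg text "$MESSAGE" '{text: $text}'
# {"text":"🚨 GitHub Incident Detected"}
```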
Now, whenever GitHub has an active incident, the team sees:
🚨 GitHub Incident Detected
• Actions — Degraded Performance (medium)
• Webhooks — Partial Outage (high)
🔗 https://www.githubstatus.com/
The Full Workflow (In One Place)
Here’s the complete version I ended up with:
name: GitHub Status Monitor

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:

permissions:
  contents: write   # required so the workflow can push the state file

jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout (to persist incident state)
        uses: actions/checkout@v3

      - name: Check GitHub Status and notify Slack
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          set -e
          DATA=$(curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json)
          COUNT=$(echo "$DATA" | jq '.incidents | length')

          if [ "$COUNT" -eq 0 ]; then
            echo "No active incidents. Clearing state."
            rm -f previous_incidents.txt
            exit 0
          fi

          # Store current incident IDs
          echo "$DATA" | jq -r '.incidents[].id' > current_incidents.txt

          # Compare with previous run
          if [ -f previous_incidents.txt ]; then
            PERSISTENT=$(comm -12 <(sort previous_incidents.txt) <(sort current_incidents.txt))
          else
            PERSISTENT=""
          fi

          # Alert only if incidents persist
          if [ -n "$PERSISTENT" ]; then
            MESSAGE="🚨 GitHub Incident Still Active"$'\n\n'
            for ID in $PERSISTENT; do
              INCIDENT=$(echo "$DATA" | jq ".incidents[] | select(.id==\"$ID\")")
              NAME=$(echo "$INCIDENT" | jq -r '.name')
              STATUS=$(echo "$INCIDENT" | jq -r '.status')
              IMPACT=$(echo "$INCIDENT" | jq -r '.impact')
              MESSAGE+="• $NAME — $STATUS ($IMPACT)"$'\n'
            done
            MESSAGE+=$'\n'"🔗 https://www.githubstatus.com/"
            curl -X POST \
              -H "Content-Type: application/json" \
              -d "$(jq -n --arg text "$MESSAGE" '{text: $text}')" \
              "$SLACK_WEBHOOK_URL"
          else
            echo "Incident detected, but waiting for confirmation in next iteration."
          fi

          # Update baseline for next run
          mv current_incidents.txt previous_incidents.txt
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add previous_incidents.txt
          git commit -m "Update incident state" || true
          git push || true
The Result
Since adding this, we no longer lose time debugging infrastructure that was never broken. Instead, the team now starts every incident with one simple question:
“Is GitHub healthy right now?”
And more often than not…
That answer saves us 10–15 minutes.
https://github.com/VivekInCloud/Github-Status-Monitor