AWS DevOps Agent
On March 31, 2026, AWS quietly made one of the biggest moves in SRE history. They shipped an AI agent that auto-investigates production incidents, identifies root causes with 94% accuracy, and reduces MTTR by 75%. It costs $30/hour only when it's working. It integrates with PagerDuty, Splunk, Datadog, Grafana, New Relic, GitHub, GitLab, and even Azure DevOps. This isn't a chatbot. It's an autonomous SRE teammate. Here's everything you need to know.
By Poojitha A S | #AWS #DevOpsAgent #SRE #AIOps #IncidentResponse #DevOpsMadeSimple
What Is AWS DevOps Agent?
AWS DevOps Agent is a generative AI-powered operations agent that does three things: investigates incidents automatically, prevents future incidents proactively, and handles on-demand SRE tasks across AWS, Azure, and on-prem environments.
When a CloudWatch alarm fires, a PagerDuty alert triggers, or a ServiceNow ticket lands, the agent starts investigating immediately. No human prompting. No waiting for someone to wake up. It pulls logs, checks recent deployments, diffs code changes, correlates metrics, and tells you what broke and why.
This isn't a dashboard that shows you data. It's an agent that does the investigation for you.
How It Actually Works
The architecture is elegant. Understanding it helps you decide if this fits your stack.
CONCEPT 1: Agent Spaces
An Agent Space is a logical container that defines what the agent can access. Think of it like a namespace for your AI SRE. You create Agent Spaces based on your operational model: one per team, one per application, or one per environment.
Each Agent Space contains your AWS account configurations, third-party tool integrations, and access permissions. The agent only sees what you put in the space. No accidental access to production secrets from a staging investigation.
CONCEPT 2: Topology Building
When you set up an Agent Space, AWS DevOps Agent automatically maps your infrastructure. It discovers resources, their relationships, and dependencies. Load balancer → target group → ECS service → RDS database → S3 bucket. The agent understands this chain before an incident even happens.
Why does this matter? When a database connection pool exhausts, the agent doesn't just tell you "RDS connections are high." It traces upstream: "Your payment-service pods are leaking connections because a code change in commit a3f7b2c removed the connection timeout, deployed 47 minutes ago via GitHub Actions."
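Conceptually, the topology map is a dependency graph, and tracing upstream is a graph walk. Here is a minimal Python sketch of the idea; the resource names and graph shape are hypothetical, made up to mirror the chain described above:

```python
# Hypothetical topology map: each resource lists the resources it depends on.
# Mirrors the chain from the text: ALB -> target group -> ECS service -> RDS/S3.
TOPOLOGY = {
    "alb/web":             ["tg/payments"],
    "tg/payments":         ["ecs/payment-service"],
    "ecs/payment-service": ["rds/payments-db", "s3/receipts"],
    "rds/payments-db":     [],
    "s3/receipts":         [],
}

def upstream_of(resource: str) -> list[str]:
    """Return every resource that directly or transitively depends on `resource`."""
    dependents = []
    for node, deps in TOPOLOGY.items():
        if resource in deps:
            dependents.append(node)
            dependents.extend(upstream_of(node))
    return dependents

# When an RDS alarm fires, the agent can immediately name the affected callers:
print(upstream_of("rds/payments-db"))
# ['ecs/payment-service', 'tg/payments', 'alb/web']
```

Because this map exists before the incident, the agent starts from "which services sit upstream of the alarming resource" instead of grepping blindly.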
CONCEPT 3: Three Modes of Operation
The agent operates in three modes, each solving a different SRE problem:
1. Investigations (Incident Response)
An alarm fires. The agent starts immediately. It pulls CloudWatch logs, queries Datadog metrics, checks GitHub for recent commits, scans CI/CD pipeline history, and correlates everything. Within minutes, it delivers: what broke, when it broke, what changed, and a suggested action plan. All evidence-backed. All human-verifiable.
# What the agent does automatically when your pager fires:
1. Receives alert (CloudWatch / PagerDuty / Dynatrace / webhook)
2. Identifies affected services from topology map
3. Pulls relevant logs (last 30 min around incident start)
4. Checks deployment history (what shipped recently?)
5. Diffs code changes in recent commits
6. Correlates metrics across dependent services
7. Generates root cause hypothesis with evidence
8. Suggests remediation action plan
9. Posts findings to Slack / ServiceNow
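The loop above can be sketched in a few lines of Python. The agent's real internals are not public, so every function, value, and commit hash here is a stand-in for illustration only:

```python
# Stand-in data sources; in the real agent these would be CloudWatch, GitHub, etc.
def pull_logs(service):       # step 3: logs around incident start
    return [f"{service}: ERROR connection pool exhausted"]

def recent_deploys(service):  # step 4: deployment history
    return [{"commit": "a3f7b2c", "age_min": 47}]

def diff_commit(sha):         # step 5: what changed in that commit
    return "removed connection timeout"

def investigate(service: str) -> dict:
    """Run the evidence-gathering steps in order and emit a findings report."""
    deploys = recent_deploys(service)
    newest = deploys[0]
    return {
        "service": service,
        "evidence": {"logs": pull_logs(service), "deploys": deploys,
                     "diff": diff_commit(newest["commit"])},
        "hypothesis": (f"Regression in {newest['commit']} "
                       f"({diff_commit(newest['commit'])}), "
                       f"deployed {newest['age_min']} min ago"),
    }

report = investigate("payment-service")
print(report["hypothesis"])
# Regression in a3f7b2c (removed connection timeout), deployed 47 min ago
```

The point of the sketch: each step produces evidence, and the hypothesis is assembled from that evidence, which is why the findings are human-verifiable.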
2. Evaluations (Incident Prevention)
Between incidents, the agent analyzes your historical patterns. It looks at past investigations, recurring issues, and infrastructure trends. Then it flags problems before they become incidents: "This ECS task definition has a memory limit that's been hit 3 times in the last 2 weeks. Recommend increasing from 512MB to 1024MB."
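An evaluation like the memory-limit example boils down to a threshold rule over historical events. A minimal sketch, with made-up event counts and service names:

```python
# Hypothetical evaluation rule: flag ECS services whose memory limit was hit
# repeatedly, mirroring the example in the text. All data here is invented.
OOM_HITS = {"payment-service": 3, "search-service": 1}   # hits in the last 2 weeks
MEMORY_MB = {"payment-service": 512, "search-service": 1024}

def evaluate(threshold: int = 2) -> list[str]:
    """Recommend a memory bump for any service over the OOM-hit threshold."""
    recs = []
    for svc, hits in OOM_HITS.items():
        if hits >= threshold:
            current = MEMORY_MB[svc]
            recs.append(f"{svc}: {hits} OOM hits; raise memory "
                        f"{current}MB -> {current * 2}MB")
    return recs

print(evaluate())
# ['payment-service: 3 OOM hits; raise memory 512MB -> 1024MB']
```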
3. On-Demand SRE Tasks (Chat)
Ask it anything. "What's the current health of my payment service?" "Show me the top 5 error-producing endpoints this week." "Create a custom chart of p99 latency for the last 7 days." It responds with data, charts, and actionable insights — not generic suggestions.
3 Modes: Investigate → Evaluate → On-Demand
What It Integrates With
This is where it gets serious. AWS DevOps Agent isn't AWS-only. It integrates with your entire stack:
Observability: CloudWatch, Datadog, Dynatrace, Grafana, New Relic, Splunk, Prometheus
Alerting: PagerDuty, CloudWatch Alarms, Dynatrace Problems, ServiceNow
Source Control: GitHub, GitLab, Azure DevOps
Communication: Slack, ServiceNow
Infrastructure: AWS (native), Azure (GA), on-prem (via MCP)
Extensibility: MCP (Model Context Protocol) for custom tools
Multicloud is real: The GA launch added Azure and on-prem support. If your company runs workloads across AWS and Azure, one DevOps Agent can investigate incidents across both. That's a first for any cloud provider's AI ops tool.
MCP: The Extensibility Play
This is the feature that separates DevOps Agent from a fancy dashboard. Model Context Protocol (MCP) lets you connect any internal tool to the agent.
Got a homegrown feature flag service? Write an MCP server wrapper. Now the agent can check feature flag states during investigations. Got a legacy deployment system with a weird API? MCP wrapper. Got an internal runbook database? MCP wrapper.
The agent discovers MCP capabilities at runtime. You register the server, define what it can do, and the agent calls it when relevant during an investigation.
# Example: MCP server manifest for a custom deploy tracker
{
  "name": "internal-deploy-tracker",
  "description": "Checks deployment history from internal system",
  "tools": [
    {
      "name": "get_recent_deploys",
      "description": "Returns deployments in the last N hours",
      "inputSchema": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "hours": { "type": "number", "default": 24 }
        }
      }
    }
  ]
}
Security warning: AWS recommends only exposing read-only MCP tools. Write operations introduce prompt injection risks. Your MCP server should query data, not modify infrastructure. Let humans make the changes.
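As a concrete illustration of the read-only pattern, here is a hypothetical Python function that could back the `get_recent_deploys` tool declared above. The deploy log, timestamps, and the fixed "now" clock are all invented for the demo; the only real constraint it demonstrates is querying data without mutating anything:

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical internal deploy log the MCP server would query (read-only).
DEPLOY_LOG = [
    {"service": "payment-service", "commit": "a3f7b2c",
     "ts": "2026-03-31T13:15:00+00:00"},
    {"service": "payment-service", "commit": "9d01e44",
     "ts": "2026-03-29T09:40:00+00:00"},
]

def get_recent_deploys(service_name: str, hours: float = 24) -> str:
    """Return deployments for a service within the last N hours, as JSON text."""
    now = datetime(2026, 3, 31, 14, 0, tzinfo=timezone.utc)  # fixed clock for demo
    cutoff = now - timedelta(hours=hours)
    recent = [d for d in DEPLOY_LOG
              if d["service"] == service_name
              and datetime.fromisoformat(d["ts"]) >= cutoff]
    return json.dumps(recent)  # queries data only; never modifies infrastructure

print(get_recent_deploys("payment-service"))
```

Only the deploy from 13:15 falls inside the 24-hour window, so the agent would see exactly one candidate change to correlate against the incident.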
The Pricing: $30/Hour But Only When Working
Here's the pricing model that every SRE leader needs to understand:
$0.0083 per agent-second of active work (~$29.88/hour)
You only pay when the agent is actively working. It's billed per second. If the agent investigates an incident for 12 minutes, you pay $5.98. If it sits idle for 23 hours, you pay $0. No upfront commitments. No monthly minimums.
Three billable activities: investigations, evaluations, and on-demand tasks.
AWS Support Credits
If you already pay for AWS Support, you get credits:
Unified Operations: 100% of support charge
Enterprise Support: 75% of support charge
Business Support: 30% of support charge
If your Enterprise Support bill is $20K/month, you get $15K in DevOps Agent credits. That's 500 hours of agent time per month — essentially free for most teams.
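The credit math is easy to sanity-check (the article's 500 hours is this calculation in round numbers):

```python
# Credit rates from the table above; agent rate from the pricing section.
CREDIT_RATE = {"unified_operations": 1.00, "enterprise": 0.75, "business": 0.30}
AGENT_RATE_PER_HOUR = 0.0083 * 3600  # $29.88/hour

def free_agent_hours(monthly_support_bill: float, tier: str) -> float:
    """Agent-hours per month covered by the support credit."""
    credit = monthly_support_bill * CREDIT_RATE[tier]
    return credit / AGENT_RATE_PER_HOUR

# $20K/month Enterprise Support -> $15K credit -> ~502 agent-hours/month
print(round(free_agent_hours(20_000, "enterprise")))
```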
Free Trial
New customers get a 2-month free trial: 10 agent spaces, 20 hours of investigations, 15 hours of evaluations, and 20 hours of on-demand tasks per month. That's enough to prove the value before committing budget.
Cost math that matters: An average production incident takes an SRE 2-4 hours to investigate manually. At a senior SRE salary of $180K/year (~$90/hour fully loaded), that's $180-$360 per incident. The agent does it in 12-20 minutes for $6-$10. If you have 10 incidents per month, the math isn't close.
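The back-of-envelope comparison above, as a few lines of Python (salary and duration figures are the article's estimates, not measurements):

```python
AGENT_RATE_PER_SEC = 0.0083  # from the pricing section
SRE_RATE_PER_HOUR = 90       # ~$180K/year fully loaded, per the article

def agent_cost(minutes: float) -> float:
    """Cost of an agent investigation billed per second."""
    return round(minutes * 60 * AGENT_RATE_PER_SEC, 2)

def human_cost(hours: float) -> float:
    """Cost of a manual investigation at the loaded SRE rate."""
    return round(hours * SRE_RATE_PER_HOUR, 2)

print(agent_cost(12), human_cost(2))   # 5.98 vs 180.0 per incident
```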
The Real Question: Does It Replace SREs?
No. And here's why.
The agent investigates. It correlates. It hypothesizes. But it doesn't fix anything. It doesn't push code. It doesn't roll back deployments. It doesn't make judgment calls about whether to wake up the VP of Engineering at 3 AM.
What it does is eliminate the first 45 minutes of every incident — the "what's going on?" phase where you're pulling logs, checking dashboards, asking teammates if they deployed something. The agent hands you the answer. You decide what to do with it.
The SRE role shifts from "detective + fixer" to "decision-maker + fixer." You skip the investigation. You start at the solution.
45 minutes: average investigation time eliminated per incident
Getting Started: Your First Agent Space
STEP 1: Activate the Free Trial
Go to the AWS Console → search "DevOps Agent" → activate the free trial. No credit card beyond your existing AWS account.
STEP 2: Create Your First Agent Space
Name it after your team or application. Add your AWS account. The agent immediately starts building your infrastructure topology — discovering resources and mapping dependencies.
STEP 3: Connect Your Observability Stack
Add integrations one at a time. Start with CloudWatch (already connected if you're on AWS). Then add your primary observability tool — Datadog, Splunk, Grafana, or New Relic. Each integration takes minutes.
STEP 4: Connect Your Alert Source
Hook up PagerDuty or your existing alarm system via webhook. Now when an incident fires, the agent starts investigating automatically. This is the moment it becomes real.
# PagerDuty webhook configuration
# In your Agent Space settings:
Event Source: PagerDuty
Webhook URL: (auto-generated by AWS)
Events: incident.triggered, incident.acknowledged
# CloudWatch Alarm integration
# Already native — just select which alarms trigger investigations
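On the receiving side, the agent only needs to decide whether an incoming event is one it should act on. A minimal sketch of that filter; the payload below follows the general shape of PagerDuty's v3 webhooks, but the exact field names are an assumption and should be checked against the live API:

```python
import json

# Illustrative PagerDuty-style webhook payload (shape assumed, not verified).
PAYLOAD = json.loads("""
{
  "event": {
    "event_type": "incident.triggered",
    "data": { "title": "payment-service 5xx spike" }
  }
}
""")

# The two event types configured in the Agent Space settings above.
HANDLED = {"incident.triggered", "incident.acknowledged"}

def should_investigate(payload: dict) -> bool:
    """True if this webhook event should kick off an investigation."""
    return payload["event"]["event_type"] in HANDLED

print(should_investigate(PAYLOAD))  # True
```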
STEP 5: Connect Source Control
Link GitHub or GitLab. This is critical — the agent needs to see recent commits and CI/CD pipeline runs to correlate deployments with incidents. Without this, it can tell you what broke but not why.
STEP 6: Test With a Real Incident
Don't wait for production to break. Trigger a test alert manually. Watch the agent investigate. Read the findings. Evaluate the accuracy. Tune your integrations based on what it found (or missed).
Your Action: Activate the 2-month free trial today. Create one Agent Space for your most critical production application. Connect CloudWatch + one observability tool + GitHub. Wait for the next incident or trigger a test alert. Read the agent's investigation report. If 94% accuracy holds for your environment, present the ROI math to your manager this week: $6 per investigation vs. $300 per manual investigation. The numbers sell themselves.
The Takeaway
AWS DevOps Agent is the first cloud-native AI agent purpose-built for SRE work that actually ships in production. It investigates incidents on its own, it's billed per second, and it integrates with everything you already use.
The 75% MTTR reduction isn't marketing. It's the result of eliminating the investigation phase: the part where humans stare at dashboards, grep through logs, and ask "did anyone deploy something?" The agent does that in minutes, with 94% accuracy, for $6.
Start the free trial. Run it alongside your existing process. Compare the agent's findings with your team's investigations. Let the data decide.
© 2026 DevOps Made Simple Newsletter. All rights reserved.