Can AI agents fully replace on-call engineers today?

AI agents handle routine incidents and surface structured proposals for complex ones. Engineers remain responsible for judgment, approval, and higher-risk decisions.

How do AI agents handle incidents they haven't seen before?

ML-based correlation models flag activity that deviates from learned behavior, and LLMs generate cause hypotheses from unstructured log data. Both handle novel failures better than rule-based systems, which have no pre-existing rules for new failure modes. Accuracy degrades with system complexity.

What happens when the AI agent's root cause analysis is wrong?

Production architectures mitigate this by separating deterministic RCA (dependency graph traversal, metric anomaly detection) from generative interpretation, treating LLM outputs as hypotheses. Capability contracts limit blast radius by scoping what actions an agent can take even with incorrect analysis.

Does AI incident response create new operational overhead?

Yes. Agents remove some response work, but new categories appear: supervising agent output quality, maintaining models, operating the infrastructure beneath the agents, and reviewing decisions that previously did not require review. Teams should budget for this overhead when estimating ROI.

What maturity level does my organization need before adopting AI incident response?

Organizations need telemetry covering the services under investigation, data that agents can query consistently, and operating practices such as approval boundaries and rollback plans. DORA 2025 found that AI adoption has a negative relationship with stability when mature controls such as automated testing, version control, and fast feedback loops are missing.

AI Agent Incident Response: From Alert to Fix

The AI agent incident response approach is a multi-agent automation pipeline. Specialized agents coordinate detection, triage, root cause analysis, remediation proposals, and human escalation with scoped responsibilities.

TL;DR

On-call engineers spend 30+ minutes stitching context across fragmented tools, while ignored alerts still cause outages. Conventional response fails because one human remains the integration layer. In Augment's own internal use, a Cosmos Incident Investigator Expert cut human on-call investigation effort by roughly 81%. Specialized agents pass triage, investigation, and escalation work forward with capability contracts and shared memory.

On-call incident response breaks down when engineers must move across PagerDuty, Slack, Grafana, CloudWatch, GitHub, deployment history, and past incidents before they can decide whether an alert is real. That tool-hopping problem shows up first in triage, where alert fatigue and outage risk compound.

AI agent incident response changes that operating model by splitting triage, investigation, and escalation across specialized stages. The sections below explain how the five-stage pipeline works, how alert triage and cross-service root cause analysis improve signal quality, and where supervised automation still needs human governance.

Augment Cosmos, the unified cloud agents platform, runs long-running Experts across the software development lifecycle.

Why On-Call Incident Response Remains a Manual Workflow

Manual on-call incident response persists because one engineer still has to aggregate signals across multiple tools before any decision can be made. That work adds delay and coordination overhead to every incident.

The Google SRE Workbook caps sustainable on-call load at no more than two incidents per shift, which leaves enough time to follow up on each one. Observability surveys show what happens when alert volume, tool fragmentation, and response coordination outgrow the people assigned to manage them.

The Splunk State of Observability 2025 survey quantifies the damage through three findings:

43% of respondents spend too much time responding to alerts
73% have experienced outages caused by ignored or suppressed alerts
20% often or always convene multi-team war rooms until an issue resolves

The LeadDev Engineering Leadership Report 2025, surveying 617 engineering leaders and developers, reported rising engineering burnout. Layoffs and expanded scope compound on-call load when headcount decreases but system complexity stays constant.

Three structural problems explain why manual response persists:

Problem	Mechanism
Swivel-chair workflow	Engineers manually switch between monitoring, logging, tracing, and communication tools to aggregate context. Each context switch adds minutes to resolution
Static threshold scaling	Thresholds calibrated for smaller environments generate disproportionately more noise as systems grow, because nothing recalibrates them automatically
Coordination overhead	Traditional on-call tooling makes engineers find who is on call, open dashboards, and assemble responders before troubleshooting even begins

Cosmos's Incident Investigator Expert records and routes incident evidence across tools. This removes much of the manual tool-switching that slows triage.

What an AI Agent Incident Response Pipeline Does

An AI agent incident response pipeline structures the incident lifecycle into five distinct stages. Specialized agents pass evidence and decisions forward, so the on-call engineer does not have to juggle every step at once.

Stage 1: Detection. Event-driven triggers initiate incident workflows from monitoring systems. No manual kickoff required.

Stage 2: Triage. An agent deduplicates alerts, correlates related signals across services, and classifies severity.

Stage 3: Investigation. An investigation agent pulls logs, queries metrics, traces service dependencies, and correlates recent deployments with alert onset times.

Stage 4: Remediation proposal. An agent proposes a fix, such as a rollback, restart, or config change. The proposal includes the rationale for the action.

Stage 5: Escalation. For high-risk decisions, the system routes to a human after previous agents assemble the context. The on-call engineer receives a structured proposal with the relevant evidence already included.

The five-stage pipeline directly addresses the swivel-chair problem: each agent owns one domain and passes structured findings to the next.
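The handoff between stages can be sketched as a chain of functions that each read and extend one shared context object. This is a minimal illustration, not how any particular product implements the pipeline: the `IncidentContext` fields, the 5% error-rate rule, and the stage functions are all hypothetical, and detection is modeled simply as the event that calls `run_pipeline`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IncidentContext:
    """Structured findings handed from one stage to the next."""
    alert: dict
    severity: str = "unknown"
    evidence: List[str] = field(default_factory=list)
    proposal: str = ""
    needs_human: bool = False


def triage(ctx: IncidentContext) -> IncidentContext:
    # Hypothetical rule: an error rate above 5% is classified as high severity.
    ctx.severity = "high" if ctx.alert.get("error_rate", 0.0) > 0.05 else "low"
    return ctx


def investigate(ctx: IncidentContext) -> IncidentContext:
    # A real investigation agent would query logs, metrics, and deploy history.
    deploy = ctx.alert.get("recent_deploy")
    if deploy:
        ctx.evidence.append(f"deploy {deploy} preceded alert onset")
    return ctx


def propose(ctx: IncidentContext) -> IncidentContext:
    # The proposal carries its rationale so a reviewer can check it.
    if ctx.evidence:
        ctx.proposal = "rollback: " + ctx.evidence[0]
    else:
        ctx.proposal = "no action: insufficient evidence"
    return ctx


def escalate(ctx: IncidentContext) -> IncidentContext:
    # High-severity incidents go to a human with the context already assembled.
    ctx.needs_human = ctx.severity == "high"
    return ctx


STAGES: List[Callable[[IncidentContext], IncidentContext]] = [
    triage, investigate, propose, escalate,
]


def run_pipeline(alert: dict) -> IncidentContext:
    """Run every stage in order, passing the same context forward."""
    ctx = IncidentContext(alert=alert)
    for stage in STAGES:
        ctx = stage(ctx)
    return ctx


result = run_pipeline(
    {"service": "checkout", "error_rate": 0.12, "recent_deploy": "v42"}
)
print(result.severity, result.needs_human, result.proposal)
```

Because each stage only reads and writes the shared context, a team could replace one stage (for example, a different investigation agent) without touching the others.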

Alert Triage: Separating Noise from Signal

Alert triage through AI agents turns raw monitoring events into incidents that engineers can act on. The process normalizes, deduplicates, suppresses, correlates, enriches, and routes signals before they reach the on-call engineer.

Step 1: Ingest and normalize. The triage agent collects raw events from monitoring tools and normalizes them to a common schema. This creates a consistent basis for deduplication and correlation.

Step 2: Deduplicate. PagerDuty implements deduplication through dedup keys, where alerts sharing a key merge automatically before reaching any on-call engineer.

Step 3: Suppress. Suppression filters non-actionable alert categories before incident creation. PagerDuty's Auto-Pause Incident Notifications temporarily pauses incident notifications for transient alerts. This gives transient alerts time to resolve before PagerDuty creates an incident and notifies responders.

Step 4: Correlate and group. ML models cluster related alerts by time co-occurrence, textual similarity, topological proximity, and historical co-occurrence patterns into single incidents. A single database exhausting connections can trigger latency alerts on Service A, memory alerts on Host B, and error rate alerts on Service C. Without cross-service correlation, these surface as three separate incidents for three separate teams.

Step 5: Enrich. Incidents receive deployment history, service ownership, runbook links, and past similar incidents.

Step 6: Score and route. Anomaly scores and impact assessments determine urgency and routing.
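Steps 1 and 2 above can be sketched in a few lines. The field names (`svc`, `alert_name`, `ts`) and the key format are assumptions made for illustration; real monitoring tools emit different payloads, and this is not PagerDuty's implementation.

```python
from typing import Dict, Iterable, List


def normalize(raw: dict) -> dict:
    """Map tool-specific field names onto one common schema."""
    return {
        "service": raw.get("service") or raw.get("svc") or "unknown",
        "check": raw.get("check") or raw.get("alert_name") or "unknown",
        "timestamp": float(raw.get("timestamp") or raw.get("ts") or 0.0),
    }


def dedup_key(event: dict) -> str:
    # Hypothetical key: same service and same check count as the same alert.
    return f"{event['service']}:{event['check']}"


def deduplicate(raw_events: Iterable[dict]) -> List[dict]:
    """Merge events sharing a key, keeping the first timestamp and a count."""
    merged: Dict[str, dict] = {}
    for raw in raw_events:
        event = normalize(raw)
        key = dedup_key(event)
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**event, "count": 1}
    return list(merged.values())


events = [
    {"service": "api", "check": "latency", "timestamp": 100},
    {"svc": "api", "alert_name": "latency", "ts": 130},   # same alert, other tool
    {"svc": "db", "alert_name": "connections", "ts": 131},
]
alerts = deduplicate(events)
print(alerts)
```

Normalizing before deduplicating matters here: the first two events only share a key once their field names have been mapped onto the same schema.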

PagerDuty says its ML-based alert grouping uses prior incident data, human interaction, selected fields, textual similarity, and service behavior. Prioritizing context over statistical cleverness reflects a practical point: purely statistical grouping misses causally relevant information that humans use when triaging.
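A much simpler heuristic than a trained model shows why topology and timing together collapse the database example from Step 4 into one incident. The sketch below is a greedy rule, not PagerDuty's algorithm: the `CALLS` map, the 120-second window, and the "shared dependency" test are all assumptions.

```python
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the components it depends on.
CALLS: Dict[str, Set[str]] = {
    "service-a": {"db"},
    "service-c": {"db"},
    "host-b": {"db"},
    "billing": set(),
}


def related(s1: str, s2: str) -> bool:
    """Related means: identical, directly connected, or sharing a dependency."""
    if s1 == s2 or s2 in CALLS.get(s1, set()) or s1 in CALLS.get(s2, set()):
        return True
    return bool(CALLS.get(s1, set()) & CALLS.get(s2, set()))


def group_alerts(alerts: List[dict], window_s: float = 120.0) -> List[List[dict]]:
    """Greedy grouping: join the first group holding a related, recent alert."""
    groups: List[List[dict]] = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for group in groups:
            if any(
                related(alert["service"], other["service"])
                and alert["timestamp"] - other["timestamp"] <= window_s
                for other in group
            ):
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new incident.
            groups.append([alert])
    return groups


incoming = [
    {"service": "service-a", "kind": "latency", "timestamp": 0.0},
    {"service": "host-b", "kind": "memory", "timestamp": 20.0},
    {"service": "service-c", "kind": "error_rate", "timestamp": 45.0},
    {"service": "billing", "kind": "disk", "timestamp": 50.0},
]
groups = group_alerts(incoming)
print([[a["service"] for a in g] for g in groups])
```

The three alerts that share the database dependency land in one group, while the unrelated billing alert stays separate even though it fired in the same window.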

Runbooks, logs, and traces add incident context beyond alert text and grouping patterns.

Root Cause Analysis Across Distributed Services

Root cause analysis across distributed services identifies likely origin points by traversing service dependencies and causal relationships. This separates upstream causes from downstream symptoms.

Dependency Graph Traversal

Dependency graph traversal follows service relationships in trace-derived graphs to localize likely origin points.

Two graph types are often conflated. A service dependency graph captures which services invoke which other services, and teams can derive it from distributed trace data. A causal graph captures functional dependencies, including how load or latency changes in one service propagate to others. Causal graphs can also capture relationships among co-located services, where shared infrastructure creates correlations that correlation-based systems incorrectly model as service dependencies.
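As a rough illustration of traversal over a service dependency graph, the sketch below walks from the alerting service toward its dependencies and reports anomalous services that have no anomalous dependency of their own. The `DEPENDS_ON` graph and the set of anomalous services are hypothetical inputs; in practice they would come from trace data and metric anomaly detection. It does not model the causal-graph case described above.

```python
from typing import Dict, List, Set

# Hypothetical call graph derived from traces: caller -> callees.
DEPENDS_ON: Dict[str, List[str]] = {
    "frontend": ["checkout"],
    "checkout": ["payments", "catalog"],
    "payments": ["db"],
    "catalog": [],
    "db": [],
}


def likely_origins(start: str, anomalous: Set[str]) -> List[str]:
    """Return anomalous services reachable from start whose callees look healthy."""
    origins: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        callees = DEPENDS_ON.get(service, [])
        # An anomalous service with no anomalous callee is a candidate origin;
        # anomalous services above it are treated as downstream symptoms.
        if service in anomalous and not any(c in anomalous for c in callees):
            origins.append(service)
        stack.extend(callees)
    return sorted(origins)


print(likely_origins("frontend", {"frontend", "checkout", "payments", "db"}))
```

In this example the frontend, checkout, and payments anomalies are all explained by the database, so only `db` is reported as a likely origin.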

How LLMs Contribute to Root Cause Analysis

LLMs contribute to root cause analysis by interpreting unstructured operational text and generating cause hypotheses. This extends deterministic observability systems with language-level pattern recognition and broader failure interpretation.

LLMs serve two mechanistically distinct roles in production RCA systems:

Semantic log interpretation: Parsing logs, traces, and stack traces as language with structure, intent, and causality, useful for extracting signal from unstructured text that defeats regex-based approaches
Hypothesis generation: Producing cause candidates and failure theories, such as "Payment latency likely caused by Catalog deploy at 14:03 UTC (confidence 0.74)"

RAG retrieves documents; topology graphs model the relationships between services.

Augment Code's Context Engine provides architectural-level understanding by semantically indexing and mapping entire codebases across 400,000+ files, the kind of cross-repository reasoning that separates AI coding tools for complex codebases from keyword search. It also understands relationships through dependency-style graph analysis. Academic research on the PRAXIS approach uses both service dependency graphs and program dependency graphs to extend RCA beyond observability symptoms and localize faults to relevant code regions and dependency paths.

The core RCA components play different roles, so production systems need them in combination:

Component	Primary role	Best at	Limitation
Service dependency graph	Captures which services invoke which other services	Following service relationships and separating upstream causes from downstream symptoms	Does not by itself capture all functional dependencies
Causal graph	Captures functional dependencies, including how load or latency changes propagate	Modeling causal relationships that simple service maps can miss	Teams can confuse causal graphs with service dependency graphs when they treat correlation as dependency
RAG	Retrieves documents	Supplying textual evidence such as logs, runbooks, and past incident material	Does not define service relationships or blast radius
LLM semantic interpretation	Interprets unstructured operational text	Extracting signal from logs, traces, and stack traces that defeats regex-based approaches	Produces interpretation that still needs deterministic verification
LLM hypothesis generation	Produces cause candidates and failure theories	Generating candidate explanations for likely causes	Outputs should be treated as hypotheses

Engineers evaluating dependency-aware tooling can compare options for mapping tools and breaking changes when deciding how much architectural context incident automation should ingest.

Known Failure Modes

Known failure modes in AI-driven RCA define reliability limits because missing telemetry, sampled data, and unconstrained generation can distort the chain of evidence.

Senior engineers should account for these documented failure modes when evaluating AI-driven RCA:

Failure Mode	Impact
LLM hallucination in unconstrained agent frameworks	High
Uninstrumented services can create gaps in the trace DAG, limiting root-cause analysis to the nearest instrumented upstream caller	High
Traces and logs are sampled independently, so trace-log correlation can refer to a trace that was never ingested	Medium
Some evaluations suggest LLM performance can become harder to assess as tasks and evaluation setups grow more complex	Medium
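The sampling gap in the table above is straightforward to detect before an agent relies on trace-log correlation. A minimal sketch, assuming log records carry an optional `trace_id` field and that the set of ingested trace IDs is available (both are assumptions about the telemetry backend):

```python
from typing import Iterable, List, Set


def orphaned_log_refs(
    log_lines: Iterable[dict], ingested_trace_ids: Set[str]
) -> List[dict]:
    """Return log records whose trace_id points at a trace that was never ingested."""
    return [
        line
        for line in log_lines
        if line.get("trace_id") and line["trace_id"] not in ingested_trace_ids
    ]


logs = [
    {"msg": "timeout calling db", "trace_id": "t1"},
    {"msg": "retry exhausted", "trace_id": "t2"},  # trace t2 was sampled out
    {"msg": "health check ok"},                    # no trace reference at all
]
missing = orphaned_log_refs(logs, ingested_trace_ids={"t1"})
print(missing)
```

Surfacing the orphaned references lets the investigation report a gap in the evidence instead of reasoning silently over an incomplete trace.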

The technically defensible architecture separates RCA into a deterministic layer and a generative layer. The deterministic layer covers causal graph construction, dependency traversal, and metric anomaly detection. The generative layer covers LLM interpretation, human-readable summaries, and remediation hypotheses. Deterministic analysis produces verifiable findings; generative analysis adds interpretability. SREs should treat outputs from the generative layer as hypotheses, validating them with the same rigor teams apply when they evaluate agent output before automating action.
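One way to picture the split is a filter in which generative hypotheses survive only if the deterministic layer independently flagged the same service. The sketch below is illustrative: the `Hypothesis` fields and the rule "the suspect must be in the anomalous set" are assumptions, and the confidence values are model-reported scores rather than calibrated probabilities.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Hypothesis:
    suspect_service: str
    summary: str
    confidence: float  # model-reported score, not a calibrated probability


def verify(hypotheses: List[Hypothesis], anomalous: Set[str]) -> List[Hypothesis]:
    """Keep only hypotheses whose suspect the deterministic layer also flagged."""
    supported = [h for h in hypotheses if h.suspect_service in anomalous]
    return sorted(supported, key=lambda h: h.confidence, reverse=True)


candidates = [
    Hypothesis("catalog", "Payment latency likely caused by Catalog deploy", 0.74),
    Hypothesis("search", "Search index rebuild saturated a shared cache", 0.81),
]
# Deterministic findings (hypothetical): only these services show metric anomalies.
verified = verify(candidates, anomalous={"catalog", "payments"})
print([h.suspect_service for h in verified])
```

The higher-scoring `search` hypothesis is dropped because no deterministic evidence supports it, so a confident-sounding explanation cannot override the verifiable findings.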

Cosmos's Incident Investigator applies this deterministic-plus-generative split to your own services, correlating logs, metrics, and recent deploys to localize root cause before a human opens the alert.

Fix Proposals and Human Escalation

Fix proposals and human escalation keep automated remediation useful by bounding action scope and routing higher-risk decisions for review. This limits operational risk in production systems.

Production incident response systems execute a defined set of actions: restarting a hung instance, scaling up a service under load, clearing application cache, collecting diagnostic bundles, and rolling back a deployment.

When to Auto-Apply vs. Require Human Review

The boundary between auto-apply and human review depends on action scope and confidence. Production guidance places human checkpoints around higher-risk containment and eradication steps.

For compliance-conscious teams, a common recommendation is to place human checkpoints between triage and containment, and again between containment and eradication of production assets.

A practical boundary follows two patterns. Supervised automation means the agent suggests fixes and executes only after human approval, which is appropriate for clearing application cache, rebooting hung instances, scaling services under load, and collecting diagnostic bundles. Conditional full automation means teams gate full automation behind confidence thresholds and action scope.
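The two patterns reduce to a small gate: an action runs without approval only if the team has explicitly opted it into automation and the confidence score clears a threshold. The action names, the opt-in set, and the 0.9 threshold below are hypothetical policy values that each team would set for itself.

```python
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    AUTO_APPLY = "auto_apply"
    HUMAN_REVIEW = "human_review"


def decide(
    action: str,
    confidence: float,
    auto_allowed: FrozenSet[str],
    threshold: float = 0.9,
) -> Decision:
    """Auto-apply only in-scope actions that clear the confidence threshold."""
    if action in auto_allowed and confidence >= threshold:
        return Decision.AUTO_APPLY
    # Everything else stays supervised: the agent proposes, a human approves.
    return Decision.HUMAN_REVIEW


# Hypothetical policy: only diagnostics collection is opted into full automation.
policy = frozenset({"collect_diagnostics"})

print(decide("collect_diagnostics", 0.95, policy))  # in scope, confident
print(decide("collect_diagnostics", 0.60, policy))  # in scope, low confidence
print(decide("rollback", 0.99, policy))             # out of scope
```

Defaulting to human review means a new or misspelled action name can never be auto-applied by accident.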

Safety Guardrails

Safety guardrails constrain automated remediation through explicit operational limits. These limits reduce the chance that a useful fix path becomes a larger production failure.

Six constraints govern automated remediation in production systems:

Idempotence: Automation must be safe to run multiple times without unintended side effects
Context-aware safety constraints: Do not auto-remediate during a major holiday traffic spike without human override
Rollback plans: Every automated action must include a defined rollback path before execution begins
Human-in-the-loop governance: SREs approve or audit critical remediation
Audit logs: When AI triggers remediation, the system maintains an audit trail
Escape hatch: Teams can abort automatic flow and revert to manual mode at any point
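Several of these constraints can be enforced by a single wrapper around every automated action. The sketch below covers idempotence, rollback plans, human approval, audit logs, and the escape hatch; it leaves out context-aware constraints such as traffic-spike checks. The `Remediation` type and the `execute` function are hypothetical and are not taken from any specific platform.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

AUDIT_LOG: List[dict] = []


@dataclass
class Remediation:
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]    # rollback path defined before execution
    is_applied: Callable[[], bool]  # lets the wrapper skip repeated runs


def execute(
    rem: Remediation, approved: bool, abort_requested: Callable[[], bool]
) -> str:
    """Run one remediation under the guardrails and record the outcome."""

    def audit(outcome: str) -> str:
        AUDIT_LOG.append({"action": rem.name, "outcome": outcome, "at": time.time()})
        return outcome

    if abort_requested():       # escape hatch: revert to manual mode
        return audit("aborted")
    if not approved:            # human-in-the-loop governance
        return audit("awaiting_approval")
    if rem.is_applied():        # idempotence: a second run has no effect
        return audit("already_applied")
    try:
        rem.apply()
    except Exception:
        rem.rollback()          # use the predefined rollback path
        return audit("rolled_back")
    return audit("applied")


state = {"replicas": 2}
scale_up = Remediation(
    name="scale_to_4",
    apply=lambda: state.update(replicas=4),
    rollback=lambda: state.update(replicas=2),
    is_applied=lambda: state["replicas"] == 4,
)
first = execute(scale_up, approved=True, abort_requested=lambda: False)
second = execute(scale_up, approved=True, abort_requested=lambda: False)
print(first, second, state["replicas"])
```

Every path through `execute` writes an audit entry, including the paths where nothing was changed, so reviewers can see declined and aborted attempts as well as applied ones.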

AI agents today are best positioned as supervised first-responders that handle a bounded subset of incidents autonomously and surface structured proposals for the rest.

How Cosmos's Incident Investigator Implements This Pipeline

Augment runs its own incident response on Cosmos, inside Slack, through an Expert called the Incident Investigator. A month-over-month internal measurement across five on-call channels showed substantial reductions in human on-call investigation effort.

Agent Architecture

Cosmos assigns incident-response work to specialized roles whose boundaries create clearer operational handoffs for supervised automation.

Cosmos coordinates a set of scoped agents under an Incident Coordinator: a triager, the Incident Investigator, a PR Author, a Slack coordinator, and an SRE.

Capability Contracts: Architectural Safety

Cosmos capability contracts restrict each agent to declared inputs, outputs, permission scope, and auditable execution paths. Those boundaries reduce operational risk for supervised remediation.

Cosmos defines operational controls through its platform design, granting each agent only scoped access. Teams may grant an incident-response Expert rollback or restore capabilities, but they should not assume that this scope covers unrelated operations such as dropping a database table unless they define and audit that boundary.

Teams can reuse an Expert wrapped in a capability contract across services, retire it without breaking other workflows, and audit each invocation.
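The general idea of a capability contract can be shown with a small allowlist check. This is a generic sketch and not Cosmos's API: `CapabilityContract`, `invoke`, and the action names are hypothetical. It shows a declared scope being enforced, with every invocation recorded for audit, including the denied ones.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CapabilityContract:
    agent: str
    allowed_actions: FrozenSet[str]


class ScopeError(PermissionError):
    """Raised when an agent requests an action outside its contract."""


INVOCATIONS: List[dict] = []


def invoke(contract: CapabilityContract, action: str, target: str) -> str:
    """Check the contract, record the attempt, then run or refuse the action."""
    allowed = action in contract.allowed_actions
    INVOCATIONS.append(
        {"agent": contract.agent, "action": action, "target": target, "allowed": allowed}
    )
    if not allowed:
        raise ScopeError(f"{contract.agent} is not permitted to {action}")
    return f"{action} executed on {target}"


investigator = CapabilityContract(
    agent="incident-investigator",
    allowed_actions=frozenset({"rollback_deploy", "restore_snapshot"}),
)

print(invoke(investigator, "rollback_deploy", "checkout"))
try:
    invoke(investigator, "drop_table", "orders")
except ScopeError as err:
    print("blocked:", err)
```

Even if the agent's analysis is wrong, the worst it can do is bounded by the actions listed in its contract.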

Cross-Expert Memory

Cosmos shared memory reduces duplicate investigation effort by preserving findings from human and agent interactions across Experts and future incidents.

Cosmos shares memory between the Incident Investigator and other Experts, including Code Review, so learnings from incident investigations propagate to code review processes. Knowledge from every incident compounds for future incidents.

Integration and Customization

The Cosmos Advisor sets up and customizes an Incident Investigator for your stack from a natural-language prompt such as "Set up an incident investigator expert for me." The Expert itself reacts to PagerDuty alerts inside Slack and gathers evidence from logs, metrics, deploys, and GitHub, so teams already on those tools have less to wire up.

Teams can swap in different observability stacks while preserving the same triage, investigation, and escalation flow, so Cosmos works within an existing toolchain with no migration required.

Measured Outcome

Cosmos's Incident Investigator reports an approximately 81% reduction in human on-call investigation effort in Augment's own internal use.

Before deployment, a typical engineer spent about 30 minutes actively triaging a single incident. Augment measured the change by comparing the share of incidents handled by humans versus agents for a month before and after rolling the Expert out across five on-call channels: agents went from handling 0.4% of incidents to 81.3%.
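The source does not spell out how the roughly 81% figure follows from these shares. One reading, assuming every incident is handled either by a human or by an agent, is to compare the human-handled share before and after rollout:

```latex
\begin{align*}
h_{\text{before}} &= 100\% - 0.4\% = 99.6\% \\
h_{\text{after}}  &= 100\% - 81.3\% = 18.7\% \\
\text{relative reduction} &= \frac{h_{\text{before}} - h_{\text{after}}}{h_{\text{before}}}
  = \frac{99.6 - 18.7}{99.6} \approx 0.812
\end{align*}
```

Under that assumption the human-handled share falls by about 81%, which is close to the 81.3% agent-handled share itself, so either reading is consistent with the headline number.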

(These figures come from Augment's own internal use across its teams and have not been independently audited. They are not customer outcomes.)

Augment Cosmos supports cloud-agent workflows with Augment Code's SOC 2 Type II attestation and ISO/IEC 42001 certification. It pairs those attestations with scoped permissions, audit trails, and human-in-the-loop controls.

Deploy Supervised Agents Before Your Next On-Call Rotation

Start with supervised agents on one alert class or one low-risk remediation path before the next on-call rotation. That narrow scope lets teams measure how much manual triage time disappears before widening automation.

A narrow starting point is to use agents first for alert noise reduction, evidence gathering, and structured remediation proposals. Teams can expand autonomy after they define approval boundaries, rollback paths, and audit expectations.

Cosmos capability contracts constrain action scope, permissions, and auditability before any automation runs.

Written by

Paula Hingel
