As coding and code review become increasingly automated in AI-native organizations, one of the next major bottlenecks is incident management. On-call engineers are often unable to contribute meaningfully to feature development during their rotation, and in small, fast-moving teams, the operational load becomes a direct drag on engineering velocity.
Even with modern observability tooling, incident response still requires engineers to reconstruct operational context under pressure: correlating deploys, searching Slack threads, checking dashboards, identifying ownership, and rediscovering previous incidents. When an alert fires, engineers typically jump between PagerDuty, Slack, dashboards, logs, metrics, GitHub, and prior incidents trying to answer a few core questions:
- What actually broke?
- Did a recent deploy cause this?
- Who owns the affected service?
- Is this a known issue?
- What should we do next?
Most of this is repetitive investigative work.
Our goal is to have agents drive the repetitive investigation of incident management while pulling humans in primarily for judgment, prioritization, and remediation decisions.
This article covers how we use the Augment Cosmos platform to automate large parts of incident response directly inside Slack, resulting in an 81% reduction in human on-call investigation effort.
Cosmos
Earlier this year, we rolled out Cosmos internally: our operating system for agentic software development. Cosmos is purpose-built for automating engineering workflows with long-running experts that can work across your SDLC, collaborate with humans, connect to your tools, and continuously improve over time. Each Cosmos automation comes in the form of an Expert, which has its own prompt, integrations, environment, secrets, event triggers, subscriptions, worker experts, and more. This blog focuses on a Cosmos Expert for incident management.
Why incident response had become a bottleneck
After solving the coding and code review bottlenecks a few months ago (via Cosmos), Augment engineers started moving significantly faster on feature development. But on-call was still a major operational bottleneck.
In practice, on-call rotations consistently pulled engineers away from roadmap work for alert triage and incident investigation. This was especially painful in our small, fast-moving teams (typically 2–5 engineers), where newer on-call engineers often escalated incidents to senior engineers for additional context or validation.
Even when alerts turned out to be transient failures, noisy false positives, or incidents that auto-resolved quickly, the investigation still consumed substantial engineering time. A typical engineer spent 30 minutes actively triaging an incident, and newer engineers took even longer. On-call engineers were interrupted 5 times/day, and senior engineers were pulled into 20% of alerts.
That was the bottleneck we wanted to remove: not human judgment, but the repetitive investigation required before humans could make good decisions.
How we run incident investigation now
Our primary operational expert is the Incident Investigator. Slack is the natural surface for this expert because it has effectively become the operational control plane for incident response.

The Incident Investigator runs triage and root cause analysis on every alert, then routes to one of four remediation paths. Humans step in to review the RCA, ask follow-up questions, and approve the action.
The Incident Investigator operates directly inside incident Slack channels. It reacts to PagerDuty alerts, performs structured investigations, and posts an initial RCA (Root Cause Analysis) and a recommended remediation action (code-fix, rollback, escalate, or monitor only) in-thread before a human has even looked at the alert. A human scans the RCA, optionally asks follow-up questions, and makes a judgment call on whether the remediation action is appropriate. This is the only point at which a human is involved, and for the average alert it takes less than a minute. When a code change is required, the Incident Investigator hands off the code-fix to a PR Author expert, which works with code review experts to autonomously drive the fix toward merge.
The investigator’s report is highly structured. The expert follows a fixed operational workflow covering triage, investigation, communication, remediation recommendation, and post-resolution summarization. The prompt defines operational procedures, escalation rules, scope boundaries, and constraints, while the LLM handles evidence gathering, hypothesis refinement, validation, and communication. In practice, this produces significantly more reliable operational behavior than either free-form agent loops or human on-call engineers.
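To make the structured report and the four remediation paths concrete, here is a minimal Python sketch. It is not the Cosmos API or the actual Incident Investigator implementation: the `RcaReport` fields, the `next_step` function, and the returned step names are all hypothetical, chosen only to illustrate the flow described above (post an RCA, wait for human approval, hand code-fixes to another expert).

```python
from dataclasses import dataclass, field
from enum import Enum


class Remediation(Enum):
    """The four remediation paths named in the text."""
    CODE_FIX = "code-fix"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"
    MONITOR = "monitor"


@dataclass
class RcaReport:
    """Hypothetical shape of the initial RCA posted in-thread."""
    alert_id: str
    root_cause: str
    evidence: list[str] = field(default_factory=list)
    recommendation: Remediation = Remediation.MONITOR

    def to_thread_message(self) -> str:
        """Render the report as plain text for a thread reply."""
        lines = [
            f"Initial RCA for alert {self.alert_id}",
            f"Root cause: {self.root_cause}",
            "Evidence:",
            *[f"- {item}" for item in self.evidence],
            f"Recommended action: {self.recommendation.value}",
        ]
        return "\n".join(lines)


def next_step(report: RcaReport, human_approved: bool) -> str:
    """Decide what happens after the RCA has been posted.

    Nothing proceeds without human approval. Only a code-fix is handed
    to another expert; the step names returned here are illustrative.
    """
    if not human_approved:
        return "await-human-review"
    if report.recommendation is Remediation.CODE_FIX:
        return "handoff-to-pr-author"
    return f"proceed-{report.recommendation.value}"
```

In this sketch the human approval gate is checked before any routing, which mirrors the principle that agents investigate while humans own production-impacting decisions.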
Below is an example of the initial RCA and recommended fix:

Engineers sometimes continue the Slack thread with follow-up questions/requests for the Incident Investigator such as:
- “I think this alert will likely auto-resolve. Is that correct?”
- “does this affect tenants other than X?”
- “is this related to the previous alert from 30 minutes ago?”
- “what if we tune the threshold of the alert to 2 failures per hour?”
Finally, the Incident Investigator posts a resolution summary, captures key learnings in memory, and optionally writes a post-mortem report.
This creates a broader operational loop where agents:
- investigate failure root cause and impact
- propose remediations
- answer questions
- generate fixes
- write resolution summaries and post-mortems
- record learnings from human interactions
while humans remain responsible for judgment calls and production-impacting decisions.
Generating high quality RCAs
The most important step in getting the Incident Investigator to generate high quality analyses (typically higher quality than an average developer's) is to give it everything an on-call engineer would have access to.
Tools
The Incident Investigator gathers evidence from logs, metrics, recent deploys, GitHub history, code context, ownership mappings, and recent alerts on the same channel. We therefore need to ensure that the Cosmos environment (i.e. the VM) has the required tools installed to access logs and metrics (e.g. the gcloud CLI) and that authentication (e.g. access tokens) is set up via Cosmos secrets.
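One way to catch a misconfigured environment early is a preflight check that runs before any investigation starts. The sketch below is an illustration only: it assumes secrets are exposed to the environment as environment variables (the article does not say how Cosmos secrets are surfaced), and the `preflight` function and the variable name `GCP_ACCESS_TOKEN` in the usage note are hypothetical.

```python
import os
import shutil


def preflight(required_tools, required_secrets, env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `which` are injectable so the check can be exercised
    without touching the real machine.
    """
    env = os.environ if env is None else env
    problems = []
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"missing tool: {tool}")
    for name in required_secrets:
        # Treat unset and empty values the same way.
        if not env.get(name):
            problems.append(f"missing secret: {name}")
    return problems
```

For example, `preflight(["gcloud", "kubectl"], ["GCP_ACCESS_TOKEN"])` would report any CLI that is not installed and any token that is not set, so the expert can fail with a clear message instead of producing an RCA built on missing evidence.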
Context
Once the Incident Investigator can access logs and metrics, it also needs to know what kinds of queries to make. We use Agent Skills (see the Agent Skills documentation) for logs and metrics analysis. These skills are internal to Augment today. They live in our repo and define:
- how to query observability systems
- operational constraints
- common query patterns
- environment mappings
- debugging workflows
For example, our metrics skill wraps GCP Managed Prometheus and exposes structured PromQL querying patterns for request rates, error rates, pod restarts, and deployment health. Similarly, our logs skill wraps GCP Cloud Logging and kubectl workflows for querying production and staging logs, correlating events across pods, and reconstructing incident timelines. Because this knowledge lives in skills, you can customize investigation behavior, add organization-specific workflows, or swap in alternative observability stacks entirely. Another common pattern is ‘Runbooks as code’ (i.e. incident runbooks stored in the codebase), which some teams at Augment use for team-specific behaviors.
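As an illustration of what "structured PromQL querying patterns" could look like, here is a small Python sketch of named query templates. This is not Augment's internal metrics skill: the `QUERY_PATTERNS` table and `build_query` helper are hypothetical, and the metric and label names (`http_requests_total`, `service`, `code`, `kube_pod_container_status_restarts_total`) are assumptions that depend on how your services and cluster are instrumented.

```python
import re

# Hypothetical query patterns. Doubled braces are literal braces in
# str.format templates; single-brace names are parameters.
QUERY_PATTERNS = {
    "request_rate": (
        'sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    ),
    "error_rate": (
        'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[{window}]))'
        ' / sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    ),
    "pod_restarts": (
        "sum(increase(kube_pod_container_status_restarts_total"
        '{{namespace="{namespace}"}}[{window}]))'
    ),
}

# Restrict parameter values so they cannot break out of a label matcher.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_.\-]+")


def build_query(pattern: str, **params: str) -> str:
    """Fill a named pattern with validated parameters."""
    if pattern not in QUERY_PATTERNS:
        raise ValueError(f"unknown pattern: {pattern}")
    for key, value in params.items():
        if not _SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for {key}: {value!r}")
    return QUERY_PATTERNS[pattern].format(**params)
```

Calling `build_query("request_rate", service="api", window="5m")` yields `sum(rate(http_requests_total{service="api"}[5m]))`. Keeping the patterns in a named table gives the agent a small, reviewable vocabulary of queries rather than asking it to compose PromQL from scratch on every alert.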
Memory
The final piece that really gets the Incident Investigator operating at the level of your senior engineers is memory. As the expert interacts with humans, it records important tribal knowledge that isn’t documented anywhere, and over time this fills in the knowledge gaps that humans failed to record in Agent Skills or Runbooks (because, let’s be honest, many teams drop the ball on documentation).
Another benefit of adopting Cosmos for your SDLC: Memory is shared between Incident Investigators and other experts (such as Code Review), meaning that learnings from one will propagate to the other, resulting in higher overall software quality.
Operational Impact
We analyzed our incident response data for a month before and after deploying the Incident Investigator across five on-call channels, and we summarize the effect on two key metrics: reduction in developer effort on incident response and time to resolution.
On-call developer effort for incident response:
The shift in who does the incident response is the most striking:

Agents now handle 81.3% of incidents, up from 0.4% before deployment.
Before Cosmos, nearly every incident had a human doing the initial investigation and coming up with remediation actions (some pulled in interactive coding agents to help investigate different facets, but most of the analysis was done by humans). After deployment, fewer than one in five did: an ~81% reduction in human on-call work. The freed developer time translates into higher velocity: our data shows a 44% increase in merged PRs/week for on-call engineers.
Faster RCA means faster resolution:
After deploying Cosmos, median time-to-first-RCA fell from 30.1 minutes to 6.2 minutes. The bot can begin working immediately, whereas humans incur context-switching latency. This significantly affects how long it takes to reach a final resolution: median time to resolution (MTTR) dropped from 29.5 to 19.9 minutes. (MTTR can be lower than time-to-first-RCA because alerts on transient errors auto-resolve even before an RCA is complete.)

Time to first RCA fell from 30 minutes to 6, and full thread resolution dropped by a third.
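The headline percentages can be reproduced from the figures quoted in this section. The short Python sketch below does the arithmetic; the one assumption is that the human-led share of incidents is 100% minus the agent-handled share, which is how the "~81% reduction" is interpreted here.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Median time-to-first-RCA: 30.1 min before, 6.2 min after.
rca_drop = round(pct_reduction(30.1, 6.2), 1)      # 79.4

# Median time to resolution: 29.5 min before, 19.9 min after.
mttr_drop = round(pct_reduction(29.5, 19.9), 1)    # 32.5, roughly a third

# Human-led share of incidents, assumed to be 100 minus the agent share
# (agents handled 0.4% before and 81.3% after).
human_drop = round(pct_reduction(100 - 0.4, 100 - 81.3), 1)  # 81.2

print(rca_drop, mttr_drop, human_drop)
```

Under that assumption the numbers are consistent with the text: time-to-first-RCA falls by about 79%, resolution time by about a third, and human-led investigation by about 81%.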
What the metrics don’t capture:
It is difficult to quantify RCA quality, but internal surveys show that, on average, the Incident Investigator’s RCA is correct more often than an on-call engineer’s. This is because it consistently executes all the steps of an incident runbook and performs a deep analysis on every single alert, while the quality of a human engineer’s RCA varies drastically across individuals.
How you can adopt this
One of the goals behind Cosmos is making out-of-the-box workflows like incident management easy to adopt, so that users don’t have to reinvent the wheel of agent orchestration or hill-climb on agent quality.
The same workflow we use internally can be created directly through the Cosmos Advisor with prompts like:
“Set up an incident investigator expert for me.”
While our production setup uses Slack, PagerDuty, GCP Cloud Logging, Managed Prometheus, and GitHub, the architecture itself is intentionally modular. Different observability stacks can be swapped in while preserving the same overall operational workflow, and the Cosmos Advisor will guide you through customizing the expert for your observability stack.
The important part is not the specific vendor tooling. It is building operational experts that can gather evidence systematically, operate within bounded scope, preserve context, and lead the incident response, while pulling in humans for judgment calls.
Written by

Akshay Utture
Akshay Utture builds intelligent agents that make software development faster, safer, and more reliable. At Augment Code, he leads the engineering behind the company’s AI Code Review agent, bringing research-grade program analysis and modern GenAI techniques together to automate one of the most time-consuming parts of the SDLC. Before Augment, Akshay spent several years at Uber advancing automated code review and repair systems, and conducted research across AWS’s Automated Reasoning Group, Google’s Android Static Analysis team, and UCLA. His work sits at the intersection of AI, software engineering, and programming-language theory.

Sophie Reynolds
Sophie Reynolds is a Software Engineer at Augment Code. She joined full-time in early 2025 after a Codepoint Fellowship with Sutter Hill Ventures, which placed her in six-month rotations at Observe and Augment. She holds a Bachelor's in Computer Science from MIT.

Sam Chow
Sam Chow is a Member of Technical Staff at Augment Code, where he works on building AI experts to handle everything in the SDLC from ticket to code review. He joined from Atlassian, where he led growth and onboarding experimentation for Loom. Before that, he worked across the stack at DoorDash, Mixpanel, CoreOS, and OpenDNS.
