Scaling incident management for an AI-native organization using Cosmos

As coding and code review become increasingly automated in AI-native organizations, one of the next major bottlenecks is incident management. On-call engineers are often unable to contribute meaningfully to feature development during their rotation, and in small, fast-moving teams, the operational load becomes a direct drag on engineering velocity.

Even with modern observability tooling, incident response still requires engineers to reconstruct operational context under pressure: correlating deploys, searching Slack threads, checking dashboards, identifying ownership, and rediscovering previous incidents. When an alert fires, engineers typically jump between PagerDuty, Slack, dashboards, logs, metrics, GitHub, and prior incidents trying to answer a few core questions:

What actually broke?
Did a recent deploy cause this?
Who owns the affected service?
Is this a known issue?
What should we do next?

Most of this is repetitive investigative work.

Our goal is to have agents drive the repetitive investigation in incident management while pulling humans in primarily for judgment, prioritization, and remediation decisions.

This article covers how we use the Augment Cosmos platform to automate large parts of incident response directly inside Slack, resulting in an 81% reduction in human on-call investigation effort.

Cosmos

Earlier this year, we rolled out Cosmos internally: our operating system for agentic software development. Cosmos is purpose-built for automating engineering workflows with long-running experts that can work across your SDLC, collaborate with humans, connect to your tools, and continuously improve over time. Each Cosmos automation comes in the form of an Expert, which has its own prompt, integrations, environment, secrets, event triggers, subscriptions, worker experts, and more. This blog focuses on a Cosmos Expert for incident management.

Why incident response had become a bottleneck

After solving the coding and code review bottlenecks a few months ago (via Cosmos), Augment engineers started moving significantly faster on feature development. But on-call was still a major operational bottleneck.

In practice, on-call rotations consistently pulled engineers away from roadmap work for alert triage and incident investigation. This was especially painful in our small, fast-moving teams (typically 2–5 engineers), where newer on-call engineers often escalated incidents to senior engineers for additional context or validation.

Even when alerts turned out to be transient failures, noisy false positives, or incidents that auto-resolved quickly, the investigation still consumed substantial engineering time. A typical engineer spent 30 minutes actively triaging an incident, and newer engineers took even longer. On-call engineers were interrupted 5 times a day, and senior engineers were pulled into 20% of alerts.

That was the bottleneck we wanted to remove: not human judgment, but the repetitive investigation required before humans could make good decisions.

How we run incident investigation now

Our primary operational expert is the Incident Investigator. Slack is the natural surface for this expert because it has effectively become the operational control plane for incident response.

Workflow diagram showing how Augment's Incident Investigator Expert handles a PagerDuty alert. The alert posts in a team Slack channel and triggers the Incident Investigator, which triages, investigates, and correlates logs, metrics, deploys, and past incidents to find root cause and recommend actions. It posts an RCA with evidence and next steps in the Slack thread. A human reviews the findings, asks follow-up questions, provides additional context, and approves a remediation path. Four paths branch from the RCA: Monitor (alert was a false positive or auto-resolved, no action needed), PR Author Expert (create a PR to fix the issue or tune the alert), Rollback (revert deployment or configuration to a previous good state), and Escalate (hand off to the right team or service owner). All paths converge on a Resolution and Post-Incident Summary, where the Investigator summarizes the resolution in-thread and captures key learnings. A legend in the corner color-codes the steps by category: AI-driven investigation, human judgment, remediation, operational actions, and monitor.

The Incident Investigator runs triage and root cause analysis on every alert, then routes to one of four remediation paths. Humans step in to review the RCA, ask follow-up questions, and approve the action.

The Incident Investigator operates directly inside incident Slack channels. It reacts to PagerDuty alerts, performs structured investigations, and posts an initial RCA (Root Cause Analysis) and a recommended remediation action in-thread (code-fix, rollback, escalate, or only monitor) before a human has even looked at the alert. A human scans the RCA, optionally asks follow-up questions, and makes a judgment call on whether the remediation action is appropriate. This is the only place a human is involved, and for the average alert it takes less than a minute. When a code change is required, the Incident Investigator hands off the code-fix to a PR Author expert, which works with code review experts to autonomously drive the fix toward merge.

The investigator’s report is highly structured. The expert follows a fixed operational workflow covering triage, investigation, communication, remediation recommendation, and post-resolution summarization. The prompt defines operational procedures, escalation rules, scope boundaries, and constraints, while the LLM handles evidence gathering, hypothesis refinement, validation, and communication. In practice, this produces significantly more reliable operational behavior than either free-form agent loops or human on-call engineers.
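To make the shape of this workflow concrete, here is a minimal sketch of a structured RCA and the routing across the four remediation paths. It is not the Cosmos implementation: the `RcaReport` fields, the `next_step` function, and the returned strings are hypothetical names chosen for illustration. Only the four paths and the human approval gate come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Remediation(Enum):
    """The four remediation paths named in the workflow above."""
    MONITOR = "monitor"
    CODE_FIX = "code-fix"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"


@dataclass
class RcaReport:
    """Hypothetical shape of an RCA posted in the Slack thread."""
    alert_id: str
    root_cause: str
    evidence: List[str] = field(default_factory=list)
    recommendation: Remediation = Remediation.MONITOR
    approved_by: Optional[str] = None  # set once a human signs off


def next_step(report: RcaReport) -> str:
    """Return the next action for a report; nothing proceeds without approval."""
    if report.approved_by is None:
        return "await human review"
    handlers = {
        Remediation.MONITOR: "keep watching; no action needed",
        Remediation.CODE_FIX: "hand off to PR Author expert",
        Remediation.ROLLBACK: "revert to previous good state",
        Remediation.ESCALATE: "hand off to service owner",
    }
    return handlers[report.recommendation]
```

The design point the sketch tries to capture is that the recommendation is data, not an action: the agent fills in the report, and the approval field is the single place where human judgment enters.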

Below is an example of the initial RCA and recommended fix:

Engineers sometimes continue the Slack thread with follow-up questions/requests for the Incident Investigator such as:

“I think this alert will likely auto-resolve. Is that correct?”
“does this affect tenants other than X?”
“is this related to the previous alert from 30 minutes ago?”
“what if we tune the threshold of the alert to 2 failures per hour?”

Finally, the Incident Investigator posts a resolution summary, captures key learnings in memory, and optionally writes a post-mortem report.

This creates a broader operational loop where agents:

investigate failure root cause and impact
propose remediations
answer questions
generate fixes
write resolution summaries and post-mortems
record learnings from human interaction

while humans remain responsible for judgment calls and production-impacting decisions.

Generating high quality RCAs

The most important factor in getting the Incident Investigator to generate high-quality analyses (typically higher quality than an average developer's) is to give it everything an on-call engineer would have.

Tools

The Incident Investigator gathers evidence from logs, metrics, recent deploys, GitHub history, code context, ownership mappings, and recent alerts in the same channel. Reaching all of that means the Cosmos environment (the VM the expert runs on) needs the right CLIs installed and authenticated through Cosmos secrets: the gcloud CLI for logs and metrics, and kubectl access to the production clusters for live pod health, restarts, and rollout state. Each input the Investigator can't reach is a class of root cause it can't find, so the goal is simply to match what an on-call engineer has at hand.
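A simple preflight check can catch a missing tool before the first alert arrives. The sketch below is an illustration, not part of Cosmos: `missing_clis` is a hypothetical helper that only checks whether the two CLIs named above are on `PATH`, using the standard library's `shutil.which`. It does not verify authentication or cluster access.

```python
import shutil

REQUIRED_CLIS = ("gcloud", "kubectl")  # the two CLIs named above


def missing_clis(required=REQUIRED_CLIS):
    """Return the names in `required` that are not found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_clis()
    if missing:
        print("Investigator environment is missing:", ", ".join(missing))
    else:
        print("All required CLIs are on PATH")
```

Running a check like this when the environment is built makes an unreachable input visible up front, rather than surfacing as an RCA that silently lacks evidence.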

Context

Now that the Incident Investigator can access logs, metrics, and so on, it also needs to know what kinds of queries to make. We use Agent Skills (see the Agent Skills documentation) for logs and metrics analysis. These skills are internal to Augment today. They live in our repo and define:

how to query observability systems
operational constraints
common query patterns
environment mappings
debugging workflows

For example, our metrics skill wraps GCP Managed Prometheus and exposes structured PromQL querying patterns for request rates, error rates, pod restarts, and deployment health. Similarly, our logs skill wraps GCP Cloud Logging and kubectl workflows for querying production and staging logs, correlating events across pods, and reconstructing incident timelines. Keeping this logic in skills lets you customize investigation behavior, add organization-specific workflows, or swap in alternative observability stacks entirely. A common pattern is also ‘Runbooks as code’ (i.e. incident runbooks stored in the codebase), and some teams at Augment use them for team-specific behaviors.
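As an illustration of what "structured PromQL querying patterns" can look like, the sketch below builds two query strings. These are not Augment's skills: the function names are hypothetical, `http_requests_total` with `service` and `code` labels is an assumed metric naming convention, and `kube_pod_container_status_restarts_total` assumes kube-state-metrics is being scraped. Substitute whatever your own stack exposes.

```python
def error_rate_query(service: str, window: str = "5m") -> str:
    """PromQL for the fraction of requests returning 5xx for one service."""
    # Assumed metric and label names; adjust to your instrumentation.
    errors = f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    return f"{errors} / {total}"


def pod_restarts_query(namespace: str, window: str = "1h") -> str:
    """PromQL for container restarts per pod in a namespace over `window`."""
    # Assumes kube-state-metrics is installed in the cluster.
    return (
        "sum by (pod) (increase("
        f'kube_pod_container_status_restarts_total{{namespace="{namespace}"}}[{window}]))'
    )


if __name__ == "__main__":
    print(error_rate_query("checkout"))
    print(pod_restarts_query("prod"))
```

Encoding patterns as small parameterized builders like these gives an agent a fixed vocabulary of known-good queries, so it fills in a service or namespace instead of composing PromQL from scratch.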

Memory

The final piece that really gets the Incident Investigator operating at the level of your senior engineers is memory. As the expert interacts with humans, it records important tribal knowledge that isn’t documented anywhere, and over time this fills in the knowledge gaps that humans missed recording in Agent Skills or Runbooks (because, let’s be honest, many teams drop the ball on documentation).

Another benefit of adopting Cosmos for your SDLC is that memory is shared between Incident Investigators and other experts (such as Code Review), so learnings from one propagate to the other, resulting in higher overall software quality.

Operational Impact

We analyzed our incident response data for a month before and after deploying the Incident Investigator across five on-call channels, and we summarize the effect on two key metrics: reduction in developer effort on incident response and time to resolution.

On-call developer effort for incident response:

The shift in who does the incident response is the most striking:

Stacked horizontal bar chart showing the share of incidents handled manually versus by agents. Before deploying the agent team: 99.6 percent handled manually, 0.4 percent by agents. After: 18.7 percent handled manually, 81.3 percent by agents.

Agents now handle 81.3% of incidents, up from 0.4% before deployment.

Before Cosmos, nearly every incident had a human doing the initial investigation and coming up with remediation actions (some pulled in interactive coding agents to help investigate different facets, but most of the analysis was done by humans). After, fewer than one in five did: an ~81% reduction in human on-call work. The freed developer time translates into higher velocity: our data shows a 44% increase in merged PRs per week for on-call engineers.

Faster RCA means faster resolution:

After deploying Cosmos, median time-to-first-RCA fell from 30.1 minutes to 6.2 minutes. The Investigator can begin working immediately, whereas humans have context-switching latency. This significantly affects how long it takes to reach a final resolution: median time to resolution (MTTR) dropped from 29.5 to 19.9 minutes. (MTTR can be lower than time-to-first-RCA because alerts on transient errors can auto-resolve before an RCA is complete.)

Grouped vertical bar chart showing median resolution times before and after deploying the incident management agent team. Median time to first root-cause analysis: 30.1 minutes before, 6.2 minutes after. Median time to full thread resolution: 29.5 minutes before, 19.9 minutes after.

Time to first RCA fell from 30 minutes to 6, and full thread resolution dropped by a third.
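The relative changes quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the before and after figures stated above; `pct_reduction` is a helper name introduced here for illustration.

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Figures quoted in this section.
manual_share = pct_reduction(99.6, 18.7)        # % of incidents handled manually
time_to_first_rca = pct_reduction(30.1, 6.2)    # median minutes
time_to_resolution = pct_reduction(29.5, 19.9)  # median minutes

if __name__ == "__main__":
    print(f"manual handling:    {manual_share:.1f}% lower")
    print(f"time to first RCA:  {time_to_first_rca:.1f}% lower")
    print(f"time to resolution: {time_to_resolution:.1f}% lower")
```

These work out to roughly 81%, 79%, and 33% respectively, which matches the "~81% reduction" and "dropped by a third" statements in the text.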

What the metrics don’t capture:

It is difficult to quantify RCA quality, but internal surveys show that on average the Incident Investigator’s RCA is correct more often than an on-call engineer’s. This is because it consistently executes all the steps of an incident runbook and performs a deep analysis on every single alert, while a human engineer’s RCA quality varies drastically across individuals.

How you can adopt this

One of the goals behind Cosmos is making out-of-the-box workflows like incident management easy to adopt, so that users don’t have to reinvent the wheel of agent orchestration or hill-climb on agent quality.

The same workflow we use internally can be created directly through the Cosmos Advisor with prompts like:

“Set up an incident investigator expert for me.”

While our production setup uses Slack, PagerDuty, GCP Cloud Logging, Managed Prometheus, and GitHub, the architecture itself is intentionally modular. Different observability stacks can be swapped in while preserving the same overall operational workflow, and the Cosmos Advisor will guide you through customizing the expert for your observability stack.

The important part is not the specific vendor tooling. It is building operational experts that can gather evidence systematically, operate within bounded scope, preserve context, and lead the incident response, while pulling in humans for judgment calls.

Written by

Akshay Utture

Sophie Reynolds

Sam Chow
