
Incident Management

When agents write and review the code, on-call becomes the bottleneck.

Modern observability still makes engineers reconstruct context under pressure.

With Cosmos, agents drive the repetitive investigative work of incident management and pull humans in primarily for judgment, prioritization, and remediation decisions.

# incidents-prod
PagerDuty · App · 9:52 AM

Condition “Completions API, HTTP 5xx rate” was over the threshold for the last 5m

💨 Urgency: Low

🌐 Service: Completions API

Augment-Staging · App · 9:53 AM

🔍 investigating

Incident Investigator · View report

9:58

Incident Summary:

Completions API streaming began returning HTTP 500s at 9:42, ~2 min after deploy b4e2f7. Error rate rose from ~0.1% to 4.2%.

Likely Cause:

Deploy b4e2f7 cut the upstream gRPC deadline 30s → 5s; slow responses now exceed it.

Recommended action: Rollback · deploy b4e2f7
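To make the likely cause in the thread above concrete, here is a minimal sketch of why cutting a deadline from 30s to 5s turns slow responses into errors. The function name and the latency samples are hypothetical and purely illustrative; they are not data from the incident.

```python
from typing import Sequence


def deadline_exceeded_rate(latencies_s: Sequence[float], deadline_s: float) -> float:
    """Fraction of calls whose upstream latency exceeds the client deadline."""
    if not latencies_s:
        return 0.0
    exceeded = sum(1 for latency in latencies_s if latency > deadline_s)
    return exceeded / len(latencies_s)


# Hypothetical upstream latencies in seconds (illustrative, not measured data).
sample_latencies = [0.8, 1.2, 2.5, 4.9, 6.0, 3.1, 0.4, 12.0, 1.9, 2.2]

rate_at_30s = deadline_exceeded_rate(sample_latencies, 30.0)  # deadline before the deploy
rate_at_5s = deadline_exceeded_rate(sample_latencies, 5.0)    # deadline after the deploy

print(f"30s deadline: {rate_at_30s:.0%} of calls exceed it")
print(f"5s deadline:  {rate_at_5s:.0%} of calls exceed it")
```

With these made-up samples, no call exceeds the old deadline while two of ten exceed the new one, which is the shape of failure the RCA describes.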

Meet Cosmos

An expert on every alert.

The Incident Investigator triages every alert, correlates the evidence, and posts a structured RCA in Slack before a human has even looked. You stay in the loop only for the judgment calls.

Measured on Augment’s own on-call

  • 81% less human on-call investigation
  • 6.2 min median time to first RCA
  • 44% more PRs merged by on-call engineers
  • 19.9 min median time to resolution

A fleet of experts takes ownership of incident response.

Cosmos orchestrates the Incident Investigator across triage, evidence-gathering, root-cause analysis, and remediation, handing code-fixes to a PR Author that works with the review experts. Each loop is a Cosmos Expert, and they all share memory.

Incident lifecycle: one alert, end to end
Triage

Incident Investigator

Reacts to the PagerDuty alert in-thread, then classifies urgency, scope, and the affected service.

Investigate

Evidence Gathering

Pulls logs, metrics, recent deploys, GitHub history, ownership, and prior incidents on the channel.

Root cause

Root-Cause Analysis

Forms and validates hypotheses, then posts a structured RCA: summary, cause, evidence, owners.

Remediate

Remediation

Recommends a code-fix, rollback, escalation, or monitor action, then hands code-fixes to the PR Author.

Human

Scans the RCA and makes the call in under a minute for the average alert.

RCA posted · incident resolved

Incident Memory

Captures tribal knowledge · Distills runbooks · Shared with Code Review

Fig 1 · Incident response fleet
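The lifecycle in Fig 1 can be pictured as a small data model. The sketch below is illustrative only: the class names, field names, and `handoff` function are assumptions made for this example and are not Cosmos's actual API. The RCA fields (summary, cause, evidence, owners) and the four recommended actions come from the stage descriptions above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Action(Enum):
    """The four remediation recommendations named in the Remediate stage."""
    CODE_FIX = "code-fix"
    ROLLBACK = "rollback"
    ESCALATION = "escalation"
    MONITOR = "monitor"


@dataclass
class RCA:
    """Structured RCA: summary, cause, evidence, owners."""
    summary: str
    cause: str
    evidence: List[str] = field(default_factory=list)
    owners: List[str] = field(default_factory=list)


@dataclass
class Recommendation:
    action: Action
    target: str


def handoff(rec: Recommendation) -> str:
    """Hypothetical routing: code fixes go to the PR Author, the rest to a human.

    Simplification for illustration; humans still approve the remediation.
    """
    return "PR Author" if rec.action is Action.CODE_FIX else "human on-call"


# Values taken from the example Slack thread; owners are not given there.
rca = RCA(
    summary="Completions API streaming began returning HTTP 500s at 9:42",
    cause="Deploy b4e2f7 cut the upstream gRPC deadline 30s -> 5s",
    evidence=["Error rate rose from ~0.1% to 4.2%"],
)
rec = Recommendation(Action.ROLLBACK, "deploy b4e2f7")
print(handoff(rec))
```

Modeling the RCA as a fixed structure is what lets a human scan it quickly: the same four fields appear in the same order for every alert.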

Where humans stay in the loop

Agents investigate. Humans decide.

The bottleneck was never judgment. It was the repetitive investigation before anyone could make a call. Cosmos drives that work and pulls you in only for the decisions that touch production.

Driven by agents

The repetitive investigation

  • Correlating deploys, logs, and metrics
  • Searching Slack threads and dashboards
  • Finding ownership and prior incidents
  • Drafting the RCA and next steps
Owned by humans

The judgment calls

  • Prioritization under pressure
  • Approving the remediation
  • Production-impacting decisions
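As one illustration of the agent-driven work listed above, correlating deploys with an error onset can be as simple as finding the most recent deploy inside a lookback window. This is a sketch under stated assumptions, not how Cosmos implements it: the function name, the 15-minute window, the date, and the second deploy ID are all hypothetical. The 9:42 onset and deploy b4e2f7 roughly two minutes earlier mirror the example thread.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Deploy = Tuple[str, datetime]  # (deploy id, time it went live)


def latest_deploy_before(
    onset: datetime,
    deploys: List[Deploy],
    window: timedelta = timedelta(minutes=15),  # assumed lookback window
) -> Optional[Deploy]:
    """Return the most recent deploy that landed within `window` before `onset`."""
    candidates = [d for d in deploys if timedelta(0) <= onset - d[1] <= window]
    return max(candidates, key=lambda d: d[1], default=None)


# The date and the deploy "a1c9d0" are made up for illustration.
day = datetime(2024, 1, 1)
deploys = [
    ("a1c9d0", day.replace(hour=8, minute=5)),
    ("b4e2f7", day.replace(hour=9, minute=40)),
]
onset = day.replace(hour=9, minute=42)

suspect = latest_deploy_before(onset, deploys)
print(suspect[0] if suspect else "no recent deploy")
```

A time-window match like this only nominates a suspect. Confirming it still takes the diff, logs, and metrics, which is why the RCA lists evidence alongside the cause.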

Measured on our own on-call.

We deployed the Incident Investigator across five on-call channels and compared the month before deployment with the month after. Human investigation dropped, RCAs landed faster, and on-call engineers got their focus back.

Incident ownership [fig. 01 / ownership]
Share of incidents handled by agents: manual vs. agent-led, before and after
  • Before: 0.4% agents; 99.6% handled manually.
  • After: 81.3% agents; 18.7% still needed manual handling.

Time to resolve [fig. 02 / speed]
Median time to resolve: time to first RCA and MTTR, in minutes
  • First RCA: 6.2 min, down from 30.1 minutes before Cosmos.
  • MTTR: 19.9 min, down from 29.5 minutes before Cosmos.

Human effort [fig. 03 / effort]
Human time per incident: active triage time, before and after
  • Before: 30 min of active human triage per incident.
  • After: <1 min of active human triage per incident.
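For readers who want the relative changes behind these figures, the short sketch below derives them from the before and after values reported above. The function name is an assumption for this example; the inputs are the published numbers.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


first_rca = percent_reduction(30.1, 6.2)  # median time to first RCA, minutes
mttr = percent_reduction(29.5, 19.9)      # median time to resolution, minutes
agent_share_gain = 81.3 - 0.4             # percentage points of incidents handled by agents

print(f"Time to first RCA: {first_rca:.1f}% lower")                  # 79.4% lower
print(f"MTTR: {mttr:.1f}% lower")                                    # 32.5% lower
print(f"Agent-handled share: +{agent_share_gain:.1f} points")        # +80.9 points
```

Time to first RCA improves far more than MTTR, consistent with the claim that the investigation, not the decision, was the bottleneck.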
Slack · PagerDuty · GCP Logging · Prometheus · SAML / OIDC / SCIM

Highly customizable to your stack.

Talk to Cosmos Advisor to tailor the Incident Investigator: runbooks, escalation rules, scope boundaries, and the queries it runs. Swap in your own observability stack while keeping the same operational workflow.

Slack and PagerDuty. GCP Cloud Logging and Managed Prometheus. GitHub. SAML, OIDC, SCIM. Audit logs and SIEM, with self-hosted deployment available.